# ROC true positive rate remains at 0 for some time before going up. Unusual.

Hi everyone,

I'm pretty new to data mining and RapidMiner, so take it easy on me.

I'm dealing with a binary classification problem where I'm trying to identify people at high risk for a certain condition. The class label is coded 1 = yes, 2 = no.

I'm using data sets of various sizes, averaging around 160,000 observations. The data set contains 22 attributes (nominal/polynominal/numerical) and the binominal class label described above. I'm comparing different classification algorithms for this problem, which are listed in the table below. All experiments used 5-fold cross-validation with a binominal classification performance operator to get the results.

THE PROBLEM

The J48 decision tree from the WEKA extension provides promising results, as seen in the results table below; however, the AUC does not seem correct. In the plot of the ROC curve, starting at the bottom-left corner of the chart, the true positive rate remains at 0 for a while as the false positive rate increases along the x-axis. At about 0.5 along the x-axis the true positive rate finally increases and eventually goes above the y = x line. This is clearly why the AUC suffers, but I do not know why it is happening, and it does not occur with any other algorithm. (All data has been preprocessed to remove missing values, and under-sampling has been applied along with some additional steps.)

If anyone knows why this could be occurring, your help would be greatly appreciated. Thank you.

| Algorithm | AUC | Sensitivity | Specificity | F-Measure | Accuracy |
|---|---|---|---|---|---|
| Logistic Regression (WEKA LR) | 0.715 | 65.70% | 65.28% | 65.56% | 65.49% |
| C4.5 Decision Tree (WEKA J48) | 0.678 | 67.99% | 63.58% | 66.52% | 65.78% |
| Random Forest (WEKA RF) | 0.704 | 63.89% | 65.17% | 64.30% | 64.53% |
| Support Vector Machine | 0.710 | 70.49% | 59.87% | 66.94% | 65.18% |
| Neural Network | 0.713 | 72.25% | 57.14% | 66.81% | 64.70% |
| Radial Basis Function Network | 0.654 | 62.96% | 59.07% | 61.67% | 61.01% |
| K-NN | 0.500 | 52.27% | 52.71% | 52.38% | 52.49% |
| Naïve Bayes | 0.689 | 59.14% | 68.41% | 62.01% | 63.77% |

## Answers

Normally this happens when a model assigns the top probabilities for the positive class to some examples that are actually in the negative class.
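A small sketch of this effect, using made-up labels and scores (not the poster's data): the two highest-scored examples are actually negatives, so as the threshold is lowered the curve first moves right along the x-axis while the true positive rate stays at 0. The function name `roc_points` and the toy data are assumptions for illustration only; the labels follow the 1 = yes, 2 = no coding from the question.

```python
def roc_points(labels, scores, positive=1):
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    # Rank examples by descending confidence for the positive class.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        # Examples with the same score cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == positive:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


# Hypothetical toy data: the two most confident "yes" predictions are wrong.
labels = [2, 2, 1, 1, 2, 1, 2, 1]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
# The first three points are (0.00, 0.00), (0.25, 0.00), (0.50, 0.00):
# the curve runs flat along the x-axis before it starts to rise.
```

If a real curve looks like this, it may be worth inspecting the examples with the highest confidence for the positive class and checking their true labels.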

I say "normally" because this is what would happen with a correct implementation of ROC curves and ROC analysis. However, in my experience, ROC analysis is unreliable in RapidMiner, including in the latest non-free professional version 6. So I would not use anything related to ROC curves from RapidMiner in my analyses, even if I paid $2999+ per year for this software.

See this for some related discussion:
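One way to cross-check a reported AUC is to recompute it independently from the true labels and the confidences for the positive class. The sketch below assumes those two columns have been exported from the tool somehow (the export step is not shown and is an assumption), and the function name `auc_from_scores` is hypothetical. It uses the rank-sum (Mann-Whitney) formulation with average ranks for ties, which matters for decision trees because every example falling into the same leaf gets the same confidence.

```python
def auc_from_scores(labels, scores, positive=1):
    """AUC via the rank-sum statistic; tied scores receive their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)

    i = 0
    while i < len(order):
        # Find the run of examples sharing the same score.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            j += 1
        # Average of the 1-based ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[order[k]] = avg_rank
        i = j

    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == positive)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Hypothetical toy examples (1 = yes, 2 = no), not real results.
auc_perfect = auc_from_scores([1, 1, 2, 2], [0.9, 0.8, 0.2, 0.1])
auc_tied = auc_from_scores([1, 2], [0.5, 0.5])
auc_misranked = auc_from_scores(
    [2, 2, 1, 1, 2, 1, 2, 1],
    [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30],
)
print(auc_perfect, auc_tied, auc_misranked)
```

A value from a check like this that differs noticeably from the tool's figure would point at the ROC implementation, or at how ties are handled, rather than at the model itself.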

http://rapid-i.com/rapidforum/index.php/topic,7502.0.html

Dan