I am not sure how to insert an image here, so a description will have to do: I got a classification model showing an ideal ROC curve, so it should have an AUC (area under the ROC curve) equal to 1; however, RM displays an AUC of 0.5. This seems to be a bug.
Sebastian, please find the process appended here. It contains generated data and data sampling for model evaluation, so, theoretically speaking, randomness is involved. In practice, however, one expects to get an ideal confusion matrix (accuracy = 1) and an ideal ROC curve; surprisingly, it comes with an AUC = 0.5. If you do not get this, let me know and I will email you the image files with the ROC curve and the confusion matrix.
Regards, Dan
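For readers without the attached process, an analogous setup can be sketched outside RM in Python with scikit-learn: trivially separable generated data, a held-out sample for evaluation, and a decision tree. This is an illustration only; the attached RapidMiner process is the authoritative version, and the data below is invented for the sketch.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

# Two well-separated classes stand in for the generated data
X = [[i] for i in range(50)] + [[i] for i in range(100, 150)]
y = [0] * 50 + [1] * 50

# Hold out a sample for evaluation, as the process does
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]
print(confusion_matrix(y_te, clf.predict(X_te)))  # no off-diagonal errors
print(roc_auc_score(y_te, scores))                # 1.0 for a perfect classifier
```

Because the classes are separated by a wide margin, any learned split lands between them, so the held-out evaluation is perfect regardless of the random sampling.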
Hi Dan,
I understand that this seems hard to believe, but as far as I can see the calculation is indeed correct:
- if you only have the reference points (0,0) and (1,1), the trapezoidal calculation of the AUC delivers exactly half of the rectangle, which results in 0.5
- the optimistic calculation is also easy to understand: here the upper bounds for each rectangle are used, and this results in 1.0
- the one thing which might be surprising is why the pessimistic calculation also results in 1 and hence is better than 0.5: here the lower rectangles are used, which in this case are exactly the same rectangles as in the optimistic case
Cheers, Ingo
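For reference, the trapezoidal and optimistic calculations described above can be sketched in plain Python. This is an illustration of the rules, not RapidMiner's actual implementation; the function names are made up for the sketch.

```python
def auc_trapezoidal(points):
    """Trapezoidal rule over (FPR, TPR) points sorted by FPR."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def auc_optimistic(points):
    """Optimistic variant: each segment contributes its upper rectangle."""
    return sum((x1 - x0) * max(y0, y1)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

pts = [(0.0, 0.0), (1.0, 1.0)]     # only the two reference points
print(auc_trapezoidal(pts))        # 0.5 - half of the unit square
print(auc_optimistic(pts))         # 1.0 - the full upper rectangle
```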
Thanks for the explanation. Actually, the ROC curve in this case contains the point (0,1), the so-called "perfect classification" point - see http://en.wikipedia.org/wiki/Receiver_operating_characteristic. So you have the points (0,1) and (1,1) in the curve graph.
You can also see the drawing of the ROC produced by RM in this case: indeed, the area under this curve is 1. Therefore the AUC indicator should be calculated as 1.
Moreover, please note that an AUC of 0.5 is achieved in general by random classifiers (which provide, for instance, an equal number of good and bad answers, assuming the positive and negative classes are the same size). That does not fit the particular decision tree I provided, which happens to be a perfect classifier (accuracy = 1).
Also, it is widely accepted that AUC is one of the indicators of the quality of a binary classifier. As said, the above decision tree is a perfect classifier, so it is natural for it to have the highest AUC, as opposed to an AUC = 0.5.
So everything indicates that the AUC should be calculated as 1 here. This would also be consistent with the optimistic and pessimistic calculations.
Best, Dan
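The same trapezoidal rule, applied to a curve that does include the point (0,1), supports this argument. Again a plain-Python sketch, not RM code:

```python
def auc_trapezoidal(points):
    """Trapezoidal rule over (FPR, TPR) points sorted by FPR."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# A perfect classifier's ROC passes through (0, 1):
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(auc_trapezoidal(perfect))     # 1.0, not 0.5

# A random classifier's ROC is the diagonal:
diagonal = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
print(auc_trapezoidal(diagonal))    # 0.5
```

So the 0.5 result only appears when the (0,1) threshold point is dropped before the trapezoidal rule is applied.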
Hi,
"You can also see the drawing of the ROC produced by RM in this case: indeed, the area under this curve is 1. Therefore the AUC indicator should be calculated as 1."
OK, that's weird; I didn't check this. If the point (0,1) is also part of the thresholds (are you sure it is, or is it just the painting?) then indeed I would also expect the AUC to be 1. You could file a bug in our community bug tracker in this case.
Cheers, Ingo
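One way to check whether (0,1) really shows up among the threshold points - independently of RM - is scikit-learn's roc_curve on perfectly separable scores. This is a cross-check sketch with invented scores, not RM's implementation:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Perfectly separable scores: every positive outranks every negative
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))             # contains the point (0.0, 1.0)
print(roc_auc_score(y_true, y_score))  # 1.0
```

Here (0,1) is a genuine threshold point, not just a painting artifact, and the AUC comes out as 1.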
Given that the author of the above code has recently posted thus:
"However, perhaps this suggestion may be useful to consider after the ROC analysis implemented in RapidMiner has been revised, as it is still unreliable in this package (i.e. the AUC calculation needs corrections, as I have shown on the forum http://rapid-i.com/rapidforum/index.php?PHPSESSID=18d6261d2d63b2ca946477f03c2552bc&topic=2237.0 , and the Find Threshold operator does not find the best threshold as expected but provides suboptimal solutions - I emailed a complete report to the RM development team, with relevant processes illustrating this)."
Could you post the process generating this data here?
Greetings,
Sebastian
Given that the author of the above code has recently posted thus, here:
http://rapid-i.com/rapidforum/index.php/topic,2584.msg10537.html#msg10537
I took another look at the code and noticed that we have binominal mapping/remapping of the label, without view creation, which changes the underlying data, generates an error, and is not necessary for the learner, like this... Disable these operators and the warnings disappear, and a rather different result emerges, like this...
Or am I missing something?
Toodle Pip!
Just checked, and this RM error in calculating the AUC has not been corrected since this was posted.
Here is a reminder: http://rapid-i.com/rapidforum/index.php/topic,6871.msg24166.html#msg24166
As one of the participants in this discussion asked, "am I missing something?" - yes, perhaps understanding the essential thing. RM still makes this AUC calculation error two years later. Toodle Pip.
Dan
PS: By the way, AUC is the area under the ROC curve. As reported to the Rapid-I team some time ago, RM produces some wrong results within the ROC analysis too.