ROC curve from models using k-nearest neighbor algorithms

xiaodongxiaodong Member Posts: 2 Contributor I
edited November 2018 in Help
Hi,

I am confused by the ROC curve obtained from models using k-nearest neighbor algorithms(k-nn) in rapidminer. Since there is no determinant (usually denoted as tau in machine learning I guess) in k-nn, how could rapidminer tune the threshold between (two) different classes? Is there any special implementation?

Thanks a lot!

Best wishes,
xiaodong

Answers

  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Xiaodong,

    I don't get your point - what's the matter with the ROC-Plots?

    On application, the k-NN model creates confidences based on the mean of the k nearest neighbors. For k=1, the confidences can be only 0 or 1, leading to an ugly, but correct ROC plot. For larger values of k, the granularity of the confidence values increases.

    Best,
    Marius
  • xiaodongxiaodong Member Posts: 2 Contributor I
    Hi Marius,

    Thank you for your answer. I didn't understand how the ROC plot was created for k-nn. Since in this algorithm, there is no decision threshold at all and the instance will be classified as the dominant class of k nearest instances to it, how can rapidminer tune the decision thresholds to get the curve (  there should be only one dot if I am correct).

    Best,
    Xiaodong
  • wesselwessel Member Posts: 537 Maven
    An ROC plot is created by sorting all predictions based on their confidence.
    You then start in the bottom left corner, and move right for every mistake, and move up for every correct classification.
    You also do the ROC the other way around starting from the top right corner going down and left.
    This results in two curves; the pessimistic and optimistic curve, typically you average both curves.
    Note that rapid miner will show you both the optimistic and pessimistic curve.

    You don't run classifiers for various confidence thresholds to create an ROC curve, although it is possible to do this, but extremely computationally expensive.
    Although I don't see why this is different for k-NN.
    For example if you have 3-NN you can make a rule; only classify as False if 3 neighbors are of class False.
    A less sensitive rule would only require 2 out of 3 neighbors to be of class False.
    Alternatively, you can devise some metric that takes into account the distance of each neighbor to compute a level of confidence.

    For multinomial classification problems, you have even more curves, two curve for each class.
    If you want to compute the area under the curve you first compute the area for each class, and then aggregate to get the mean area of all classes.
    Note that in rapid miner you have to convert the problem by hand to 1-against-all classification, to get ROC for multinomial classification problems.

    Finally, please note that rapid miner show a colored background to show the boundaries of different curves as obtained using cross validation.
    Its easier to make sense of this if you set the number of folds used in cross validation to '2'.
    In this scenario the average is always exactly in the middle of the 2 boundaries.

    Best regards,

    Wessel
Sign In or Register to comment.