10-22-2016 04:10 PM
I was given a labelled data set and told that a few of the labels are wrongly assigned, i.e. some of the examples were graded inaccurately. I'm supposed to find which ones. Which tool in RapidMiner should I use?
I tried the operator Find Outliers (Density), but somehow I feel that is not the one I'm looking for.
Thank you very much for your advice. Markéta
10-24-2016 01:41 PM
Here is an idea: train a model on the data set that generalizes well (no overfitting, so for example no k-NN with only 1 neighbor, since that would simply reproduce every label, wrong ones included), and then apply this model to the training data set again. Whenever the prediction differs from the label, that example is a good candidate for being wrongly labeled.
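To make the idea concrete outside of RapidMiner, here is a rough Python sketch. It is only an illustration under my own assumptions: the "model" is a deliberately simple nearest-centroid classifier (my choice, not something RapidMiner prescribes), and the function names and toy data are made up for this example.

```python
import math
from collections import defaultdict


def fit_centroids(rows, labels):
    """Return {label: mean feature vector} computed from the labelled rows."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for i, value in enumerate(row):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, row):
    """Predict the label whose centroid is closest to the row."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))


def suspect_labels(rows, labels):
    """Indices of rows where the model's prediction disagrees with the label."""
    centroids = fit_centroids(rows, labels)
    return [
        index
        for index, (row, label) in enumerate(zip(rows, labels))
        if predict(centroids, row) != label
    ]


# Hypothetical toy data: rows 3 and 7 carry the "wrong" label on purpose.
toy_rows = [
    (1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (1.1, 1.0),
    (5.0, 5.1), (5.2, 4.9), (4.8, 5.0), (5.1, 5.2),
]
toy_labels = ["A", "A", "A", "B", "B", "B", "B", "A"]

suspects = suspect_labels(toy_rows, toy_labels)
print(suspects)  # [3, 7]
```

The flagged rows are only candidates; each one still needs a manual check, because a disagreement can also mean the model is wrong rather than the label.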
Just my 2c,
10-25-2016 10:54 AM
Another potential approach would be to run a clustering analysis on each labeled class separately and then look for individual outliers within each class.
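A very reduced Python sketch of that per-class idea, in case it helps. Note the simplification: instead of a real clustering step, each class is treated as a single cluster around its mean, and a point is flagged when it lies much further from that centre than is typical for its class. The function name, the factor of 3 and the toy data are my own assumptions.

```python
import math
from collections import defaultdict
from statistics import mean, median


def per_class_outliers(rows, labels, factor=3.0):
    """Flag rows lying unusually far from the centre of their own class.

    A row is flagged when its distance to the class mean exceeds
    `factor` times the median distance within that class.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    flagged = []
    for indices in by_class.values():
        # Column-wise mean of all rows carrying this label.
        centre = [mean(column) for column in zip(*(rows[i] for i in indices))]
        distances = {i: math.dist(rows[i], centre) for i in indices}
        cutoff = factor * median(distances.values())
        flagged.extend(i for i in indices if distances[i] > cutoff)
    return sorted(flagged)


# Hypothetical toy data: row 5 sits in the "B" region but is labelled "A".
sample_rows = [
    (1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (0.9, 0.8), (6.0, 6.0),
    (6.0, 6.1), (5.9, 5.8), (6.2, 6.0), (6.1, 5.9), (5.8, 6.2),
]
sample_labels = ["A"] * 6 + ["B"] * 5

outliers = per_class_outliers(sample_rows, sample_labels)
print(outliers)  # [5]
```

With classes that form several groups, a proper clustering per class would be needed before measuring distances; the single-centre version above only works when each class is roughly one blob.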