Imbalanced dataset
Hi,
Most of my datasets are imbalanced, at roughly 1 in 10: about 10% positive class and the remaining 90% negative class. I am really stuck on this problem. I have been reducing the number of negative examples to match the positive ones, but that gives me datasets that are not representative of the whole set. How can I deal with this in RapidMiner?
Please help. I know many machine learning techniques in some detail, but when I actually apply them I don't get good results.
Thanks
Answers
Having an unbalanced dataset is indeed the normal case. Your success depends on the learning scheme you apply and the performance criteria you consider.
Last but not least, RapidMiner offers some sampling operators that let you re-balance the data.
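To illustrate why the choice of performance criterion matters, here is a minimal sketch in plain Python (not RapidMiner). It assumes a made-up set of 100 labels with 10% positives and a trivial model that always predicts the negative class. The function names are hypothetical helpers written for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # With no true positives, false positives, or false negatives, report 0.0
    return 2 * tp / denom if denom else 0.0


# Assumed toy data: 10 positives (1) and 90 negatives (0)
y_true = [1] * 10 + [0] * 90

# A "model" that ignores its input and always predicts the majority class
always_negative = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, always_negative)
baseline_f1 = f1_score(y_true, always_negative)

print("accuracy:", baseline_accuracy)
print("F1 (positive class):", baseline_f1)
```

The always-negative model looks good on accuracy because it gets every majority example right, yet it never finds a single positive, which the F1 score on the positive class exposes.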
I'm using the dataset from the Kaggle competition "Don't Get Kicked!" (https://www.kaggle.com/c/DontGetKicked), in which the classes are imbalanced at a ratio of approximately 1 to 7. For this problem I used over-sampling: I bootstrapped the minority label to generate as many positive examples as there are negative ones, and as the performance criterion I used a kind of F1 score.
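The over-sampling step described above might look roughly like the following in plain Python. This is only a sketch under assumptions: the data is a made-up list with a 1 to 7 ratio rather than the Kaggle set, and `oversample_minority` is a hypothetical helper, not a RapidMiner operator.

```python
import random


def oversample_minority(rows, labels, minority_label, seed=0):
    """Bootstrap the minority class up to the size of the majority class.

    Minority examples are drawn with replacement until there are as many
    of them as there are majority examples; the majority is left untouched.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    minority = [(r, l) for r, l in zip(rows, labels) if l == minority_label]
    majority = [(r, l) for r, l in zip(rows, labels) if l != minority_label]
    if not minority:
        raise ValueError("no examples with the minority label")

    # Bootstrap sample: draw with replacement, one draw per majority example
    resampled = rng.choices(minority, k=len(majority))

    balanced = majority + resampled
    rng.shuffle(balanced)
    new_rows = [r for r, _ in balanced]
    new_labels = [l for _, l in balanced]
    return new_rows, new_labels


# Assumed toy data: 10 positives and 70 negatives (ratio 1 to 7)
rows = list(range(80))
labels = [1] * 10 + [0] * 70

bal_rows, bal_labels = oversample_minority(rows, labels, minority_label=1)
print(bal_labels.count(1), bal_labels.count(0))
```

One caveat with this approach: the resampling should be applied only to the training portion of the data. If duplicated positives also end up in the test portion, the F1 score will be too optimistic.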
You mentioned that RapidMiner has some operators that facilitate this task. Please tell me which ones; any advice is welcome.
Thank you very much