
Newbie - expected performance output after using the Sample operator

AmsDani Member Posts: 3 Contributor I
Hi, sorry for the beginner's question... I have a data set with 30,000 rows. The target variable is imbalanced: 24,000 false / 6,000 true. So I used the Sample operator to balance it (1,000 of each). At the end, the Performance (Classification) operator gives a confusion matrix with only 2,000 results (from the sample). I was expecting the evaluation (the TP/TN/FP/FN totals) to be based on the entire dataset (30,000 rows in total), so I could evaluate costs as well (with the Performance (Costs) operator). What have I missed? Maybe I connected the wrong input/output ports? Any tips on where it can go wrong? I have tried many ways... Thanks in advance for your help!

Best Answers

  • jacobcybulski Member, University Professor Posts: 391   Unicorn
    Solution Accepted
    As you selected only 2,000 examples for model building and validation, that is what you get in the confusion matrix. However, since you use cost as a method of model evaluation, you can also use a cost-sensitive model to deal with the class imbalance, e.g. a decision tree. I assume the cost of misclassifying the minority class is high (e.g. the positive case, when it represents fraud) and the cost of misclassifying the majority class is low (the negative case). When the cost structure is set up in this way, during model training the importance of the majority class can be weighted down in favour of the minority class, thus overcoming the problem of class imbalance.
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 659   Unicorn
    Solution Accepted
    Another way to solve this is moving the sampling *into* the training phase of the cross validation. That way, you're building balanced models, but still validating on all data. 
    Also, sampling before the validation leaks additional "knowledge" into the modeling process that you won't have later when applying the model to new data.

    Regards,
    Balázs
  • AmsDani Member Posts: 3 Contributor I
    Solution Accepted
    Thanks for your answers! I will try it the way you proposed, Balázs!
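For readers who want to see the idea outside of RapidMiner's operators, here is a minimal scikit-learn sketch of both suggestions (an assumption for illustration only; the dataset is synthetic and this is not equivalent to the RapidMiner process): sampling happens *inside* each training fold, the tree is made cost-sensitive via class_weight, and validation runs on the untouched test folds, so the aggregated confusion matrix covers all 30,000 rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the 30,000-row imbalanced set (~24,000 false / ~6,000 true).
X, y = make_classification(n_samples=30_000, weights=[0.8, 0.2], random_state=42)

rng = np.random.default_rng(42)
cm_total = np.zeros((2, 2), dtype=int)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in cv.split(X, y):
    # Balance only the *training* fold: undersample the majority class.
    pos = train_idx[y[train_idx] == 1]
    neg = train_idx[y[train_idx] == 0]
    neg_down = rng.choice(neg, size=len(pos), replace=False)
    balanced = np.concatenate([pos, neg_down])

    # class_weight makes the tree cost-sensitive as well (jacobcybulski's point).
    model = DecisionTreeClassifier(class_weight="balanced", random_state=42)
    model.fit(X[balanced], y[balanced])

    # Validate on the full, untouched test fold.
    cm_total += confusion_matrix(y[test_idx], model.predict(X[test_idx]))

print(cm_total.sum())  # 30000 — every example appears exactly once in the evaluation
```

Because the test folds partition the whole dataset, the summed confusion matrix counts all 30,000 examples, which is exactly the evaluation the original question was after.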