How can I improve the performance of my model on an imbalanced dataset for a classification problem?

Samira_123 Member Posts: 9 Contributor I
edited May 2020 in Help
Hi,

This is my first time using RapidMiner. I have to do a classification task for an assignment.
The dataset is really imbalanced: 180 out of 12,800 donors donated in the past (class 1), and the remaining donors did not donate (class 0).

When I created and selected relevant attributes, the class precisions looked reasonable, but the class recall for class 1 was very poor, close to 8%.

However, when I used the 'Sample' operator to balance my dataset, both class recall and class precision were around 60%. I am not sure this is the right thing to do, because in the end I am left with 360 donors instead of 12,800.

In the end, I have to apply the model to a test set of more than 12,000 donors to predict which donors will donate.

Thank you

NB: My kappa is equal to 0.267

Answers

  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hello @Samira_123

    How are you validating your model? Is it cross-validation or split validation? 

    Sampling is fine when it is applied to the training set only; it is not recommended to apply sampling to the whole dataset. Since the dataset is small, you can try upsampling your minority class with the SMOTE operator from the Operator Toolbox extension (download it from the Marketplace) instead of downsampling.

    You can also try weighting your examples instead of sampling. This works only for a few algorithms, such as neural networks and decision trees. Weighting doesn't change your sample size but gives both classes equal total importance, and can be done with Generate Weights (Stratification). Check whether the algorithm you want to use accepts weights: right-click the operator, click Show Operator Info, and if there is a green tick next to "Weighted Examples", that algorithm supports weighting.
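
    For intuition outside RapidMiner, here is a minimal Python/scikit-learn sketch of the same stratified-weighting idea (scikit-learn is an assumption on my side, and X/y are toy placeholders, not your donor data):

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.utils.class_weight import compute_sample_weight

        # Toy imbalanced labels: 0 = did not donate, 1 = donated
        X = np.random.rand(1000, 5)            # placeholder features
        y = np.array([1] * 20 + [0] * 980)     # ~2% minority class

        # Analogue of Generate Weights (Stratification): each class gets
        # the same total weight, so the minority class is not ignored.
        weights = compute_sample_weight(class_weight="balanced", y=y)

        clf = DecisionTreeClassifier(max_depth=5)
        clf.fit(X, y, sample_weight=weights)   # only works if the learner accepts weights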

    Are you tuning the model's hyperparameters? Have you tried different algorithms?
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • Samira_123 Member Posts: 9 Contributor I
    edited May 2020
    Hi @varunm1  

    I used the 'Cross Validation' operator to validate my model. I had already tried to balance my dataset with Generate Weights (Stratification), since I saw on the forum that this could work, but the operator info says that 'Random Forest' (the operator I am using for the classification) will disregard the weights.

    Does the SMOTE operator need to be placed just before the cross-validation? 

    Thank you so much for your answer
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Yep, Random Forest doesn't accept weights. You should apply SMOTE or any sampling operator inside the training part of the cross-validation. If you apply it to the whole dataset, it will bias your model, and the model won't generalize to new data that might come in the future. You can also use Optimize Parameters (Grid) to search for good hyperparameters (number of trees, maximal depth, etc.) for the Random Forest. Also try different models such as gradient boosting, neural networks, SVM, etc.
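
    For illustration outside RapidMiner, here is a rough Python sketch of the same setup, assuming scikit-learn and imbalanced-learn are available; the donor data is replaced by a synthetic stand-in. The imblearn Pipeline applies SMOTE only to the training folds of each cross-validation split, and GridSearchCV plays the role of Optimize Parameters (Grid):

        from imblearn.over_sampling import SMOTE
        from imblearn.pipeline import Pipeline
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import GridSearchCV
        from sklearn.datasets import make_classification

        # Synthetic stand-in for the donor data: ~1.5% positives
        X, y = make_classification(n_samples=12800, n_features=10,
                                   weights=[0.985], random_state=42)

        # SMOTE lives inside the pipeline, so it is fit on the training
        # folds only; the test fold of each CV split stays untouched.
        pipe = Pipeline([
            ("smote", SMOTE(random_state=42)),
            ("rf", RandomForestClassifier(random_state=42)),
        ])

        # Rough analogue of Optimize Parameters (Grid)
        param_grid = {
            "rf__n_estimators": [100, 300],
            "rf__max_depth": [5, 10, None],
        }
        search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
        search.fit(X, y)
        print(search.best_params_, search.best_score_)
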
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • Samira_123 Member Posts: 9 Contributor I
    Hi @varunm1,

    Thank you for your advice :) 
    I've been trying to do what you said. My class precision is really good, but the class recall for class 1 is still very poor.
  • Samira_123 Member Posts: 9 Contributor I
    Hi @varunm1,

    Thank you. You were very helpful. 
    Wish you a good weekend :)
