Other ways to Validate results

dome Member Posts: 12 Newbie
Hello,

I have a data set of 84 rows and 400 attributes for a classification problem. I prepared the data so that I can apply a decision tree or other tree models. To evaluate and test the model I use the Performance operator, especially the accuracy. I split the data in a ratio of 80/20: 80% is the training set and 20% the test set.

The result of this model is an accuracy of 80%. When I change the split type, for example from stratified to shuffled, or the ratio from 80/20 to 70/30, the accuracy drops to 60%. Now my question:

Is this phenomenon normal? Is there any other way to validate a classification model? And, probably a bad question that can only be answered by seeing the process: why does the model accuracy vary so drastically just from changing the split ratio or split type?

Thanks a lot!

Best Answers

  • Options
    varunm1 Moderator, Member Posts: 1,207 Unicorn
    edited July 2019 Solution Accepted
    Hello @dome

    Yes, this can happen. Accuracy depends on the test data, so when the test data changes, the accuracy changes. This is why we recommend the Cross Validation operator: it splits the data into N folds, trains on N-1 folds, and tests on the left-out fold, repeating until every fold has been used for testing once. That gives you a more reliable performance estimate than a single split. As your data set is small, I recommend using either 3 or 5 folds in CV.

    Here is a detailed thread on the working of cross-validation.

    https://community.rapidminer.com/discussion/55112/cross-validation-and-its-outputs-in-rm-studio#latest

    Hope this helps. Please let me know if you need more info.
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing
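
The fold mechanics described in this answer can be sketched outside RapidMiner. The following is a minimal illustration in plain Python, not the Cross Validation operator itself: the function names, the toy labels (50 of class A, 34 of class B), and the majority-class scorer are all assumptions made up for the example.

```python
import random
from collections import Counter


def k_fold_indices(n_examples, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_examples, k, train_and_score, seed=0):
    """Train on k-1 folds, test on the held-out fold, and repeat for every fold."""
    folds = k_fold_indices(n_examples, k, seed)
    scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return scores


# Toy stand-in for an 84-row data set (labels only; the class counts are invented).
labels = ["A"] * 50 + ["B"] * 34


def majority_baseline(train_idx, test_idx):
    """Hypothetical scorer: always predict the most common training label."""
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)


scores = cross_validate(len(labels), k=5, train_and_score=majority_baseline)
print("per-fold accuracy:", [round(s, 2) for s in scores])
print("mean accuracy:", round(sum(scores) / len(scores), 2))
```

The per-fold scores differ from one another for the same reason a single 80/20 split does, while their mean uses every example for testing exactly once. With 84 rows, a 20% test set holds about 17 examples, so each misclassified example moves the accuracy by roughly 6 percentage points; on a test set of that size, the gap between 80% and 60% amounts to only three or four examples.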

  • Options
    varunm1 Moderator, Member Posts: 1,207 Unicorn
    Solution Accepted
    Hello @dome

    Here is when I use stratified or shuffled sampling.

    Stratified: When my classes are highly imbalanced and I want the same proportion of classes in all my folds. For example, say I have a data set of 100 examples, 80 of which belong to Class A and 20 to Class B. If I use stratified sampling with 5 folds, each fold will have 16 Class A and 4 Class B samples.

    Shuffled sampling: This randomly shuffles your examples and divides them into folds of 20 each; there won't be any class balancing in the folds.

    Now, why stratified and not shuffled?

    Sometimes shuffled sampling creates a fold with examples of only one class; we use stratified sampling to avoid this.

    Hope this helps
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing
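
The difference between the two sampling types can be checked on the 80/20 example from this answer. The sketch below is plain Python rather than RapidMiner's own sampling code; the helper names are hypothetical, and the stratified version simply deals each class round-robin across the folds, which is one simple way to keep class proportions equal.

```python
import random
from collections import Counter


def shuffled_folds(labels, k, seed=0):
    """Shuffle all examples together and deal them into k folds, ignoring class."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def stratified_folds(labels, k, seed=0):
    """Shuffle within each class, then deal each class round-robin over the folds."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        for position, index in enumerate(class_indices):
            folds[position % k].append(index)
    return folds


def class_counts(labels, fold):
    """Count how many examples of each class ended up in one fold."""
    return Counter(labels[i] for i in fold)


# The example from the answer: 100 examples, 80 of Class A and 20 of Class B.
labels = ["A"] * 80 + ["B"] * 20

for name, make_folds in [("shuffled", shuffled_folds), ("stratified", stratified_folds)]:
    counts = [class_counts(labels, fold) for fold in make_folds(labels, k=5)]
    print(name, [(c["A"], c["B"]) for c in counts])
```

Every stratified fold holds 16 Class A and 4 Class B examples, while the shuffled folds still have 20 examples each but their class mix depends on the random seed. On a small or imbalanced data set that seed-dependent mix is one source of the accuracy swings described in the question.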

Answers

  • Options
    dome Member Posts: 12 Newbie
    Hi,

    Yes, that helps a lot. Thanks!

    Another question:
    I know the difference between stratified and shuffled sampling. When do I use which? What should I use in my case, and why?

    Thank you!