
Other ways to Validate results

dome Member Posts: 12 Learner I

I have a data set of 84 rows and 400 attributes for a classification problem. I prepared the data so that I can train a decision tree or other tree models. To evaluate and test the model I use the Performance operator, especially the accuracy. I split the data in a ratio of 80/20: 80% is the training set and 20% is the test set.

The result of this model is an accuracy of 80%. When I change the split type, for example from stratified to shuffled, or the ratio from 80/20 to 70/30, the accuracy drops to 60%. Now my question:

Is this phenomenon normal? Is there any other way to validate a classification model? And, probably a bad question that can only be answered by seeing the process: why does the model accuracy vary so drastically with just the split ratio or split type?

Thanks a lot!

Best Answers

  • varunm1 Member Posts: 1,207 Unicorn
    edited July 2019 Solution Accepted
    Hello @dome

    Yes, it is possible. Accuracy depends on the test data, so if the test data changes, the accuracy changes. This is why we recommend the Cross Validation operator: it splits the data into N folds, trains on N-1 folds, and tests on the left-out fold, repeating until every fold has been used once for testing. The performance averaged over the folds is more reliable than that of a single split. As your data set is small, I recommend using either 3 or 5 folds in CV.
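
To make the fold mechanics concrete, here is a minimal sketch in plain Python rather than a RapidMiner process. With 84 rows, a 20% test set holds only about 17 examples, so each misclassified example moves the accuracy by roughly 6 percentage points. The helper names, the made-up 50/34 class balance, and the majority-class "model" are all assumptions for illustration; they are not what the Cross Validation operator runs internally.

```python
import random
from statistics import mean, stdev

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority_class(train_labels):
    """Hypothetical stand-in model: always predict the most common training label."""
    return max(sorted(set(train_labels)), key=train_labels.count)

def cross_validate(labels, k=5, seed=0):
    """Return one accuracy per fold: train on k-1 folds, test on the held-out fold."""
    scores = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train = [labels[i] for i in range(len(labels)) if i not in held_out]
        prediction = majority_class(train)
        correct = sum(labels[i] == prediction for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores

# 84 rows as in the question; the class balance here is invented.
labels = ["A"] * 50 + ["B"] * 34
folds = k_fold_indices(len(labels), 5)
scores = cross_validate(labels, k=5)

print("fold sizes:", [len(f) for f in folds])
print("per-fold accuracy:", [round(s, 2) for s in scores])
print("mean:", round(mean(scores), 2), "std dev:", round(stdev(scores), 2))
```

The spread between the per-fold accuracies is the same effect you saw when changing the split; the mean over all folds is the number to report.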

    Here is a detailed thread on the working of cross-validation.


    Hope this helps. Please let me know if you need more information.

    Be Safe. Follow precautions and Maintain Social Distancing

  • varunm1 Member Posts: 1,207 Unicorn
    Solution Accepted
    Hello @dome

    Here is when I use stratified or shuffled sampling.

    Stratified: I use it when my classes are highly imbalanced and I want the same proportion of classes in all my folds. For example, take a data set of 100 examples, 80 of which belong to Class A and 20 to Class B. If I use stratified sampling with 5 folds, each fold will have 16 Class A and 4 Class B samples.

    Shuffled sampling: This randomly shuffles your examples and divides them into folds of 20 each; there won't be any class balancing in the folds.

    Now, why stratified and not shuffled?

    Sometimes shuffled sampling will create a fold with examples of only one class; to avoid this, we use stratified sampling.
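
The difference between the two sampling types can be checked with a small sketch in plain Python, using the 80/20 example from above. The function names are hypothetical and the round-robin dealing is just one simple way to stratify; it is an assumption, not necessarily how RapidMiner implements it.

```python
import random
from collections import Counter, defaultdict

def shuffled_folds(labels, k, seed=0):
    """Shuffle all row indices, then deal them into k folds, ignoring class."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def stratified_folds(labels, k, seed=0):
    """Shuffle within each class, then deal each class round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

def class_counts(labels, folds):
    """Count how many examples of each class landed in each fold."""
    return [dict(Counter(labels[i] for i in fold)) for fold in folds]

labels = ["A"] * 80 + ["B"] * 20
stratified = class_counts(labels, stratified_folds(labels, 5))
shuffled = class_counts(labels, shuffled_folds(labels, 5))

print("stratified:", stratified)  # every fold holds 16 A and 4 B
print("shuffled:  ", shuffled)    # 20 rows per fold, class mix varies
```

With stratified folds the class proportions are fixed by construction; with shuffled folds they depend on the random seed, and on a small or highly imbalanced data set a fold can end up with very few (or no) examples of the minority class.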

    Hope this helps



  • dome Member Posts: 12 Learner I

    Yes, that helps a lot. Thanks!

    Another question:
    I know the difference between stratified and shuffled sampling. Which one do I use when? Which should I use in my case, and why?

    Thank you!