Auto Model and overfitting

dgarrard RapidMiner Certified Analyst Posts: 4 Contributor I
edited June 2019 in Help

I've been experimenting with Auto Model for Prediction and am generally happy with the concept and results.  

 

In the Auto Model process the sampling is set to 80/20. Is this sufficient to control potential overfitting? I am getting performance ranging from about 60% accuracy for Naive Bayes to 87% accuracy for GBT. I have fewer than 1,000 rows of data and 20 attributes for each data set, and GBT is generating about 20 trees. (I would potentially be operationalising with hundreds of datasets and a dedicated model per dataset.)
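
For reference, here is a minimal sketch of an equivalent setup in Python with scikit-learn (my own illustration of the general approach, not what Auto Model runs internally; the data is synthetic and the parameters only mirror the numbers above):

    # Minimal sketch: 80/20 holdout with a ~20-tree GBT on ~1,000 rows
    # and 20 attributes. Synthetic stand-in data, not Auto Model internals.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

    # 80/20 split, mirroring Auto Model's default sampling.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    gbt = GradientBoostingClassifier(n_estimators=20, random_state=42)
    gbt.fit(X_train, y_train)

    print("held-out accuracy:", accuracy_score(y_test, gbt.predict(X_test)))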

 


Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @dgarrard - I think it is always prudent to be on alert for overfitting, regardless of whether you're using Auto Model or the "normal" RapidMiner methods. We all know that some models, such as neural networks, are prone to overfitting and should be used with caution, particularly on small data sets.

     

    My personal opinion is that the 80/20 split is widely used and, in general, a reasonable ratio; it should be sufficient to guard against overfitting when used in conjunction with methods such as cross-validation (which is the default in Auto Model).
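
    To make that concrete, here is a small sketch of 10-fold cross-validation in Python/scikit-learn (an illustration of the idea only, not the process Auto Model builds): every row is used for testing exactly once, so the estimate depends far less on one lucky or unlucky 80/20 split.

        # Sketch: 10-fold cross-validation as a more stable alternative
        # to judging a model on a single 80/20 holdout. Synthetic data.
        from sklearn.datasets import make_classification
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.model_selection import cross_val_score

        X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
        gbt = GradientBoostingClassifier(n_estimators=20, random_state=0)

        scores = cross_val_score(gbt, X, y, cv=10, scoring="accuracy")
        print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))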

     

    In the end, I always view results with skepticism, irrespective of the tool used, until I actually inspect how the "fit" looks on unseen data.

     

    Hope that helps.


    Scott

     

  • dgarrard RapidMiner Certified Analyst Posts: 4 Contributor I

    Thank you for the quick reply, Scott. I'll try to get some testing done in the next couple of weeks while my Auto Model trial is still available!

     

    David 

  • tkaiser Member Posts: 8 Contributor I

    Hi, this is very helpful, thank you. But I do have a follow-up question: is Auto Model showing testing-set accuracy or training-set accuracy in the results view? I ran a GBT in Auto Model on 4,500 rows of data with 15 features and received an "accuracy" of 90% and an f-measure of 84%, but when I applied the model to new, unseen data (which I purposely held out from the training and cross-validation process), the accuracy declined to below 50%. So I am not sure whether I am running the validation process incorrectly, or perhaps not understanding what the results of the CV are telling me - I had expected Auto Model to produce an accuracy rate reflective of how well the model will perform in the future. Thanks much.

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    Hi,

     

    sorry for the delay, I missed this one here. It shows the testing error, of course. If you read my correct validation opus linked above, you will see that we would NEVER care about training errors in the first place ;-)

     

    Such a drop can be caused either by a (significant) change in data distribution between the training and validation sets or, what I personally find more likely given the size of the drop, by not applying exactly the same data preparation to your validation set. More about this in the other thread here:

     

    https://community.rapidminer.com/t5/RapidMiner-Auto-Model-Turbo-Prep/Is-auto-model-showing-test-or-train-error/m-p/50902/highlight/false#M117
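
    For illustration, here is a hedged sketch in Python/scikit-learn of what "exactly the same data preparation" means in practice (my analogy, not RapidMiner code): every preprocessing step is fitted on the training rows only and then replayed, unchanged, on the held-out rows. Re-fitting (or skipping) the preparation on the validation data is exactly the mismatch that produces drops like the one described above.

        # Sketch: keep every preparation step inside one pipeline so the
        # held-out data receives exactly the same transformation as training.
        from sklearn.datasets import make_classification
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.metrics import accuracy_score
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import StandardScaler

        X, y = make_classification(n_samples=4500, n_features=15, random_state=1)
        X_train, X_hold, y_train, y_hold = train_test_split(
            X, y, test_size=0.2, random_state=1)

        model = Pipeline([
            ("scale", StandardScaler()),  # fitted on training rows only
            ("gbt", GradientBoostingClassifier(random_state=1)),
        ])
        model.fit(X_train, y_train)

        # predict() reapplies the fitted scaler before scoring the new rows.
        print("hold-out accuracy:",
              accuracy_score(y_hold, model.predict(X_hold)))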

     

    Hope this helps,

    Ingo
