Auto Model in RapidMiner

hoonhoon Member Posts: 1 Newbie
edited June 2019 in Help
Hi RapidMiner community, it is great and fascinating to be introduced to Auto Model. How can I identify an overfitting model in RapidMiner Auto Model?


  • Options
    IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Welcome to the Community :-)
    The following discussion should be interesting for you: https://community.rapidminer.com/discussion/comment/48190#Comment_48190
    The short summary is that there will always be overfitting, and you detect it in Auto Model the same way as you do in general: you measure the validation accuracy, and if it gets worse for more complex models, you are in overfitting-land. There are a lot of things you can screw up in validation though (most people and tools do). The most frequent error is that people validate only the actual machine learning model building but not the impact of data preprocessing. But rest assured that Auto Model takes good care of all of that for you, so the performances you see are true and correctly validated.
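    To illustrate the general idea outside of Auto Model, here is a minimal sketch in plain Python. It is not what Auto Model runs internally. The synthetic data, the 25% label noise, and the helper names (`make_data`, `knn_predict`, `accuracy`) are all assumptions made up for this example. It fits a k-nearest-neighbor classifier at several complexity levels (a smaller k means a more complex model) and prints training and validation accuracy side by side.

```python
import heapq
import random


def make_data(n, noise, rng):
    """Synthetic 1-D data: true label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > 0.5 else 0
        if rng.random() < noise:
            y = 1 - y  # label noise that a model can only memorize, not learn
        data.append((x, y))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k is odd, so no ties)."""
    neighbors = heapq.nsmallest(k, train, key=lambda p: abs(p[0] - x))
    votes = sum(y for _, y in neighbors)
    return 1 if 2 * votes > k else 0


def accuracy(train, data, k):
    """Fraction of points in `data` that a k-NN model built on `train` labels correctly."""
    hits = sum(1 for x, y in data if knn_predict(train, x, k) == y)
    return hits / len(data)


rng = random.Random(42)
train = make_data(400, 0.25, rng)
validation = make_data(800, 0.25, rng)  # held out, never used for fitting

# Map k -> (training accuracy, validation accuracy)
results = {}
for k in (1, 5, 25, 101):
    results[k] = (accuracy(train, train, k), accuracy(train, validation, k))
    print(f"k={k:3d}  train={results[k][0]:.3f}  validation={results[k][1]:.3f}")
```

    With k=1 every training point is its own nearest neighbor, so the training accuracy looks perfect while the validation accuracy is clearly lower. That gap, together with validation accuracy dropping as k shrinks, is the signature described above.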
    If you want to dive a bit deeper on the topic, I also would recommend this "little" white paper I wrote on the topic of correct validation some time ago: https://rapidminer.com/resource/correct-model-validation/
    My one-line recommendation is to be less concerned about overfitting (it always happens!) and more concerned about correct validation, since this guarantees that there are no negative surprises down the road.
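    As a rough illustration of why validating the preprocessing matters, the sketch below uses pure-noise data, so no model can truly beat chance. Everything in it is an assumption chosen for the example: the feature-selection rule, the nearest-centroid classifier, and the names `select_features`, `nearest_centroid_predict`, and `cross_validate`. It compares selecting features once on all data (preprocessing outside the validation) with selecting them inside each training fold.

```python
import random


def select_features(rows, labels, n_keep):
    """Keep the n_keep features with the largest absolute difference in class means."""
    scores = []
    for j in range(len(rows[0])):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append((abs(sum(pos) / len(pos) - sum(neg) / len(neg)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:n_keep]]


def nearest_centroid_predict(train_rows, train_labels, features, row):
    """Predict the class whose centroid (on the selected features) is closest to `row`."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        members = [r for r, y in zip(train_rows, train_labels) if y == label]
        centroid = [sum(r[j] for r in members) / len(members) for j in features]
        dist = sum((row[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cross_validate(rows, labels, n_keep, n_folds, select_inside_folds):
    """k-fold accuracy; feature selection is done per fold or once on all data."""
    n = len(rows)
    # Selecting on ALL rows lets the test folds influence preprocessing (the leak).
    global_features = select_features(rows, labels, n_keep)
    correct = 0
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if i % n_folds != fold]
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        if select_inside_folds:
            features = select_features(train_rows, train_labels, n_keep)
        else:
            features = global_features
        for i in test_idx:
            pred = nearest_centroid_predict(train_rows, train_labels, features, rows[i])
            correct += pred == labels[i]
    return correct / n


random.seed(0)
n_samples, n_features = 60, 600
rows = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
labels = [i % 2 for i in range(n_samples)]  # labels are unrelated to the features

leaky_acc = cross_validate(rows, labels, n_keep=10, n_folds=5, select_inside_folds=False)
honest_acc = cross_validate(rows, labels, n_keep=10, n_folds=5, select_inside_folds=True)
print(f"selection outside validation: {leaky_acc:.2f}")
print(f"selection inside each fold:   {honest_acc:.2f}")
```

    Because the labels are random, a trustworthy estimate should sit near chance. The variant that preprocesses before splitting tends to report a much rosier number, which is exactly the kind of negative surprise that shows up later on new data.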
    Hope this and the links above help.  Best,
  • Options
    Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Just to add a further comment here, if it were generally possible to identify specifically which parts of the model were only present from overfitting, then it would be easy to remove only those parts.  But that's unfortunately not how it works :-)
    As Ingo said, the main thing is to understand how the model is going to perform on unseen data in the future. That performance will include both the effects of accurately capturing replicable patterns and relationships in the data that should be present in all samples and the effects of overfitting to the idiosyncrasies of your development dataset. As long as you are using correct validation, you will have a pretty good estimate of that overall performance, but you won't be able to partition it out cleanly.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts