How to make Auto Model do cross-validation?

wanglu2014 Member Posts: 18 Contributor I
edited November 2018 in Help

Thanks for your attention. In Auto Model, the imported data is split into training and validation sets at a fixed ratio. However, to improve the reliability of the model, can we change the splitting process to cross-validation?

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,046   Unicorn

    Certainly: just open the process for the model you want, change it from split validation to cross-validation, and rerun.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Fatma Member Posts: 1 Newbie
    edited March 11
    Excuse me @Telcontar120, I have the same question and couldn't understand where to change the process from split validation to cross-validation. I'm very sorry, but I'm still a beginner with RapidMiner. I found the Split Data block; is this what you mean? If so, how do I split the data for, say, leave-one-out, or k=4 in k-fold cross-validation?
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,046   Unicorn
    No, what I meant is that once you have the process, you can select the Split Validation operator and replace it with the Cross Validation operator instead. This can be done by right-clicking on the Split Validation operator, or by manually adding the new Cross Validation operator, copying the operators out of the Split Validation into the Cross Validation, and then deleting the Split Validation operator. Same results. In both cases, just make sure you have wired up the internal operators correctly. See the cross-validation tutorial in the help if you need to double-check.
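    On the Cross Validation operator itself, the number of folds parameter is your k (e.g., 4), and as far as I recall there is also a leave-one-out option on the same operator. If it helps to see the concepts in code, here is a minimal scikit-learn sketch of the three schemes (an analogy only, and an assumption on my part, since RapidMiner processes are built from operators, not Python):

        # Illustration only: these scikit-learn calls mirror what the RapidMiner
        # operators do conceptually; they are not RapidMiner's implementation.
        from sklearn.datasets import load_iris
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import (KFold, LeaveOneOut,
                                             cross_val_score, train_test_split)

        X, y = load_iris(return_X_y=True)
        model = LogisticRegression(max_iter=1000)

        # Split validation: one train/test split at a fixed ratio (e.g. 70/30).
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                                  random_state=42)
        split_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

        # k-fold cross-validation with k=4: every row is tested exactly once.
        kfold_acc = cross_val_score(model, X, y,
                                    cv=KFold(n_splits=4, shuffle=True,
                                             random_state=42)).mean()

        # Leave-one-out: the extreme case where k equals the number of rows.
        loo_acc = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()

        print(f"split: {split_acc:.3f}  4-fold: {kfold_acc:.3f}  LOO: {loo_acc:.3f}")

    The point of the comparison: a single split gives you one performance number that depends on which rows happened to land in the test set, while k-fold averages over k different test sets.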
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • rfuentealba Moderator, RapidMiner Certified Analyst, Member Posts: 283   Unicorn
    Hello,

    Today I was showcasing RapidMiner Auto Model to a new coworker. With the Titanic dataset, if you select a Logistic Regression (that is the case I remember, but there might be others), there is no Split Validation operator at all. Instead, the process uses a Split Data operator at an early stage and applies the Performance operators at the end, which is what I call the manual way to perform validation.

    In that case, it is not as simple as swapping the operator. (Other cases are, though.)

    My advice would be to reorder the process and understand how it works, because while Auto Model is a great beginning for a data science project, it is still a beginning: the project still lacks proper documentation (Auto Model cannot generate the documentation for our domain expertise), removal of boilerplate steps (if our dataset doesn't have text, why handle text?), and adaptation of the process to our use cases.

    I know, this is not the kind of happy answer that magically solves our problems, and having to go through the process is especially frustrating for newcomers to RapidMiner, but please keep in mind that RapidMiner has a #noblackboxes philosophy that lets people go from nought to 60 in a few seconds while still having access to what the process does.

    (@Telcontar120, are you having the same déjà vu I had? Wasn't this the topic of our conversation when we first met?)

    Hope this helps,

    Rodrigo.
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,536  RM Founder
    Yip, that's right. By the way, the way we do performance estimation in AM is actually quite clever. The reason why cross-validation is a more robust estimator is that it reduces the dependency on a single test set being "easier" or "harder" for the trained model. We do something similar in AM by training a model on the majority of the data, then creating multiple hold-out sets, removing the outliers, and averaging the rest. On 80+ data sets with more than 1,000 rows I found only two examples where the difference between this approach and the performance delivered by a full-blown cross-validation was statistically significant. So for all practical purposes, especially in the early phase of a data science project, the validation approach of AM is pretty much as good as a full cross-validation but 5x - 10x faster.
    Don't get me wrong, I am not arguing against cross-validation, quite the opposite. I just wanted to point out that we came up with a practical approach that better balances run time with estimation robustness and which in my experience works well enough for most applications.
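    For the curious, a rough sketch of the idea in Python (my paraphrase under stated assumptions, not AM's actual implementation; the bootstrapped hold-out subsets and the 1.5x-IQR outlier rule are stand-ins for whatever AM really does):

        # Sketch of "train once, score multiple hold-out sets, trim outliers,
        # average" -- an assumed mechanism, NOT Auto Model's actual code.
        import numpy as np
        from sklearn.datasets import load_breast_cancer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        X, y = load_breast_cancer(return_X_y=True)

        # Train a single model on the majority of the data (one fit instead of
        # the k fits a k-fold cross-validation needs -- hence the speed-up).
        X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.4,
                                                  random_state=0)
        model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

        # Score several resampled hold-out subsets with the one trained model.
        rng = np.random.default_rng(0)
        scores = []
        for _ in range(7):
            idx = rng.choice(len(X_ho), size=len(X_ho), replace=True)
            scores.append(model.score(X_ho[idx], y_ho[idx]))
        scores = np.array(scores)

        # Remove outlier scores (here: outside 1.5x IQR) before averaging.
        q1, q3 = np.percentile(scores, [25, 75])
        keep = (scores >= q1 - 1.5 * (q3 - q1)) & (scores <= q3 + 1.5 * (q3 - q1))
        print(f"estimate: {scores[keep].mean():.3f} "
              f"(kept {keep.sum()} of {len(scores)} hold-out scores)")

    The model is fit once, so the cost scales like a single split, while averaging over several trimmed hold-out scores recovers much of cross-validation's robustness.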
    Just my 2c,
    Ingo
  • varunm1 Member Posts: 199   Unicorn
    edited March 12
    @IngoRM this looks great. For huge datasets, this method in AM works like a gem and also seems reliable based on your tests. I was initially a bit confused about why you used multiple hold-out sets when the data is split randomly, but now it's clear.
    Regards,
    Varun
  • SGolbert RapidMiner Certified Analyst, Member Posts: 257   Unicorn

    Nice to know that you looked thoroughly into the matter; I trust AM even more now.

    I think that once an adequate model is found in AM, one should train a new model with all the data in a new process, possibly with hyperparameter tuning.
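    As a sketch of that last step (scikit-learn as a stand-in of my own choosing; the RapidMiner equivalent would be something like Optimize Parameters with a Cross Validation inside): tune with cross-validation, then refit the winner on all rows:

        # Hypothetical follow-up outside Auto Model: tune hyperparameters via
        # cross-validation, then refit the best model on ALL the data.
        from sklearn.datasets import load_iris
        from sklearn.model_selection import GridSearchCV
        from sklearn.svm import SVC

        X, y = load_iris(return_X_y=True)
        search = GridSearchCV(
            SVC(),
            param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
            cv=5,        # 5-fold cross-validation for the parameter search
            refit=True,  # refit the winning parameters on the full data set
        )
        search.fit(X, y)
        print(search.best_params_, f"CV accuracy: {search.best_score_:.3f}")
        final_model = search.best_estimator_  # trained on all rows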

    Regards,
    Sebastian

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,536  RM Founder
    We are actually looking into a new deployment feature for Auto Model as we speak to simplify this process of retraining etc. Stay tuned ;-)