How to make automodel doing cross-validation?

wanglu2014 · June 2018

Thank for your attention. In automodel, imported data are splited into training and validation with a ratio. However, for improve the reliability of model, can we modify the spliting process into cross-validation?

Telcontar120 · June 2018

Certainly, just open the process for the model you want and then change the process from split validation into cross validation and rerun.

Fatma · March 2019

Excuse me @Telcontar120 , I have the same question and couldn't understand from where to change the process from split validation into cross validation? I'm very sorry but I'm still beginner to the RapidMiner. I found split data block is this what you mean? if so how to split the data to for example leave on out, or k=4 in k-fold cross validation?

Telcontar120 · March 2019

No, what I meant is that once you have the process, you can select the split validation operator, and replace it with the cross-validation operator instead. This can be done by right-clicking on the split validation operator, or by manually copying the new cross-validation operator in, copying the operators out of the split validation into the cross-validation, and then deleting the split validation operator. Same results. In both cases, just make sure you have wired up the internal operators correctly. See the cross-validation tutorial from the help if you need to double-check.

rfuentealba · March 2019

Hello,

Today I was showcasing RapidMiner AutoModel to a new coworker. With the Titanic dataset, if you select a Logistic Regression (that is the case I remember, but there might be many others) there is no such thing as a Split Validation operator. Instead, the process performs a Split Data operator in an early stage and applies the Performance operators as the final ones, which is what I call the manual way to perform validation.

In that case, it is not as simple as changing the operator. (Others are, though).

My advice would be to reorder the process and understand how it works because while AutoModel is a great beginning of a data science project, it is still a beginning: our project still lacks proper documentation (it still cannot generate the documentation for our domain expertise), removal of boilerplate steps (if our dataset doesn't have text, why handling text?), and adapting the process to our use cases.

I know, this is not the kind of happy answer that magically solves our problems and having to go through the process is especially frustrating for newcomers to RapidMiner, but please focus in that RapidMiner does have a #noblackboxes philosophy that allows people to go from nought to 60 in a few seconds by having access to what the process does.

(@Telcontar120, are you having the same deja vu I had? Wasn't this the topic of our conversation when we met each other?)

Hope this helps,

Rodrigo.

IngoRM · March 2019

Yip, that's right. By the way, the way how we are doing the performance estimation in AM is actually quite clever. The reason for why cross-validation is a more robust estimator is that it reduces the dependency on a test set being "easier" or "harder" for the trained model. We do something similar in AM by training a model on the majority of the data and then creating multiple hold-out sets while removing the outliers before averaging. On 80+ data sets with more than 1,000 rows I found only two examples where the difference between this approach and the performance created with a full-blown cross-validation was statistically significant. So for all practical purposes, especially in the early phase of a data science project, the validation approach of AM is pretty much as good as a full cross-validation but 5x - 10x faster.

Don't get me wrong, I am not arguing against cross-validation, quite the opposite. I just wanted to point out that we came up with a practical approach better balancing run time with estimation robustness which in my experience works well enough for most applications.

Just my 2c,

Ingo

varunm1 · March 2019

@IngoRM this looks great. For huge datasets, this method in AM works like a gem and also seems reliable based on your test. I am a bit confused about why you used the holdout sets in the process when you are splitting data randomly. Now it's clear.

SGolbert · March 2019

Hi @IngoRM

nice to know that you looked throughly into the matter, I trust AM even more right now.

I think that once an adequate model is found in AM, one should train a new model with all the data in a new process, possibly with hyperparameter tuning.

Regards,

Sebastian

IngoRM · March 2019

We are actually looking into a new deployment feature for Auto Model as we speak so simplify this process of retraining etc. Stay tuned ;-)

Yin · September 2022

@IngoRM I see that your post is from 2019, has this been implemented yet?

yoni1961 · December 2022

@IngoRM I see that your post is from 2019, has this been implemented yet? Same question.... We have a small data set (106) and would like to use Cross-Validation.... Any more we need to know beyond your detailed (and GREAT explanation above????? (what you call my 2 cents with is much more than that????:) @Telcontar120

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

How to make automodel doing cross-validation?

Answers

Be Safe. Follow precautions and Maintain Social Distancing