

06-23-2016 10:17 AM

I have a question regarding cross validation with a linear regression model.

From my understanding, in cross validation we split the data into (say) 10 folds, train on 9 folds, and use the remaining fold for testing. We repeat this process until every fold has been used for testing exactly once.
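For concreteness, the splitting scheme described above can be sketched in a few lines of Python. This is purely illustrative bookkeeping (not RapidMiner's implementation): each fold is held out for testing exactly once while the other 9 folds form the training set.

```python
def kfold_indices(n_examples, k=10):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_examples))
    fold_size = n_examples // k
    for fold in range(k):
        start = fold * fold_size
        # the last fold absorbs any leftover examples
        stop = n_examples if fold == k - 1 else start + fold_size
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

# Every example lands in exactly one test fold:
folds = list(kfold_indices(100, k=10))
assert sorted(i for _, test in folds for i in test) == list(range(100))
```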

When we train the model on 9 folds, should we not get a different model each time (maybe slightly different from the model we would get using the whole dataset)? I know that we take an average of all the "n" performances, and I can see that clearly when I use the "Write as Text" operator.

But what about the model? Shouldn't the resulting model also be an average of all the "n" models? I see that the resulting model is the same as the model created from the whole dataset before cross-validation. If we keep the overall model even after cross-validation (rather than averaging the n models), then what is the point of calculating the average performance of n different models? They are trained on different data, so they should be different, right?

I apologize if my question is unclear or sounds silly.

Thanks for reading, though!

1 ACCEPTED SOLUTION

## Re: What about n models generated in cross validation? Should we not take avg of all models (Linear

06-23-2016 05:59 PM

Solution accepted by topic author binaytamrakar on 06-24-2016, 04:07 AM

Hi,

This is not a funny question at all - I would even go so far as to say this is probably one of the most frequently asked questions in machine learning I have heard in my life.

Let me get straight to the point here: Cross validation is **not** about model building **at all**. It is a common scheme to estimate (not calculate! - subtle but important difference) how well a **given** model will work on unseen data. So the fact that we deliver a model at the end (for convenience reasons) might lead you to the conclusion that it actually is about model building as well - but this is just not the case.

Ok, here is why this validation is an approximation of an estimation for a **given** model only: typically you want to use as much data as possible since labeled data is expensive and in most cases the learning curves show you that more data leads to better models. So you build your model on the complete data set since you hope this is the best model you can get. Brilliant! This is the **given** model from above. You could now gamble and use this model in practice, hoping for the best. Or you want to know in advance if this model is really good before you use it in practice. I prefer the latter approach ;-)

So only now (actually kind of *after* you built the model on all data) you are of course also interested in learning how well this model works in practice on unseen data. Well, the closest estimate you could do is a so-called leave-one-out validation, where you use all but one data point for training and the one you left out for testing. You repeat this for all data points. This way, the models you built are "closest" to the one you are actually interested in (since only one example is missing), but unfortunately this approach is not feasible for most real-world scenarios since you would need to build 1,000,000 models for a data set with 1,000,000 examples.
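To make the "one model per example" cost concrete, here is a toy sketch of leave-one-out validation (my own illustration, nothing RapidMiner-specific; the "model" is just the mean of the training labels):

```python
import statistics

def leave_one_out_error(labels):
    """Train len(labels) models, each missing exactly one example."""
    errors = []
    for i, held_out in enumerate(labels):
        train = labels[:i] + labels[i + 1:]   # all but one data point
        model = statistics.mean(train)        # "train" a trivial model
        errors.append(abs(held_out - model))  # test on the left-out point
    return statistics.mean(errors)            # n models for n examples
```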

Here is where cross-validation enters the stage. It is just a more feasible approximation of something which already was only an estimation to begin with (since we omitted one example even in the LOO case). But this is still better than nothing. The important thing is: it is a performance estimation for the *original* model (built on all data), and *not* a tool for model selection. If anything, you could misuse a cross-validation as a tool for example selection, but I won't go into this discussion now.

Besides this: you might have an idea how to average 10 linear regression models - but what do we do with 10 neural networks with different optimized network structures? Or 10 different decision trees? How would you average those? In general this problem cannot be solved anyway.

You might enjoy reading this older discussion where I spent more time discussing the different options besides averaging: http://community.rapidminer.com/t5/RapidMiner-Studio/Interpretation-of-X-Validation/m-p/9204

The net is: none of them is a good idea, and you should do the right thing, which is: build one model on as much data as you can and use cross-validation to estimate how well **this** model will perform on new data.
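This recommended workflow might look like the following toy sketch (again my own illustration: the "learner" just predicts the mean of the training labels, a stand-in for a real regression learner):

```python
import statistics

def train(labels):
    # stand-in learner: predict the mean of the training labels
    return statistics.mean(labels)

def cv_estimate(labels, k=10):
    """Estimate how well train()-on-all-data will do on unseen data."""
    n = len(labels)
    fold_size = n // k
    fold_errors = []
    for fold in range(k):
        start = fold * fold_size
        stop = n if fold == k - 1 else start + fold_size
        held_out = labels[start:stop]
        model = train(labels[:start] + labels[stop:])
        fold_errors.append(statistics.mean(abs(y - model) for y in held_out))
    # averaged estimate is kept; the k per-fold models are discarded
    return statistics.mean(fold_errors)

labels = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
final_model = train(labels)                 # the ONE model you actually use
estimated_error = cv_estimate(labels, k=5)  # how well it should generalize
```

Note the order of ideas: the final model sees all the data, and cross-validation exists only to attach a performance estimate to it.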

Hope that clarifies this,

Ingo

How to load processes in XML from the forum into RapidMiner: Read this!

4 REPLIES



06-24-2016 03:17 AM

I second, of course, everything Ingo said. But I would like to add one more punch line:

(Cross-)Validation is not about validating a model but about validating the method to generate a model.

Best,

Martin

--------------------------------------------------------------------------

Head of Data Science Services at RapidMiner
