"Regression problem with cross-validation"

dramhampton Member Posts: 9 Contributor II
edited May 23 in Help
Hi all

I have a concern about the output from cross-validation with regression.  The CV operator should break the data into (say) 10 folds and use each 10% in turn as the test set for a model built on the other 90%, to measure performance - but the model it reports out should be built on all the data, and the predictions should come from that all-data model.

That means that if you have a single attribute as a predictor and plot the predicted value against it, you should get a straight line.

However, I get a jerky line.  This is specific to CV; if I try the same exercise with split validation it works fine.

Am I misunderstanding the way CV works or...?
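To show what I mean, here is a minimal pure-Python sketch of the two behaviours (synthetic data and a hand-rolled least-squares fit - purely an illustration, not RapidMiner's internals): predictions from one model built on all the data lie on a single straight line, while the pooled test-set predictions from 10 folds come from 10 slightly different models.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (one predictor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

random.seed(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.5) for x in xs]

# Model built on ALL the data: its predictions lie on one straight line.
a, b = fit_line(xs, ys)
full_preds = [a * x + b for x in xs]

# 10-fold CV: each fold's test predictions come from a DIFFERENT model
# (trained on the other 90%), so the pooled test-set output is a
# patchwork of 10 slightly different lines - the "jerky" plot.
k = 10
idx = list(range(len(xs)))
random.shuffle(idx)
folds = [idx[i::k] for i in range(k)]
cv_preds = [None] * len(xs)
for fold in folds:
    train = [i for i in idx if i not in fold]
    fa, fb = fit_line([xs[i] for i in train], [ys[i] for i in train])
    for i in fold:
        cv_preds[i] = fa * xs[i] + fb

# Largest gap between the single line and the pooled CV predictions
print(max(abs(p - q) for p, q in zip(full_preds, cv_preds)))
```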

To make it easier to see the problem, I have adapted the Iris dataset to illustrate it with this process:

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="45" y="85">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.2.000" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="85">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="label.equals.Iris-virginica"/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="85">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="label"/>
        <parameter key="attributes" value="a4|a2"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.2.000" expanded="true" height="82" name="Set Role" width="90" x="447" y="85">
        <parameter key="attribute_name" value="a4"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="9.2.000" expanded="true" height="145" name="Cross Validation" width="90" x="380" y="238">
        <parameter key="split_on_batch_attribute" value="false"/>
        <parameter key="leave_one_out" value="false"/>
        <parameter key="number_of_folds" value="10"/>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="linear_regression" compatibility="9.2.000" expanded="true" height="103" name="Linear Regression" width="90" x="112" y="34">
            <parameter key="feature_selection" value="M5 prime"/>
            <parameter key="alpha" value="0.05"/>
            <parameter key="max_iterations" value="10"/>
            <parameter key="forward_alpha" value="0.05"/>
            <parameter key="backward_alpha" value="0.05"/>
            <parameter key="eliminate_colinear_features" value="true"/>
            <parameter key="min_tolerance" value="0.05"/>
            <parameter key="use_bias" value="true"/>
            <parameter key="ridge" value="1.0E-8"/>
          </operator>
          <connect from_port="training set" to_op="Linear Regression" to_port="training set"/>
          <connect from_op="Linear Regression" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_regression" compatibility="9.2.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
            <parameter key="main_criterion" value="first"/>
            <parameter key="root_mean_squared_error" value="false"/>
            <parameter key="absolute_error" value="false"/>
            <parameter key="relative_error" value="false"/>
            <parameter key="relative_error_lenient" value="false"/>
            <parameter key="relative_error_strict" value="false"/>
            <parameter key="normalized_absolute_error" value="false"/>
            <parameter key="root_relative_squared_error" value="false"/>
            <parameter key="squared_error" value="false"/>
            <parameter key="correlation" value="false"/>
            <parameter key="squared_correlation" value="true"/>
            <parameter key="prediction_average" value="false"/>
            <parameter key="spearman_rho" value="false"/>
            <parameter key="kendall_tau" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <connect from_op="Performance" from_port="example set" to_port="test set results"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="581" y="238">
        <list key="application_parameters"/>
        <parameter key="create_view" value="false"/>
      </operator>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
      <connect from_op="Cross Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Cross Validation" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Cross Validation" from_port="test result set" to_port="result 3"/>
      <connect from_op="Cross Validation" from_port="performance 1" to_port="result 4"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
      <connect from_op="Apply Model (2)" from_port="model" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="210"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="63"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>


Many thanks for your help

David

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,609 Community Manager
    Hi David -

    Yes of course you should get a straight line plotting predicted(a4) vs a2, which I get when I run your process. Where do you see a jerky line?




    Scott
    varunm1
  • dramhampton Member Posts: 9 Contributor II
    Oops - I forgot to mention something!  I added an additional Apply Model operator after Cross Validation to show what you should get, and that produces the straight line.  Now disable this second Apply Model and you will see the direct output from CV.  Many thanks Scott!
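The workaround above, sketched in plain Python for readers landing here later (toy data and a hand-rolled least-squares fit - purely illustrative): the model delivered by Cross Validation's model port is built on all rows, so applying it to the full input with a second Apply Model yields predictions that fall on one straight line.

```python
import random

def fit_line(xs, ys):
    # ordinary least squares for y = a*x + b (one predictor)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

random.seed(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.5) for x in xs]

# Step 1 ("mod" output of Cross Validation): one model, trained on ALL rows.
a, b = fit_line(xs, ys)

# Step 2 (the extra Apply Model): score the full input with that model.
preds = [a * x + b for x in xs]   # these all lie on one straight line
```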
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,270 Unicorn
    Yes @sgenzer I think this would be a very helpful KB article. This question does come up a lot!
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
    sgenzer
  • dramhampton Member Posts: 9 Contributor II
    Many thanks Scott.  That's cracked it.  The workaround to insert a new Apply Model operator will work well and I will be able to explain to people why it is needed.  Very helpful!
    DH
    sgenzer
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,609 Community Manager
    great. Glad that helped. I'd like to use this article for other purposes so please provide suggestions if something is not clear. Same of course for everyone else... @Telcontar120 :wink:
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,270 Unicorn
    @sgenzer this looks great to me...I think that color shading on the "tes" output results really clarifies things.
    Of course one might suggest that having another output for the true scored output from the final cross validation model would be a nice enhancement to the cross-validation operator, but that's another discussion!
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
    sgenzer
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,188 RM Data Scientist
    how so? There is no way to apply the final model on the training data.
    BR,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,270 Unicorn
    @mschmitz what do you mean?  It's mechanically possible, in the sense that you can accomplish the same thing simply by outputting or storing the final model from cross-validation, then applying the model on the full dataset used as the cross-validation input (just as noted earlier in the forum thread).  So I am not sure what you mean by "there is no way to apply the final model on the training data".  We could debate whether this is a useful thing to have or not, but I think it is definitely a possible thing to produce.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,609 Community Manager
    all good points - it's always the same challenge of how much to bundle into one operator. Do you build in Apply Model to see the model applied on the entire data set, or leave it as is? I would advocate for the latter. But a better question is why we port the testing output at all. Does it serve any purpose? And yet if the purpose of Cross Validation is purely to find a true estimate of performance, why do we port the model at all? But then you get into this world which does NOT seem "fast and simple"...



    You could even ask (and I think it's a legitimate question) why the Apply Model needs to be inserted manually on the Testing side of Cross Validation. Is there ever a situation when you do NOT? Wisdom of Crowds shows that people insert it 100% of the time :smiley:



    Call me crazy but I have a hunch that @RalfKlinkenberg and @IngoRM grappled with these questions a long time ago and likely have good reasons for setting it up this way. Not saying it cannot be changed...just giving these guys the benefit of the doubt that there is a good rationale for doing it the way it's done here.

    Great discussion this morning!

    Scott

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,188 RM Data Scientist
    exactly - this is statistically not sound. You cannot trust scores produced this way; the results may be overtrained.
    BR,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,270 Unicorn
    @mschmitz I completely agree with your point about overfitting, as you should probably already know from our many earlier discussions about this topic :smile: If the main purpose of the output would be to assess performance then it is not nearly as useful as the cross-validation performance output, which is already coming out of the operator.

    However, there are other reasons to want to review the scores on the entire input set - for example, if you want to look at score distributions and measure potential score drift over time, you typically start with the baseline of the scores from the original development sample as a comparison point for later samples.  Or, as in another recent thread, the user wanted to confirm the threshold value being applied.  In fact I recall an earlier bug in one of the learners (logistic regression perhaps) that was only caught because of a similar analysis of scores on the full population.

    @sgenzer I also agree that this is not at all an urgent issue, but simply because it has been handled one way in RapidMiner in the past doesn't mean it could not use improvement. There are lots of things that have changed in RapidMiner over the years, and it is always worth discussing the merits of any specific idea for future changes.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
    sgenzer
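As an aside, the baseline-vs-later comparison described above is often quantified with a Population Stability Index over score deciles. A minimal sketch with synthetic scores (the data, the bin count, and any cut-off such as "PSI above 0.1 signals drift" are illustrative conventions, not from this thread):

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between two score samples.
    Bin edges are (approximate) deciles of the expected/baseline sample."""
    qs = sorted(expected)
    edges = [qs[int(len(qs) * i / bins)] for i in range(1, bins)]

    def share(scores):
        counts = [0] * bins
        for s in scores:
            counts[sum(1 for e in edges if s > e)] += 1
        # floor each share to avoid log(0) on empty bins
        return [max(c / len(scores), 1e-6) for c in counts]

    e, a = share(expected), share(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

random.seed(1)
baseline = [random.gauss(0.50, 0.1) for _ in range(1000)]  # dev-sample scores
later    = [random.gauss(0.55, 0.1) for _ in range(1000)]  # drifted scores
print(psi(baseline, baseline), psi(baseline, later))
```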
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,188 RM Data Scientist
    @Telcontar120 but where is the problem with the tes port? That gives you a fair estimate of these distributions

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,270 Unicorn
    @mschmitz they may provide a fair estimate but are not actually generated using the same model.  So from a compliance perspective, they may not be sufficient.  There are many regulated industries in the US where this would not be an acceptable starting point for model performance tracking.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts