
Interpretation of labeled data after cross-validation

328815dh Member Posts: 3 Contributor I
edited December 2018 in Help

Dear all,

I am having trouble interpreting the exported labeled data of the cross-validation operator. Nested inside it are either a regression model or a neural net model (we are trying to compare performance).

However, when I use this method (the third output port of the Cross Validation operator, test), the output contains both the actual and the predicted value for every row in the dataset.

Are these predictions generated iteratively during the folds (and thus each based on a different model), or are they the result of the best-performing model being run on the entire set?

I hope you can clarify this, and also that it has not been answered many times already. I did search the forums but could not find it.

Thanks a lot in advance.

Best Answer

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,089  RM Data Scientist
    Solution Accepted

    Hi,


    It is this one:

    "Are these predictions being iteratively generated during the folds (and thus each based on a different model)?"

    The other options are not possible. Keep in mind that cross-validation does not return "the best" fold model as its result, but the model built on the full data set. You cannot apply that model back to the same data to measure performance, and you also cannot apply "the best" fold model to the full data set, because part of that data was in its training set.
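    A minimal sketch of what "iteratively generated during the folds" means, in plain Python with NumPy (a stand-in for the RapidMiner operator, on made-up data): each row's prediction in the labeled output comes from the one fold model that did not see that row during training.

    ```python
    import numpy as np

    # Made-up regression data (hypothetical stand-in for the example set).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

    k = 10
    folds = np.array_split(np.arange(len(y)), k)
    predictions = np.empty_like(y)

    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        # Fold model: ordinary least squares fit on the other k-1 folds.
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        # Held-out rows are predicted by a model that never saw them.
        predictions[test_idx] = X[test_idx] @ coef

    # Every row is predicted exactly once, each by a different fold's model.
    ```

    So the labeled output is stitched together from k different models; no single "best" model ever scores the whole set.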


    Best,

    Martin


    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany

Answers

  • 328815dh Member Posts: 3 Contributor I

    Dear Martin,

    That already clarifies a lot. However, I still have difficulty understanding what the output of the model port is. You describe it as the model built on the full dataset, which is also what the documentation states. Does that mean that after the 10 folds (used for calculating the average performance), the operator does one more training iteration on the full dataset and tests on that same full dataset?

    Also thank you very much for replying so quickly.

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,760   Unicorn

    Yes, that is correct. Cross Validation will iterate over 10 randomly selected subsets of data (if k = 10) and then do a full training on the entire dataset and deliver that model to the MOD port. 

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,625   Unicorn

    However, for clarity: the model output is trained on the full dataset, but the reported performance is not the performance of that full-data model. It is the average performance across the k folds of the cross-validation. There is no way to train a model on the full dataset and also report its performance on a separate test sample, since no records would be left over for one.
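    That split of responsibilities can be sketched in the same hypothetical NumPy setup as above (a stand-in, not RapidMiner itself): the reported performance is the average over the k held-out fold scores, while the delivered model is a separate, final fit on all rows.

    ```python
    import numpy as np

    # Made-up regression data (hypothetical stand-in for the example set).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

    k = 10
    folds = np.array_split(np.arange(len(y)), k)
    fold_rmse = []

    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        resid = y[test_idx] - X[test_idx] @ coef
        fold_rmse.append(np.sqrt(np.mean(resid ** 2)))

    # Reported performance: the average of the k held-out fold scores.
    reported_rmse = np.mean(fold_rmse)

    # Delivered model: a fresh fit on ALL rows (what the model port returns);
    # its performance on its own training rows is never what gets reported.
    final_coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    ```

    The averaged fold scores estimate how a model trained this way generalizes, and that estimate is then attached to the full-data model.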


    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • 328815dh Member Posts: 3 Contributor I

    Hi Thomas & Brian,

    Now it is completely clear to me, or at least what is being output. I almost always have trouble judging the 'validity' of implementing a certain step, but that's for another time.


    Really appreciated the help!
