Question: Cross-Validation

Hello everyone,

While learning RapidMiner, I came across X-Validation (which is a useful thing!), but how exactly does it work?

Let's assume we've got a data set of 100 examples, want to build a decision tree, and the number of validations is 10.

There are (at least) 2 possibilities:
a) The output model is the decision tree built on all 100 examples, but the performance is estimated from trees that are each trained on 90 examples and tested on the remaining 10 examples (so these trees might always be different than the actual output tree!)
b) The output model is the decision tree built on all 100 examples, and the performance is measured by testing that output tree on each of the 10 subsets of 10 examples.

After reading the description of X-Validation, I think a) is correct, but b) makes more sense, since the decision tree in a) might always be different than the actual output tree.

Which alternative is correct, and if it is a), am I right that the tree might always be different?

Cheers Q-Dog


    Hi Q-Dog,

    The cross validation operator indeed works like option a). Option b) does not make any sense at all: there you would evaluate on the training data, which is exactly what cross validation is meant to avoid!

    In short: don't confuse error estimation (done by cross validation) with model generation. Models are also built inside cross validation, one per fold, but they only serve to estimate the performance and are not useful to talk about on their own. The model you actually use is built outside of the cross validation, on the complete data. The model port of the operator is just there for convenience, so that you get both the model and the estimated performance for this type of model.
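
For readers who prefer code, here is a minimal sketch of option a) in plain Python. It is not RapidMiner's implementation: the learner is a hypothetical one-threshold "stump" standing in for the decision tree, the 100 examples are synthetic, and the names `fit_stump`, `accuracy` and `cross_validate` are made up for illustration.

```python
import random


def fit_stump(data):
    """Fit a depth-1 tree on (x, label) pairs: predict 1 if x > threshold."""
    xs = sorted(x for x, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    # Keep the threshold that makes the fewest errors on the data it was given.
    return min(candidates, key=lambda t: sum((x > t) != bool(y) for x, y in data))


def accuracy(threshold, data):
    """Fraction of examples in data that the threshold classifies correctly."""
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)


def cross_validate(data, k=10):
    """Return the averaged held-out accuracy and the k fold models."""
    folds = [data[i::k] for i in range(k)]  # k disjoint test sets
    thresholds, scores = [], []
    for i, test in enumerate(folds):
        # Train on everything except the current test fold.
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        t = fit_stump(train)
        thresholds.append(t)
        scores.append(accuracy(t, test))  # tested only on unseen examples
    return sum(scores) / k, thresholds


# Synthetic data set (an assumption): 100 examples with about 10% label noise.
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    data.append((x, label))

estimate, fold_thresholds = cross_validate(data, k=10)  # error estimation
final_model = fit_stump(data)                           # model generation

print("estimated accuracy:", estimate)
print("fold models:", [round(t, 3) for t in fold_thresholds])
print("output model:", round(final_model, 3))
```

Each of the ten fold models is trained on 90 examples and scored on the 10 it has not seen, and the average of those scores is the performance estimate. The output model is fitted separately on all 100 examples, so the fold thresholds need not match it exactly.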

    Wow thanks Ingo!

    Especially the two links made it crystal clear to me :)
