"[SOLVED] Question: Cross-Validation"

Q-Dog Member Posts: 32 Contributor II
edited June 2019 in Help
Hello everyone,

While learning RapidMiner, I came across X-Validation (which is a useful thing!), but how exactly does it work?

Let's assume we have a data set of 100 examples, want to build a decision tree, and set the number of validations to 10.

There are (at least) 2 possibilities:
a) The output model is the decision tree built on all 100 examples, but the performance is estimated from trees that are each trained on 90 examples and tested on the remaining 10 (so those trees might always be different from the actual output tree!)
b) The output model is the decision tree built on all 100 examples, and the performance is measured by testing that same output tree on each of the 10 subsets of 10 examples.

After reading the description of X-Validation, I think a) is correct, but b) makes more sense to me, since the trees evaluated in a) might always be different from the actual output tree.

Which alternative is correct, and if it is a), am I right that the trees might always be different?

Cheers Q-Dog

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi Q-Dog,

    The cross validation operator indeed works like option a). Option b) does not make any sense at all: there you would evaluate on the training data, which is exactly what you do not want to do with cross validation!

    For more information read my first answer in the following thread:

    http://rapid-i.com/rapidforum/index.php/topic,62.0.html

    And also Steffen's answer in this thread:

    http://rapid-i.com/rapidforum/index.php/topic,959.msg3598.html

    In short: don't confuse error estimation (done by cross validation) with model generation. Model generation happens within cross validation, where talking about the individual models is not useful, and once more outside of the cross validation on the complete data. The model port of the operator is just there for convenience, so that you get both the model and the estimated performance for this type of model.

    Cheers,
    Ingo
  • Q-Dog Member Posts: 32 Contributor II
    Wow, thanks Ingo!

    Especially the two links made it crystal clear to me :)
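
For readers who prefer code, here is a minimal sketch of option a) as described in the answer above. It is not RapidMiner's implementation: it is plain Python, a one-split decision stump stands in for the decision tree, and the names `train_stump`, `predict`, `cross_validate` and the synthetic data are all hypothetical choices made for illustration. The point it tries to show is that the performance estimate comes from 10 models trained on 90 examples each, while the delivered model is trained separately on all 100 examples.

```python
import random


def train_stump(examples):
    """Fit a one-split decision stump (stand-in for a decision tree).

    examples: list of (x, label) pairs with label in {0, 1}.
    Returns the threshold t of the rule "predict 1 if x >= t else 0".
    """
    best = None
    for threshold in sorted({x for x, _ in examples}):
        # Count training errors of the rule for this candidate threshold.
        errors = sum((x >= threshold) != bool(y) for x, y in examples)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]


def predict(threshold, x):
    return 1 if x >= threshold else 0


def cross_validate(examples, k=10, seed=0):
    """Return (final_model, estimated_accuracy, fold_models)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k disjoint subsets

    correct = 0
    fold_models = []
    for i, test_fold in enumerate(folds):
        # Train on every fold except the i-th, test on the i-th only.
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_stump(train)
        fold_models.append(model)
        correct += sum(predict(model, x) == y for x, y in test_fold)

    accuracy = correct / len(examples)
    # The delivered model is trained once more on the complete data;
    # it is never the model that was evaluated above.
    final_model = train_stump(examples)
    return final_model, accuracy, fold_models


# Hypothetical data: 100 examples, true split at 0.5, 10% label noise.
rng = random.Random(42)
data = []
for _ in range(100):
    x = rng.random()
    label = 1 if x >= 0.5 else 0
    if rng.random() < 0.1:
        label = 1 - label
    data.append((x, label))

final_model, accuracy, fold_models = cross_validate(data, k=10)
print("threshold of delivered model:", final_model)
print("distinct thresholds of fold models:", sorted(set(fold_models)))
print("estimated accuracy:", accuracy)
```

The fold models may or may not coincide with the delivered model, which matches the concern in the question. In this reading, the estimate describes how well this type of model is expected to perform, not one particular tree.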