Decision Tree + Validation: different validation methods yield different trees.


Hello! I have a dataset with about 500 examples and 50 attributes. I'm trying to use the Decision Tree operator nested in either a 5- or 10-fold X-Validation operator or a 5-fold bootstrapping validation operator. I let the tree grow for 10 steps without pre-pruning. This results in two trees that are similar on the top levels but quite different from level 4 onwards. Could someone please explain how the validation method affects the tree structure? And what is the criterion for deciding which tree to pick? Many thanks!

Re: Decision Tree + Validation: different validation methods yield different trees.

Hi maverik,

you used different evaluation functions, so different performance values for your tree are to be expected.

X-Validation splits the input set into as many partitions as the number of folds specified. It then trains a model on n-1 of those partitions and tests it on the remaining one. This is repeated n times, so that every partition serves as the test set exactly once.

Bootstrapping builds the training set according to the ratio parameter: if the input data set has n examples and the ratio is 0.8, the resulting training set will have n*0.8 examples. The examples are chosen randomly with replacement, so the training set will very likely contain duplicate entries. All examples that are not part of the training set are put into the test set, which also means that for very high ratio settings the test set may be empty.

Both operators deliver a performance value calculated as the average over their validation runs.
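To make the difference concrete, here is a minimal Python sketch of the two splitting schemes described above. The function names (`kfold_indices`, `bootstrap_split`) are my own for illustration; they are not RapidMiner operators.

```python
import random

def kfold_indices(n, k):
    """Split indices 0..n-1 into k disjoint folds, as X-Validation does.
    Each fold serves as the test set once; the rest form the training set."""
    idx = list(range(n))
    random.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def bootstrap_split(n, ratio):
    """Draw n*ratio training indices WITH replacement (duplicates likely);
    every example never drawn goes into the test set."""
    train = [random.randrange(n) for _ in range(int(n * ratio))]
    test = [i for i in range(n) if i not in set(train)]
    return train, test
```

Note that with a high `ratio` the bootstrap test set can come out empty, exactly as described above, while the k folds always cover every example exactly once.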
However, the model output of both operators delivers a model trained on the whole data set. As long as both tree learners are configured identically (the standard decision tree does not use random values), you should therefore receive the same model. Which validation method is best suited depends on your data set and the domain it comes from. In most cases cross-validation is a good choice, since it guarantees that every single example of your data set has been part of a testing procedure. 10-fold cross-validation is particularly common and can be seen as a kind of industry standard.
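The point that the delivered model is trained on the full data set, while the performance value is only an average over the folds, can be sketched in a few lines of Python. The interfaces (`train_fn`, `eval_fn`) are hypothetical stand-ins for any learner, not a real RapidMiner API.

```python
import random

def cross_validate_then_fit(data, labels, train_fn, eval_fn, k=10):
    """Estimate performance with k-fold CV, then return a model
    trained on ALL of the data, mirroring the operator's model output.

    train_fn(X, y) -> model; eval_fn(model, X, y) -> score (hypothetical)."""
    n = len(data)
    idx = list(range(n))
    random.shuffle(idx)
    scores = []
    for i in range(k):
        test_idx = set(idx[i::k])
        train_idx = [j for j in idx if j not in test_idx]
        model = train_fn([data[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        scores.append(eval_fn(model,
                              [data[j] for j in test_idx],
                              [labels[j] for j in test_idx]))
    avg_score = sum(scores) / k          # averaged performance value
    final_model = train_fn(data, labels)  # model delivered to the output port
    return avg_score, final_model
```

So two differently configured validation operators can report different performance estimates while still delivering the same final model, provided the inner learner is deterministic.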