
Performance Vector of Decision Trees

auxilium Member Posts: 1 Contributor I
edited October 2019 in Help
Hello,

I think I have some trouble understanding the performance vector of a decision tree.

I have a training data set with 16 records, each labelled negative or positive.
I created a process, and RapidMiner built a decision tree that classifies every record correctly. (I even checked every record manually myself.)
Now I'd like the system to check the performance, so I added a "nominal cross validation".

The system then reproduces the same tree, but the performance vector of this tree says that neither recall nor precision is 100%.

What's the reason for it?

I've checked the data set manually, and the decision tree seems to be all right for that specific data set. But when I use this validation, it says it isn't?

I don't understand this at the moment.

Would you be so kind as to try to explain it to me?

Regards

auxilium

Answers

  • Timbo Member Posts: 14 Contributor II
    Hi,

    Given the little information you have provided, the result is not too surprising to me. Depending on how many examples you use for training and testing, statistical fluctuations may have a significant influence on the decision tree, both during training and during testing. Maybe you could post the confusion matrix? Then run a Split Validation and post its confusion matrix as well; perhaps one can gain some more insight from that.
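
    For illustration, here is a minimal sketch of what such a confusion matrix looks like, written in Python with scikit-learn rather than RapidMiner, and with random data standing in for your 16 examples (so the numbers themselves are made up):

    ```python
    # Sketch only: random attributes with a balanced label stand in
    # for the 16 labelled records.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    rng = np.random.default_rng(0)
    X = rng.normal(size=(16, 3))       # 16 records, 3 attributes (hypothetical)
    y = np.array([0] * 8 + [1] * 8)    # 8 negative, 8 positive

    # Collect the predictions made on each held-out fold, then summarize
    # them in a single confusion matrix.
    y_pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=4)
    print(confusion_matrix(y, y_pred))
    ```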

    Timbo
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hey auxilium. At first you tested the decision tree on the training data, i.e. the tree "knows" all the examples you classified. With the X-Validation the tree is created on only, let's say, 90% of the data and then applied to the other 10%. Thus it sees new examples which it didn't know before. Because of that it classifies some examples wrongly; this is called the generalization error. Since you usually apply a model to new data, the performance of the X-Validation is the one you should trust.
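
    To make the difference concrete, here is a minimal sketch in Python with scikit-learn (not RapidMiner; the data is made up), showing perfect performance on the training data but lower cross-validated performance:

    ```python
    # Sketch only: random attributes with a balanced label stand in for the data.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(16, 3))
    y = np.array([0] * 8 + [1] * 8)

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    print("accuracy on training data:", tree.score(X, y))  # 1.0: the tree "knows" every example

    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=4)
    print("cross-validated accuracy:", scores.mean())      # lower: the generalization error
    ```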

    You said that the very same tree is created. That's true for the "model" output of the X-Validation, since that output is built on all of the data. But as stated above, in each iteration the X-Validation builds a tree on the current subset, which usually differs from the model built on all the data. Try setting a breakpoint inside the X-Validation to see this.
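
    Continuing the sketch above (same hypothetical data), you can inspect the per-fold trees and see that they usually differ from the tree grown on all of the data, much like a breakpoint inside the X-Validation would show:

    ```python
    # Sketch only: compare the tree of each fold with the tree on all data.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import StratifiedKFold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(16, 3))
    y = np.array([0] * 8 + [1] * 8)

    for i, (train_idx, _) in enumerate(StratifiedKFold(n_splits=4).split(X, y)):
        fold_tree = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        print(f"fold {i}: depth={fold_tree.get_depth()}, leaves={fold_tree.get_n_leaves()}")

    full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    print(f"all data: depth={full_tree.get_depth()}, leaves={full_tree.get_n_leaves()}")
    ```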

    Best regards, Marius