
The Decision Tree gave an impossible result

MarkusW Member Posts: 22 Contributor I
edited October 2021 in Help



I just trained a model using a Decision Tree that reached an F-score of 99.7%.
Which sounds good until you hear that Naive Bayes only got 66.4%.
The highest score I found on that dataset was 98.2%, using deep learning.
The highest CREDIBLE score I found on that dataset was 78.5%.

The design is based on this video:


All I did was replace the Naive Bayes operator inside the Cross Validation with the Decision Tree operator.
Even with 10-fold cross-validation I should still not get much more than 70%...

The immediate cause of the high score is that, for some reason, there is a strong correlation between the label and the id; however, I do not know how to limit which columns the algorithm uses.
The question is: what did I do wrong, and how do I make it right?
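For anyone who wants to reproduce the symptom outside RapidMiner, here is a rough Python/scikit-learn sketch of what I think is happening (the data and column names are made up for illustration). If the id dominates the tree's feature importances, the 99.7% is leakage, not a good model:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for the real dataset: the label is sorted, so the
# id (basically the row number) separates the classes almost perfectly.
df = pd.DataFrame({
    "id": np.arange(1000),
    "feature": np.random.randn(1000),  # a genuinely uninformative column
    "label": [0] * 500 + [1] * 500,
})

X, y = df.drop(columns="label"), df["label"]
tree = DecisionTreeClassifier().fit(X, y)

# If 'id' gets nearly all the importance, the tree memorised row order.
for name, importance in zip(X.columns, tree.feature_importances_):
    print(name, round(importance, 3))
```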

Best Answer

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Solution Accepted
    Often it's just because the two sets for the two classes got appended. Is it the case that the first half of the data set is true and the second half is false?

    Otherwise: Often ids correlate with dates, which correlate with the label.
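    To see why that blows up the score, here is a minimal sketch (Python/scikit-learn rather than RapidMiner, with made-up data): when the two class sets are appended, a single split on the id separates them, and even 10-fold cross-validation reports a near-perfect score.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Two class sets appended: first half one class, second half the other.
ids = np.arange(1000).reshape(-1, 1)  # the id is the only "feature"
y = np.array([0] * 500 + [1] * 500)

# One split on the id (roughly "id < 500") separates the classes, so
# cross-validation reports a near-perfect score even though the id
# carries no real information about future data.
print(cross_val_score(DecisionTreeClassifier(), ids, y, cv=10).mean())
```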

    What you want to do is either use Select Attributes and remove the id, or use Set Role and set the role of the id to id.
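    In code terms, the Select Attributes route simply amounts to dropping the column before the learner ever sees it. A minimal sketch, again in Python/scikit-learn with made-up data rather than RapidMiner itself:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "id": np.arange(1000),
    "feature": np.random.randn(1000),
    "label": [0] * 500 + [1] * 500,
})

# Equivalent of Select Attributes: remove the id from the feature set.
X = df.drop(columns=["id", "label"])
y = df["label"]

# The score falls back to chance here, because this toy data has no
# real signal left once the leaking id is gone.
print(cross_val_score(DecisionTreeClassifier(), X, y, cv=10).mean())
```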

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Did you look at the tree? What is it doing?

    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • MarkusW Member Posts: 22 Contributor I
    I probably should have, before running it again with Random Forest to see if the problem persists...
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi,

    look at the decision tree. Maybe you left an attribute in the data that correlates strongly with the label but wouldn't be available for future data.

    Is the tree complex? Are the decisions obvious? 

    You can put breakpoints on various parts of the process (I'd try the Decision Tree and Performance operators) to look at the different validation steps.
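    If it helps to see it outside the GUI: a minimal sketch (Python/scikit-learn, made-up data) of what "looking at the tree" reveals when the row number leaks the label:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data where the row number leaks the label, as suspected here.
X = np.column_stack([np.arange(200), np.random.randn(200)])
y = np.array([0] * 100 + [1] * 100)

tree = DecisionTreeClassifier().fit(X, y)

# A one-split tree on 'id' (something like "id <= 99.50") is the
# smoking gun: the model memorised row order, not the problem.
print(export_text(tree, feature_names=["id", "feature"]))
```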

    Regards,
    Balázs
  • MarkusW Member Posts: 22 Contributor I
    I can say with certainty that the only thing that correlates even remotely as strongly with the correct label is the label itself.
    I believe that if I had made the mistake of letting the program use the label column to predict the label column, Naive Bayes would also have had an incredibly high F-score.
    The Decision Tree had very few settings I could actually change. My best guess is that I should have either used a "different" Decision Tree operator, if there are multiple, or that the 10-fold cross-validation somehow works differently depending on the learning algorithm, and I should have changed settings there.
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi!

    If this happens again, look at the stepwise execution results. If you get a very simple tree or unbelievable performance results in different executions, the breakpoints help you identify the problem.

    Sometimes multiple attributes together correlate with the result, but not individually. A Decision Tree might be better at catching some of these situations.
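    A classic toy case of this, sketched in Python/scikit-learn with made-up XOR-style data: each attribute is useless on its own, but together they determine the label, which a tree can exploit and Naive Bayes cannot:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.uniform(size=(2000, 2))
# The label depends on both attributes together (XOR), not on either alone.
y = ((X[:, 0] > 0.5) ^ (X[:, 1] > 0.5)).astype(int)

print(cross_val_score(GaussianNB(), X, y, cv=10).mean())              # ~0.5
print(cross_val_score(DecisionTreeClassifier(), X, y, cv=10).mean())  # ~1.0
```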

    Regards,
    Balázs
  • MarkusW Member Posts: 22 Contributor I
    OK, even though the correlation is not supposed to be nearly that strong, it is still unwanted that most of the splits in the tree appear to be on the id of the dataset.
    My guess is that if I forbid it from doing that, I'd get much more realistic results.
    I assume I do that with the "Set Role" operator, but I don't know how.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    I believe that if I had made the mistake of letting the program use the label column to predict the label column, Naive Bayes would also have had an incredibly high F-score.


    That's not true. A Naive Bayes algorithm in particular can be confused very quickly by the other 'noise' attributes. This is not the case for a Decision Tree.


    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • MarkusW Member Posts: 22 Contributor I
    Yes, apparently there is a weirdly strong correlation between "ID" (basically just the line number) and the label. I just need to find out how to exclude this column from the ones the algorithm is allowed to use.
    Help is welcome.
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    You can set the role of this column to "id" using Set Role. If you already have an attribute with the role id, just enter a second role name (e.g. ItemID). Everything marked with a special role, custom or built-in, is excluded from modeling.
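    In code terms (a Python/pandas sketch with made-up data, not RapidMiner itself), the Set Role idea is: keep the id attached to every row for identification, but outside the columns the learner is allowed to use:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ItemID": np.arange(1000),         # hypothetical id column
    "feature": np.random.randn(1000),
    "label": [0] * 500 + [1] * 500,
})

# Equivalent of Set Role -> id: the id stays attached to each row
# (as the index) but is no longer part of the feature columns.
df = df.set_index("ItemID")
X, y = df.drop(columns="label"), df["label"]
print(X.columns.tolist())  # ['feature'] -- the id can't leak anymore
```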