Issue found in feature weights of Random Forest for regression

marcin_blachnik Member Posts: 61 Guru
edited September 2020 in Help
There seems to be an issue or a bug in the feature weights returned by the Random Forest operator, but only for regression. I found the problem on one dataset and reproduced it on the Iris dataset, for which attributes a3 and a4 are the most important; according to the regression Random Forest, however, these two attributes are the least important.
I evaluated other implementations of Random Forest for regression, which return the expected (correct) weights.
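For comparison, a minimal sketch of what another implementation reports; it assumes scikit-learn's RandomForestRegressor stands in for the external implementation and that the Iris class label (0, 1, 2) is used directly as a numeric regression target.

```python
# Minimal sketch, not the RapidMiner process: scikit-learn's
# RandomForestRegressor stands in for "another implementation", and the
# Iris class label (0, 1, 2) is used directly as a numeric target.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor

data = load_iris()
X, y = data.data, data.target          # y = 0/1/2, treated as numeric

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

# Impurity-based feature importances: a3 (petal length) and a4 (petal width)
# should clearly dominate a1 and a2.
for name, imp in zip(data.feature_names, rf.feature_importances_):
    print(f"{name:20s} {imp:.3f}")
```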

Best regards
Marcin

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    I submitted a bug report quite some time ago regarding the Random Forest weights. It looks like it may still be unfixed, and this is another example of the same underlying issue.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • marcin_blachnik Member Posts: 61 Guru
    Hi

    I'm surprised that such reports are ignored. Many people use Random Forest weights as a feature importance indicator and make serious decisions based on them.
    It would also be nice if someone from RapidMiner answered "thank you, we will analyze the reported issue", but there has been no response.

    Below I attach another process in which the attribute containing pure noise is the second most important variable according to the RapidMiner implementation of Random Forest (the most important one also appears to be selected by chance). Because the trees are simple (5 trees of depth 5), one can count how many times each attribute appears as a decision node, as in the sketch below; by that count the noise variable is the least important.
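    A rough sketch of that counting check, using scikit-learn instead of RapidMiner (the appended noise column, the 5-tree/depth-5 settings, and the numeric Iris target are assumptions made to mirror the attached process):

    ```python
    # Minimal sketch (assumption): reproduce the "count decision nodes" check
    # with scikit-learn -- 5 trees of depth 5, one extra column of pure noise
    # appended to Iris, label used as a numeric regression target.
    import numpy as np
    from collections import Counter
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X, y = load_iris(return_X_y=True)
    X = np.hstack([X, rng.normal(size=(X.shape[0], 1))])   # a5 = pure noise

    rf = RandomForestRegressor(n_estimators=5, max_depth=5, random_state=0)
    rf.fit(X, y)

    # Count how many internal (splitting) nodes use each attribute.
    counts = Counter()
    for tree in rf.estimators_:
        features = tree.tree_.feature          # negative values mark leaves
        counts.update(f for f in features if f >= 0)

    for i in range(X.shape[1]):
        print(f"a{i + 1}: used in {counts.get(i, 0)} split nodes")
    ```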

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,
    I have the odd feeling that the weight generation does not take the number of examples into account, but just sums the gain per node. Would that explain the behaviour?
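    A minimal sketch of that hypothesis (illustration only, using a scikit-learn tree rather than RapidMiner's code): the "weighted" variant scales each node's impurity decrease by the fraction of examples reaching the node, while the "unweighted" variant just sums the raw per-node gains.

    ```python
    # Illustration of the hypothesis only, not RapidMiner's actual code.
    # Both variants walk a fitted scikit-learn tree; only "weighted" accounts
    # for the number of examples reaching each splitting node.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeRegressor

    def importances(tree, n_features, weighted=True):
        t = tree.tree_
        total = t.weighted_n_node_samples[0]
        imp = np.zeros(n_features)
        for node in range(t.node_count):
            if t.children_left[node] == -1:        # leaf node, no split
                continue
            left, right = t.children_left[node], t.children_right[node]
            n = t.weighted_n_node_samples[node]
            nl = t.weighted_n_node_samples[left]
            nr = t.weighted_n_node_samples[right]
            # impurity decrease produced by this split
            gain = t.impurity[node] - (nl * t.impurity[left] + nr * t.impurity[right]) / n
            imp[t.feature[node]] += (n / total) * gain if weighted else gain
        return imp / imp.sum()

    X, y = load_iris(return_X_y=True)
    dt = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
    print("weighted:  ", importances(dt, X.shape[1], weighted=True).round(3))
    print("unweighted:", importances(dt, X.shape[1], weighted=False).round(3))
    ```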

    ~Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • marcin_blachnik Member Posts: 61 Guru
    Hi

    I haven't checked the source code, but I have a feeling that the problem is deeper. In the example from my previous post, where the Random Forest consists of 5 trees, the noise attribute A5 appears only twice in the trees, while A3 and A4 appear most often. For classification the weights work correctly, so the issue may be related to the regression criterion and its properties.
    Nevertheless, it would be great if RapidMiner corrected this in the upcoming release.

    Best regards
  • gmeier Employee, Member Posts: 25 RM Engineering
    edited January 2021
    Thank you for the bug report. We found the problem and fixed it; the fix will be part of the next release.