Variable importance measure

Legacy User Member Posts: 0 Newbie
edited November 2018 in Help
Hi,

in general, the learner "random forests" provides an algorithm to
measure the importance of the predictor variables. It works as
follows:

"Variable importance: This is a difficult concept to
define in general, because the importance of a
variable may be due to its (possibly complex)
interaction with other variables. The random
forest algorithm estimates the importance of a
variable by looking at how much prediction error
increases when OOB (out-of-bag) data for that variable
is permuted while all others are left unchanged.
The necessary calculations are carried
out tree by tree as the random forest is
constructed."

It's described in Breiman's (the developer of random forests)
paper [1] and is, for example, implemented in the GNU R
randomForest package. For each variable, the package reports
the mean decrease in the Gini index, which indicates how important
that variable is for the classification. It's a very nice feature,
and the results can be drawn as a bar chart.

Is this "variable importance measure" also possible in
RapidMiner's RandomForest? I couldn't find it anywhere.
Or can the variable importance be estimated in a different
way using RapidMiner?

Thank you for your help.

Best regards,
Paul

[1] L. Breiman. Manual on setting up, using and understanding
random forests.

Answers

  • steffen Member Posts: 347 Maven
    Hello

    As far as I understand the posted text (I admit I haven't read the paper), variable importance is equivalent to the splitting criterion used when constructing a decision tree. In RapidMiner, this parameter is named "criterion" for both RandomTree and RandomForest.

    Note that accuracy and prediction error measure the same thing (error = 1 - accuracy): accuracy is the optimistic point of view and prediction error is the pessimistic one.

    If you want to calculate how important a variable is outside these learning algorithms, I suggest the operators "GiniIndexWeighting", "InfoGainWeighting" / "InfoGainRatioWeighting". I personally prefer "InfoGainRatio".
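
    As a rough illustration of what such learner-independent weights measure, here is a minimal Python sketch. It is not RapidMiner's code: the function names (entropy, gini, split_scores) and the toy data are made up for this example, attributes are assumed to be nominal, and the operators may normalize their weights differently.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def gini(values):
    """Gini impurity of a list of nominal values."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())


def split_scores(attribute, labels):
    """Return (info gain, gain ratio, Gini decrease) for one nominal attribute."""
    n = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    # Size-weighted impurity of the label after splitting on the attribute.
    cond_entropy = sum(len(g) / n * entropy(g) for g in groups.values())
    cond_gini = sum(len(g) / n * gini(g) for g in groups.values())
    info_gain = entropy(labels) - cond_entropy
    # Gain ratio divides by the entropy of the attribute itself.
    split_info = entropy(attribute)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    return info_gain, gain_ratio, gini(labels) - cond_gini


# Hypothetical toy data: "useful" determines the label, "noise" does not.
labels = ["yes", "yes", "no", "no"]
useful = ["a", "a", "b", "b"]
noise = ["x", "y", "x", "y"]

useful_scores = split_scores(useful, labels)
noise_scores = split_scores(noise, labels)
print("useful:", useful_scores)
print("noise: ", noise_scores)
```

    Each score looks at one attribute and the label only, so no learner is involved at any point.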

    hope this was helpful

    regards,

    Steffen
  • Legacy User Member Posts: 0 Newbie
    Hi Steffen,

    thank you for your answer.

    Maybe some more explanation on the variable importance measure proposed by Breiman for his
    Random Forests.
    First of all, he defines out-of-bag (OOB) data as the data that is not part of a tree's
    bootstrap sample. Each of the N decision trees is grown on its own bootstrap sample drawn
    from the original data. At each node, a randomly chosen subset of the attributes is
    considered to find the best split.

    Now the variable importance measure comes into play. For each of the trees grown in the forest,
    the OOB data is put down the tree and the number of votes cast for the correct class is counted.
    Next, the values of variable m in the OOB cases are randomly permuted and these cases are again
    put down the tree. Finally, the number of votes for the correct class in the variable-m-permuted
    OOB data is subtracted from the number of votes for the correct class in the untouched OOB
    data. Thus, the larger the difference, the more important variable m is. The average of
    the differences over all trees in the forest is defined as the importance of variable m.
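
    The procedure described above can be sketched in plain Python. This is only an illustration, not Breiman's code and not something RapidMiner provides: the names fit_stump, predict and permutation_importance are made up for this example, one-split "stumps" stand in for fully grown trees (so the random attribute subset is drawn once per tree), and the data is synthetic.

```python
import random


def fit_stump(rows, labels, features):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_cls = max(set(left), key=left.count)
            r_cls = max(set(right), key=right.count)
            errors = sum(y != l_cls for y in left) + sum(y != r_cls for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_cls, r_cls)
    if best is None:  # no valid split: always predict the majority class
        majority = max(set(labels), key=labels.count)
        return (0, float("inf"), majority, majority)
    return best[1:]


def predict(stump, row):
    f, t, l_cls, r_cls = stump
    return l_cls if row[f] <= t else r_cls


def permutation_importance(X, y, n_trees=30, mtry=2, seed=0):
    """Average drop in correct OOB votes per tree when one variable is permuted."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    totals = [0.0] * p
    grown = 0
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        oob = sorted(set(range(n)) - set(in_bag))       # cases left out of it
        if not oob:
            continue
        feats = rng.sample(range(p), mtry)              # random attribute subset
        stump = fit_stump([X[i] for i in in_bag], [y[i] for i in in_bag], feats)
        grown += 1
        # Correct votes on the untouched OOB data.
        base = sum(predict(stump, X[i]) == y[i] for i in oob)
        for m in range(p):
            # Permute variable m among the OOB cases only.
            shuffled = [X[i][m] for i in oob]
            rng.shuffle(shuffled)
            permuted = 0
            for i, v in zip(oob, shuffled):
                row = list(X[i])
                row[m] = v
                permuted += predict(stump, row) == y[i]
            totals[m] += base - permuted
    return [t / grown for t in totals]


# Synthetic data: only variable 0 carries information about the class.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(3)] for _ in range(150)]
y = [int(row[0] > 0.5) for row in X]

imp = permutation_importance(X, y)
print([round(v, 2) for v in imp])
```

    In this toy setup, variable 0 should get a clearly larger value than the two noise variables, whose importances should stay close to zero.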

    "If you want to calculate how important a variable is outside these learning algorithms, I suggest the operators 'GiniIndexWeighting', 'InfoGainWeighting' / 'InfoGainRatioWeighting'. I personally prefer 'InfoGainRatio'."

    I don't think that the approaches you've suggested can be used
    to realize the variable importance measure that Breiman considers
    most appropriate. As far as I can see, the Weighting
    operators are independent of the learner used. But achieving
    what I described above requires integrating a weighting operator
    into the RandomForest operator, i.e. the variable importance
    estimation must take place while the forest is grown. Or am I wrong and
    you see a way to get Breiman's approach working in RapidMiner?

    Regards,
    Paul
  • steffen Member Posts: 347 Maven
    Hello Paul

    Ok, I got it now. As far as I know, the RapidMiner implementation of RandomForest does not calculate the "variable importance measure". On the other hand, the Weka implementation called W-RandomForest (which is also available within RapidMiner) calculates the OutOfBagError (http://weka.sourceforge.net/doc/weka/classifiers/trees/RandomForest.html), but as a single total value that is not very helpful, I guess.

    Since the calculation of such a measure is located deep within the learning algorithm, it cannot be computed by combining RapidMiner operators. I am afraid you will have to dive into the code level...

    regards,

    Steffen

    PS: It would be rather interesting to know if the variable importance measure and the mentioned weighting methods are correlated...
  • marinusfrans_kr Member Posts: 1 Contributor I

    Hi,

     

    You can use the "Weight by Tree Importance" operator on your random forest.

    See doc: https://docs.rapidminer.com/studio/operators/modeling/feature_weights/weight_by_forest.html

     

    There are some other candidates that will pop up when you do the search in the operator search bar.

     

    Regards,

    Marinus
