Variable importance measure

Hi,

in general, the learner "random forests" provides an algorithm to
measure the importance of the predictor variables. It works as
follows:

"Variable importance: This is a difficult concept to
define in general, because the importance of a
variable may be due to its (possibly complex)
interaction with other variables. The random
forest algorithm estimates the importance of a
variable by looking at how much prediction error
increases when OOB (out-of-bag) data for that variable
is permuted while all others are left unchanged.
The necessary calculations are carried
out tree by tree as the random forest is
constructed."

It's described in Breiman's (developer of random forests)
paper [1] and is, for example, implemented in the GNU R
randomForest package. The package also reports, for each variable,
a Gini-based score (mean decrease in Gini) that indicates how
important that variable is for the classification. It's a very nice
feature and the results can be drawn as a bar chart.

Is this "variable importance measure" also possible in
RapidMiner's RandomForest? I couldn't find it anywhere.
Or can the variable importance be estimated in a different
way using RapidMiner?

Best regards,
Paul

[1] L. Breiman. Manual on setting up, using and understanding
random forests.
3 REPLIES

Re: Variable importance measure

Hello

As far as I understand the posted text (I admit I haven't read the paper), variable importance is equivalent to the splitting criterion used when constructing a decision tree. In RapidMiner this parameter is named "criterion" for both RandomTree and RandomForest.

Note that accuracy and prediction error measure the same thing: accuracy is the optimistic and prediction error the pessimistic point of view (error = 1 - accuracy).

If you want to calculate how important a variable is outside these learning algorithms, I suggest the operators "GiniIndexWeighting", "InfoGainWeighting" / "InfoGainRatioWeighting". I personally prefer "InfoGainRatio".
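
To make clear what a learner-independent weighting looks like, here is a minimal sketch of information gain and gain ratio for a nominal attribute. It is only an illustration under my own assumptions: the function names are made up, and it is not the code behind the RapidMiner operators, whose internals may differ.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def info_gain(values, labels):
    """Reduction in label entropy obtained by splitting on `values`."""
    n = len(labels)
    groups = {}
    for v, label in zip(values, labels):
        groups.setdefault(v, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def info_gain_ratio(values, labels):
    """Information gain normalised by the entropy of the attribute itself."""
    split_info = entropy(values)
    return info_gain(values, labels) / split_info if split_info > 0 else 0.0


# Hypothetical toy data: `useful` separates the classes, `noise` does not.
labels = [1, 1, 0, 0]
useful = ["a", "a", "b", "b"]
noise = ["x", "y", "x", "y"]

weights = {
    "useful": info_gain_ratio(useful, labels),
    "noise": info_gain_ratio(noise, labels),
}
```

Note that each attribute is scored on its own against the label, with no learner involved.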

regards,

Steffen

Re: Variable importance measure

Hi Steffen,

Maybe some more explanation on the variable importance measure proposed by Breiman for his
Random Forests.
First of all, he defines out-of-bag (OOB) data as the data which is not part of a tree's
bootstrap sample. A bootstrap sample is drawn from the original data for each of the N decision
trees grown. At each node, a randomly chosen subset of the attributes is considered to find the best split.

Now the variable importance measure comes into play. For each of the trees grown in the forest,
the OOB data is put down the tree and the number of votes cast for the correct class is counted. Next,
the values of variable m in the OOB cases are randomly permuted and these cases are again
put down the tree. Finally, the number of votes for the correct class in the variable-m-permuted
OOB data is subtracted from the number of votes for the correct class in the untouched OOB
data. Thus, the larger the difference, the more important variable m is. The average of
the differences over all trees in the forest is defined as the importance of variable m.
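
The steps above can be sketched in plain Python. This is only a toy under stated assumptions: one-split stumps stand in for fully grown trees, the data set is synthetic, and all function names are hypothetical. It is not the code of randomForest or RapidMiner.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def majority(labels):
    """Most frequent class label."""
    return max(set(labels), key=labels.count)


def fit_stump(X, y, features):
    """Fit a one-split tree using the best Gini split among `features`."""
    best = None
    for f in features:
        for t in sorted(set(row[f] for row in X)):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    if best is None:  # no valid split found: always predict the majority class
        c = majority(y)
        return lambda row: c
    _, f, t, lc, rc = best
    return lambda row: lc if row[f] <= t else rc


def grow_forest(X, y, n_trees, mtry, rng):
    """Grow stumps on bootstrap samples and remember each tree's OOB cases."""
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]
        feats = rng.sample(range(p), mtry)  # random attribute subset at the node
        tree = fit_stump([X[i] for i in boot], [y[i] for i in boot], feats)
        forest.append((tree, oob))
    return forest


def variable_importance(forest, X, y, m, rng):
    """Average drop in correct OOB votes after permuting variable m."""
    diffs = []
    for tree, oob in forest:
        if not oob:
            continue
        untouched = sum(tree(X[i]) == y[i] for i in oob)
        shuffled = [X[i][m] for i in oob]
        rng.shuffle(shuffled)  # permute variable m among the OOB cases only
        permuted = sum(
            tree(X[i][:m] + [v] + X[i][m + 1:]) == y[i]
            for i, v in zip(oob, shuffled)
        )
        diffs.append(untouched - permuted)
    return sum(diffs) / len(diffs)


# Synthetic data: variable 0 determines the class, variable 1 is pure noise.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

forest = grow_forest(X, y, n_trees=25, mtry=1, rng=rng)
imp = [variable_importance(forest, X, y, m, rng) for m in range(2)]
```

On this data, `imp[0]` should come out clearly larger than `imp[1]`, since permuting the noise variable hardly changes the votes.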

`If you want to calculate how important a variable is outside these learning algorithms, I suggest the operators "GiniIndexWeighting", "InfoGainWeighting" / "InfoGainRatioWeighting". I personally prefer "InfoGainRatio".`

I don't think that the approaches you've suggested can be used
to realize the variable importance measure that Breiman considers
most appropriate. As far as I can see, the Weighting
operators are independent of the learner used. But achieving
what I described above requires integrating a weighting operator
into the RandomForest operator, i.e. the variable importance
estimation must take place while the forest is grown. Or am I wrong and
you see a way to get Breiman's approach working in RapidMiner?

Regards,
Paul

Re: Variable importance measure

Hello Paul

Ok, I got it now. As far as I know, the RapidMiner implementation of RandomForest does not calculate the "variable importance measure". On the other hand, the Weka implementation called W-RandomForest (which is also available within RapidMiner) calculates the OutOfBagError ... (http://weka.sourceforge.net/doc/weka/classifiers/trees/RandomForest.html) ... but as a total value that is not very helpful, I guess.
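
To show what such a total out-of-bag error is, here is a small sketch: every case is classified by majority vote of only those trees for which it was out-of-bag, and the misclassified fraction is reported. The forest representation (pairs of a predict function and a list of OOB indices) and the two toy "trees" are my own assumptions for illustration, not Weka's code.

```python
from collections import Counter


def oob_error(forest, X, y):
    """Fraction of cases misclassified by the majority vote of their OOB trees.

    `forest` is a list of (predict, oob_indices) pairs (hypothetical layout).
    """
    votes = {}
    for predict, oob in forest:
        for i in oob:
            votes.setdefault(i, Counter())[predict(X[i])] += 1
    wrong = sum(1 for i, c in votes.items() if c.most_common(1)[0][0] != y[i])
    return wrong / len(votes)


# Toy data and two hand-made "trees" standing in for a real forest.
X = [[0.1], [0.9], [0.4], [0.7]]
y = [0, 1, 0, 1]
good_tree = lambda row: int(row[0] > 0.5)  # splits at the right place
bad_tree = lambda row: 1                   # always predicts class 1

forest = [(good_tree, [0, 1]), (bad_tree, [2, 3])]
error = oob_error(forest, X, y)  # case 2 is misclassified: 1 of 4 -> 0.25
```

It is a single number for the whole forest, so it says nothing about the individual variables.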

Since the calculation of such a measure is located deep within the learning algorithm, it cannot be computed by combining RapidMiner operators. I am afraid you have to dive into the code level...

regards,

Steffen

PS: It would be rather interesting to know if the variable importance measure and the mentioned weighting methods are correlated...