Variable importance measure


11-12-2008 07:02 AM


Hi,

in general, the "random forests" learner provides an algorithm to measure the importance of the predictor variables. It works as follows:

"Variable importance: This is a difficult concept to define in general, because the importance of a variable may be due to its (possibly complex) interaction with other variables. The random forest algorithm estimates the importance of a variable by looking at how much prediction error increases when OOB (out-of-bag) data for that variable is permuted while all others are left unchanged. The necessary calculations are carried out tree by tree as the random forest is constructed."

It's described in the paper [1] by Breiman (the developer of random forests) and is implemented, for example, in the GNU R randomForest package. GNU R reports for each variable a Gini-based measure (the mean decrease in the Gini index) that indicates how important that variable is for the classification. It's a very nice feature, and the results can be drawn as a bar chart.

Is this "variable importance measure" also available in RapidMiner's RandomForest? I couldn't find it anywhere. Or can the variable importance be estimated in a different way using RapidMiner?

Thank you for your help.

Best regards,

Paul

[1] L. Breiman. Manual on setting up, using and understanding random forests.


3 REPLIES


11-12-2008 08:26 AM


Hello

As far as I understand the posted text (I admit I haven't read the paper), variable importance is equivalent to the splitting criterion used when constructing a decision tree. In RapidMiner this parameter is named "criterion" for both RandomTree and RandomForest.

Note that accuracy and prediction error measure the same thing: accuracy is the optimistic point of view and prediction error the pessimistic one.

If you want to calculate how important a variable is outside these learning algorithms, I suggest the operators "GiniIndexWeighting", "InfoGainWeighting" / "InfoGainRatioWeighting". I personally prefer "InfoGainRatio".
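To make concrete what such a learner-independent weighting computes, here is a minimal Python sketch of an information gain ratio for nominal attributes. It is only an illustration of the general formula and not RapidMiner's implementation; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(attribute, labels):
    """Information gain ratio of one nominal attribute with respect to the labels."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy that remains after splitting on the attribute
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own value distribution
    split_info = entropy(attribute)
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: "outlook" determines the label, "windy" carries no information
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

weights = {"outlook": gain_ratio(outlook, play), "windy": gain_ratio(windy, play)}
```

Note that nothing here depends on a trained model: the weights are computed from the data alone, one attribute at a time.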

hope this was helpful

regards,

Steffen



11-13-2008 10:32 AM


Hi Steffen,

thank you for your answer.

Maybe some more explanation of the variable importance measure proposed by Breiman for his Random Forests.

First of all, he defines out-of-bag (OOB) data as the data that is not part of the bootstrap sample. Each of the N decision trees is grown on a bootstrap sample drawn from the original data. At each node, a randomly chosen subset of the attributes is considered to find the best split.

Now the variable importance measure comes into play. For each tree grown in the forest, the OOB data is put down the tree and the number of votes cast for the correct class is counted. Next, the values of variable m in the OOB cases are randomly permuted and these cases are put down the tree again. Finally, the number of votes for the correct class in the variable-m-permuted OOB data is subtracted from the number of votes for the correct class in the untouched OOB data. Thus, the larger the difference, the more important variable m is. The average of these differences over all trees in the forest is defined as the importance of variable m.
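The procedure described above can be sketched in plain Python roughly as follows. This is a minimal illustration under several assumptions, not Breiman's reference code and not anything RapidMiner provides: the tree learner is a deliberately small Gini-based tree for numeric attributes and integer class labels, and all names (grow_tree, predict, oob_permutation_importance, mtry) as well as the toy data are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(rows, labels, mtry, rng, depth=0, max_depth=4):
    """Grow a small tree; every node looks at only `mtry` randomly chosen attributes."""
    if depth == max_depth or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]      # leaf: majority class
    best = None
    for f in rng.sample(range(len(rows[0])), mtry):
        for t in set(r[f] for r in rows):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:                                     # no usable split found
        return Counter(labels).most_common(1)[0][0]
    _, f, t = best
    lo = [i for i, r in enumerate(rows) if r[f] <= t]
    hi = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            grow_tree([rows[i] for i in lo], [labels[i] for i in lo], mtry, rng, depth + 1, max_depth),
            grow_tree([rows[i] for i in hi], [labels[i] for i in hi], mtry, rng, depth + 1, max_depth))

def predict(tree, row):
    """Put one case down the tree; inner nodes are tuples, leaves are class labels."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def oob_permutation_importance(X, y, n_trees=25, mtry=2, seed=0):
    """Average drop in correct OOB votes per tree after permuting each variable."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    totals = [0.0] * n_features
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]       # bootstrap sample (with replacement)
        in_bag = set(bag)
        oob = [i for i in range(n) if i not in in_bag]   # cases not in the bootstrap sample
        tree = grow_tree([X[i] for i in bag], [y[i] for i in bag], mtry, rng)
        correct = sum(predict(tree, X[i]) == y[i] for i in oob)
        for m in range(n_features):
            shuffled = [X[i][m] for i in oob]
            rng.shuffle(shuffled)                        # permute variable m among the OOB cases
            correct_permuted = sum(
                predict(tree, X[i][:m] + [v] + X[i][m + 1:]) == y[i]
                for i, v in zip(oob, shuffled))
            totals[m] += correct - correct_permuted
    return [total / n_trees for total in totals]

# Hypothetical toy data: only attribute 0 determines the class, attributes 1 and 2 are noise
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random(), data_rng.random()] for _ in range(120)]
y = [int(row[0] > 0.5) for row in X]
importance = oob_permutation_importance(X, y)
```

On this toy data the importance of attribute 0 should come out clearly larger than that of the two noise attributes. The sketch also shows why the measure is tied to the learner: it needs each individual tree together with the OOB cases of that tree.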

I don't think that the approaches you've suggested can be used to realize the variable importance measure that Breiman considers most appropriate. As far as I can see, the Weighting operators are independent of the learner used. But achieving what I described above requires integrating a weighting operator into the RandomForest operator, i.e. the variable importance estimation must take place while the forest is being grown. Or am I wrong, and do you see a way to get Breiman's approach working in RapidMiner?

Regards,

Paul



11-13-2008 11:06 AM


Hello Paul

OK, I've got it now. As far as I know, the RapidMiner implementation of RandomForest does not calculate the "variable importance measure". On the other hand, the Weka implementation called W-RandomForest (which is also available within RapidMiner) calculates the OutOfBagError (http://weka.sourceforge.net/doc/weka/classifiers/trees/RandomForest.html), but only as a single total value, which is not that helpful, I guess.

Since the calculation of such a measure is located deep within the learning algorithm, it cannot be computed by combining RapidMiner operators. I am afraid you have to dive into the code level...

regards,

Steffen

PS: It would be rather interesting to know if the variable importance measure and the mentioned weighting methods are correlated...
