RapidMiner

Gradient Boosted Trees: extract feature importance?

by Marius_Helf ‎08-24-2016 09:15 AM

Hi all,

Is it possible to extract the feature importance from the GradientBoosted model?

The most convenient solution would be a weights output on the operator in one of the next releases, but I'm sure it must also be possible with some Groovy code?

Unfortunately, the description result of the model shows only the ~10 most/least important features, which is not enough if you have many features.

 

Cheers,

Marius

Comments
RMStaff

Hi Marius,

Good point, this is in fact something we are considering for one of our upcoming releases. I just raised the priority in our tracking system so hopefully it will make it into a release very soon.

Best, Zoltan

Marius_Helf
Super Contributor

That's good news, thanks! I hope "one of the next" means soon :)

Cheers,

Marius

RMStaff

Hi Marius,

This one is already part of the upcoming 7.3 release.

Logistic Regression, Gradient Boosted Trees and Generalized Linear Model all provide an attribute weights vector output.

(Extracting those with Groovy scripts is not possible due to security restrictions, if nothing else.)

Best,

Peter

Elite II

@phellinger and @zprekopcsak, perhaps a more generalized operator to extract attribute importance weights from any model would be even more helpful? I know that there is not necessarily a single definition of attribute importance inside a multivariate model. In theory, though, one simple approach is to take the list of all model attributes, remove them one at a time from the final model, and measure the resulting deterioration in model performance. The attributes are then ranked accordingly: the attribute whose removal causes the greatest decrease in performance gets the highest weight, and all other attributes' weights are scaled relative to it.

This can of course be done manually today (and even with loops to cut down on repetitive operations), but it would be nice if RapidMiner added an operator to do this automatically for any model and output the resulting table as a set of weights. In my view this would answer a very common question from business users about attribute/variable importance in multivariate models.
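For readers who want to try the remove-one-attribute idea outside RapidMiner, here is a minimal Python sketch. It is not a RapidMiner operator and not how the 7.3 weights output is computed. The nearest-centroid classifier, the synthetic data, and the names `accuracy` and `drop_one_weights` are all assumptions made for illustration. Any train-and-score routine could be swapped in for `accuracy`.

```python
import random

def accuracy(train, test, cols):
    """Hold-out accuracy of a nearest-centroid classifier that sees only `cols`."""
    centroids = {}
    for label in (0, 1):
        members = [x for x, y in train if y == label]
        centroids[label] = [sum(x[c] for x in members) / len(members) for c in cols]

    def predict(x):
        # Pick the class whose centroid is closest in squared Euclidean distance.
        return min(centroids, key=lambda lab: sum(
            (x[c] - m) ** 2 for c, m in zip(cols, centroids[lab])))

    return sum(predict(x) == y for x, y in test) / len(test)


def drop_one_weights(train, test, cols, score=accuracy):
    """Retrain without each attribute in turn and weight it by the performance drop.

    Drops are clipped at zero and scaled so the largest drop gets weight 1.0.
    """
    baseline = score(train, test, cols)
    drops = {}
    for c in cols:
        reduced = [k for k in cols if k != c]
        drops[c] = max(0.0, baseline - score(train, test, reduced))
    top = max(drops.values())
    return {c: (d / top if top > 0 else 0.0) for c, d in drops.items()}


# Synthetic data (hypothetical): x0 matters most, x1 a little, x2 is pure noise.
rng = random.Random(0)
data = []
for _ in range(400):
    x = {"x0": rng.uniform(-1, 1), "x1": rng.uniform(-1, 1), "x2": rng.uniform(-1, 1)}
    data.append((x, 1 if x["x0"] + 0.3 * x["x1"] > 0 else 0))
train, test = data[:200], data[200:]

weights = drop_one_weights(train, test, ["x0", "x1", "x2"])
print(weights)
```

The model is retrained once per attribute, so the cost grows linearly with the number of attributes. With many features and a slow learner this can become expensive.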

 

RMStaff

@Telcontar120: sounds like an interesting idea, but might be misleading. If you have two highly correlated attributes, then removing one will not change performance at all, even though one of them may be needed for a good model.

Otherwise, I agree that explaining and adding narrative to a model is very important and we are considering various ways to improve there.
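
The correlated-attributes caveat above can be reproduced with a small Python sketch. Everything in it is an assumption for illustration: the nearest-centroid classifier, the synthetic data, and the attribute `x0_copy`, which is an exact duplicate of `x0` (the extreme case of "highly correlated").

```python
import random

def accuracy(train, test, cols):
    """Hold-out accuracy of a nearest-centroid classifier that sees only `cols`."""
    centroids = {}
    for label in (0, 1):
        members = [x for x, y in train if y == label]
        centroids[label] = [sum(x[c] for x in members) / len(members) for c in cols]

    def predict(x):
        # Pick the class whose centroid is closest in squared Euclidean distance.
        return min(centroids, key=lambda lab: sum(
            (x[c] - m) ** 2 for c, m in zip(cols, centroids[lab])))

    return sum(predict(x) == y for x, y in test) / len(test)


# Synthetic data (hypothetical): the label depends only on x0,
# x0_copy duplicates x0, and noise is unrelated to the label.
rng = random.Random(1)
data = []
for _ in range(400):
    x0 = rng.uniform(-1, 1)
    x = {"x0": x0, "x0_copy": x0, "noise": rng.uniform(-1, 1)}
    data.append((x, 1 if x0 > 0 else 0))
train, test = data[:200], data[200:]

full = accuracy(train, test, ["x0", "x0_copy", "noise"])
without_x0 = accuracy(train, test, ["x0_copy", "noise"])    # drop one of the pair
without_copy = accuracy(train, test, ["x0", "noise"])       # drop the other
without_both = accuracy(train, test, ["noise"])             # drop the whole pair

print("full:", full)
print("without x0:", without_x0, "| without x0_copy:", without_copy)
print("without both:", without_both)
```

Dropping either member of the pair alone barely moves the score, so a remove-one ranking would give both a weight near zero. Dropping both together takes the model down to roughly chance level. One possible workaround, not proposed in the thread, is to remove groups of correlated attributes together.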