"Important factors for prediction": how does it work?

up201712146 Member Posts: 1 Newbie
edited June 2019 in Help

I'm using Random Forest and Boosted Trees from AutoModel to prioritize the variables I'll use when modeling with neural networks, so it is essential for me to know the "importance" of each input variable. AutoModel provides "Important factors for prediction", but I don't know how it works. I think it is based on correlation, but in that case it should be independent of the type of model, yet Random Forest and Boosted Trees generate different results. Moreover, different results are also generated before and after optimization.

My question is: how is the importance of factors calculated?

Thank you.


  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi @up201712146,

    The variable importance is calculated by each model in its own way. For example, a Random Forest consists of trees, each of which may or may not contain a given attribute, at different positions inside the tree. A variable with good predictive power will end up in more trees and in more prominent positions.

    A linear regression model would look at the standardized coefficients etc.

    There are "Weight by ..." operators that can give you variable importances based on correlation, information gain, gain ratio etc. These might be similar to the weights you're getting from your models but they're not the same.

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Hi @up201712146, the best approach would be determined by the specific question you are trying to answer.
    As @BalazsBarany said, the various "Weight by" operators (e.g., Weight by Correlation, Weight by Information Value) are good for finding the univariate strength of relationships between individual attributes and your label.  However, that does not mean those are the most important in a multivariate model context because of the potential overlap of information (e.g., multicollinearity in the linear regression context).  Likewise, the "variable importance" measures that are provided by individual machine learning operators do not necessarily show you the attributes with the strongest individual relationships with the label, but rather those in the context of that specific model with the other attributes that are available.  This is an important distinction to keep in mind.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
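
The distinction drawn in the replies above, between univariate "Weight by ..." style scores and importances seen in the context of a specific model, can be illustrated with a small sketch. This is not how AutoModel computes "Important factors for prediction"; it is plain Python on synthetic data, and the attribute names `x1`, `x2`, `x3`, the helper functions `pearson` and `solve`, and the data-generating formula are all assumptions made up for the illustration. Here `x2` is a noisy copy of `x1`, and the label depends only on `x1` and `x3`.

```python
import random


def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


# Synthetic data (assumption): x2 overlaps heavily with x1, x3 is independent.
rng = random.Random(42)
n = 2000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [v + 0.3 * rng.gauss(0, 1) for v in x1]
x3 = [rng.gauss(0, 1) for _ in range(n)]
label = [2 * a + c + 0.5 * rng.gauss(0, 1) for a, c in zip(x1, x3)]

attributes = {"x1": x1, "x2": x2, "x3": x3}
names = list(attributes)

# Univariate view: correlation of each attribute with the label on its own.
univariate = {k: pearson(v, label) for k, v in attributes.items()}

# Model-context view: standardized linear regression coefficients, obtained
# by solving R * beta = r, where R is the attribute correlation matrix and
# r holds the attribute-label correlations.
corr_matrix = [[pearson(attributes[a], attributes[b]) for b in names]
               for a in names]
betas = solve(corr_matrix, [univariate[k] for k in names])
model_weights = dict(zip(names, betas))

for k in names:
    print(f"{k}: correlation={univariate[k]:.3f}  "
          f"standardized coefficient={model_weights[k]:.3f}")
```

Under these assumptions, `x2` should score high on the univariate correlation because it overlaps with `x1`, yet receive a standardized coefficient near zero once `x1` is in the same model, while `x3` shows the opposite pattern: a modest correlation but a clearly non-zero coefficient.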