Dealing with collinearity

chris92 Member Posts: 6 Contributor I
edited December 2018 in Help

Hi, 

I am having difficulties with my data. I have 96 attributes and I need a scientifically robust method for checking collinearity between them. I have been experimenting with the 'Remove Correlated Attributes' operator.

 

I have a few questions pertaining to this:

a) When you have three attributes and two of them are highly correlated with the third, what criterion does this operator use to choose which attribute to keep?

 

b) I want to remove attributes that have a correlation of 0.75 or greater in absolute value. This needs to apply to both positive and negative correlations, so a correlation of -0.83 should also lead to removal. How can I get this operator to apply these requirements?

 

If RapidMiner offers better methods for dealing with collinearity, I would appreciate any further suggestions.

 

Thanks,

Chris

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @chris92 - if it were me, I would begin with the Correlation Matrix operator.  You will see all your r values in a nice chart, which will help a lot.  In VERY general stats terms, the higher the abs(r), the stronger the correlation.  Lots of stats materials explain this very well.  As for negative values, you generally use r^2, which is conveniently a parameter of the operator (a cutoff of abs(r) >= 0.75 is the same as r^2 >= 0.5625).  :)

     

    Scott
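
For readers who want to reproduce this check outside RapidMiner, here is a minimal Python sketch of the idea: compute the pairwise Pearson r and flag every pair whose absolute value reaches the 0.75 cutoff from the question. The names `pearson` and `correlated_pairs` and the toy data are made up for illustration; this is not the Correlation Matrix operator itself.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson r of two equal-length numeric lists (assumes neither is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.75):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:  # catches r = -0.83 as well as r = +0.83
            pairs.append((a, b, r))
    return pairs


# Hypothetical toy data: "b" rises with "a", "c" falls with "a", "d" is unrelated.
data = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 5, 8, 10, 13],
    "c": [12, 10, 9, 6, 5, 1],
    "d": [3, 1, 4, 1, 5, 2],
}

for a, b, r in correlated_pairs(data):
    print(f"{a} ~ {b}: r = {r:+.3f}, r^2 = {r * r:.3f}")
```

Comparing `abs(r)` with 0.75 and comparing `r * r` with 0.5625 flag exactly the same pairs, so either form handles negative correlations.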

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    If you turn on expert parameters, "Remove Correlated Attributes" will actually handle both of your questions as well.  

    First, there is a parameter to use absolute correlations, which handles both the positive and negative values (it is on by default).

    There is also a parameter to specify which attribute is kept when it finds a set of correlated attributes.  Your options are based on the order that the attributes appear in your dataset, and you can choose to keep the first, the last, or a random one.

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
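
The order-based behaviour described in this reply can be pictured with a small Python sketch. It is only an illustration under assumptions: the function `remove_correlated`, its parameters `use_absolute`, `keep`, and `seed`, and the toy data are hypothetical, and the greedy loop is one plausible reading of "keep the first, the last, or a random one", not necessarily how the operator is implemented internally.

```python
import math
import random


def pearson(xs, ys):
    """Pearson r of two equal-length numeric lists (assumes neither is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def remove_correlated(columns, threshold=0.75, use_absolute=True,
                      keep="first", seed=None):
    """Greedy order-based filter: visit attributes in the chosen order and
    keep one only if it is not too correlated with any attribute kept so far."""
    order = list(columns)
    if keep == "last":
        order.reverse()
    elif keep == "random":
        random.Random(seed).shuffle(order)

    def is_correlated(a, b):
        r = pearson(columns[a], columns[b])
        return (abs(r) if use_absolute else r) >= threshold

    kept = []
    for name in order:
        if not any(is_correlated(name, other) for other in kept):
            kept.append(name)
    # Report the survivors in their original dataset order.
    return [name for name in columns if name in kept]


# Hypothetical toy data: "a", "b" and "c" are mutually correlated, "d" is not.
data = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 5, 8, 10, 13],
    "c": [12, 10, 9, 6, 5, 1],
    "d": [3, 1, 4, 1, 5, 2],
}

print(remove_correlated(data, keep="first"))
print(remove_correlated(data, keep="last"))
print(remove_correlated(data, use_absolute=False))
```

In this sketch the result depends only on column order, never on how useful an attribute is, and switching off `use_absolute` lets the negatively correlated column "c" survive.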
  • chris92 Member Posts: 6 Contributor I

    Thanks very much for the speedy responses. 

     

    I have one follow-up question: is there a different operator that chooses which attribute to remove based on its correlation with a target variable, rather than using the original, random, or reverse order options? In my problem I do not want to remove potentially important attributes. Essentially, I need an operator that identifies correlated attributes and then removes whichever one has the weaker relationship with the outcome variable. Does such an operator exist?

     

    Thanks,

    Chris
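
To make the requested behaviour concrete, here is a rough Python sketch of the selection rule described in this question: among attributes that are highly correlated with each other, keep the one with the strongest absolute correlation to the target and drop the rest. The function `remove_by_target_relevance`, the toy data, and the use of Pearson r as the relevance measure are all assumptions for illustration; this is not an existing RapidMiner operator.

```python
import math


def pearson(xs, ys):
    """Pearson r of two equal-length numeric lists (assumes neither is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def remove_by_target_relevance(columns, target, threshold=0.75):
    """Keep, from each group of mutually correlated attributes, the one that
    is most strongly correlated (in absolute value) with the target."""
    relevance = {name: abs(pearson(values, target))
                 for name, values in columns.items()}
    # Visit the most relevant attributes first so they win any conflict.
    ranked = sorted(columns, key=lambda name: relevance[name], reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[other])) < threshold
               for other in kept):
            kept.append(name)
    # Report the survivors in their original dataset order.
    return [name for name in columns if name in kept]


# Hypothetical toy data: "a", "b" and "c" are mutually correlated, "d" is not.
data = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 5, 8, 10, 13],
    "c": [12, 10, 9, 6, 5, 1],
    "d": [3, 1, 4, 1, 5, 2],
}
target = [1, 3, 2, 5, 4, 6]  # made-up outcome variable

print(remove_by_target_relevance(data, target))
```

Because the ranking uses the target, the surviving member of a correlated group no longer depends on where it sits in the dataset. Any other relevance score could replace Pearson r in `relevance` without changing the rest of the loop.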

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,497 RM Data Scientist

    Dear Chris,

     

    First of all, you should think about whether you are really interested in collinearity or in dependencies in general. Typical data science tasks are not linear, so why do you want to focus on linear assumptions?

     

    Second, have a look at: http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/Feature-Weighting-Tutorial/ta-p/35281 It gives you quite a few options.

     

    My proposal: weights of Logistic or Linear Regressions.

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    @Telcontar120 wrote:

     

    There is also a parameter to specify which attribute is kept when it finds a set of correlated attributes.  Your options are based on the order that the attributes appear in your dataset, and you can choose to keep the first, the last, or a random one.

     


    Hi @Telcontar120

    Is there such a parameter? I couldn't find it in either the LogReg or the GLM operator.

    So, next question: which column exactly is kept by default when the others are removed?

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    @kypexin The parameters I mentioned are in the "Remove Correlated Attributes" operator which I usually run before modeling.  I think you are right, there are not similar options in the built-in functions inside some of the ML algorithms.  That's one good reason to run it separately first, if you want more control over what is kept.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts