Auto Modeling and its Results > General > Correlations

CausalityvsCorr Member Posts: 17 Contributor II
edited December 2018 in Help

I have a few numerical variables and one categorical variable with the categories A, B, C, and D. Why does the correlation matrix only show correlations for B, C, and D against the other variables and "skip" category A?


Best Answer

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Solution Accepted

    Hi,

     

    You can check the process yourself and see exactly how the correlations are calculated. This is the beauty of our Auto Modeling approach: there are no black boxes. Just select "Correlations" in the results and then click the "Open Process" button at the bottom of the screen. RapidMiner will show you the complete process used to calculate the correlations.

     

    If you do this, you will see that one part of the process handles the nominal (or "categorical") columns in your data. Double-click this subprocess to open it. If you dive deeper, you will find an operator called "Nominal to Numerical", which performs a dummy encoding using the least frequent category as the comparison group. This is why the A in your example is gone: it became the comparison group (also known as the "reference category") for the others.
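
To make the idea concrete, here is a minimal sketch in plain Python of dummy encoding with the least frequent category as the reference group, followed by a Pearson correlation per dummy column. It is not the RapidMiner implementation: the helper names `dummy_encode` and `pearson`, the toy data, the `income` column, and the alphabetical tie-breaking are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def dummy_encode(values):
    """Dummy-encode a nominal column, dropping the least frequent category.

    Returns (reference_category, {category: [0/1, ...]}).
    """
    counts = Counter(values)
    # Assumption: ties are broken alphabetically; the real operator may differ.
    reference = min(sorted(counts), key=lambda c: counts[c])
    columns = {
        cat: [1 if v == cat else 0 for v in values]
        for cat in sorted(counts)
        if cat != reference
    }
    return reference, columns


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: "A" occurs least often, so it becomes the reference.
category = ["A", "B", "B", "C", "C", "C", "D", "D", "D", "D"]
income = [10.0, 12.0, 15.0, 20.0, 22.0, 19.0, 30.0, 31.0, 29.0, 35.0]

reference, dummies = dummy_encode(category)
correlations = {
    f"category = {cat}": pearson(col, income) for cat, col in dummies.items()
}

print("reference category:", reference)
for name, r in correlations.items():
    print(f"{name}: {r:+.3f}")
```

Only the B, C, and D columns exist after encoding, so only they can appear in a correlation matrix. A row that belongs to A is still represented: it is the row where all three dummy columns are 0.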

     

    If you want to learn more about dummy coding and comparison groups / reference categories, here are two links which might be useful:

     

    https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqwhat-is-dummy-coding/

    https://www.theanalysisfactor.com/strategies-dummy-coding/

     

    Hope this helps,

    Ingo
