"Correlation Matrix: When to Use Squared Correlation"
While researching a project involving polynominal datasets, I forgot to check whether RapidMiner had an operator to help, so I'm a bit confused by the Correlation Matrix operator and when to use the "squared correlation" option.
Is the squared correlation the same as a chi-squared calculation, and if so, is the Correlation Matrix similar to "weight by chi-square" but without the need to have a class label defined?
The tutorial example for the Correlation Matrix appears to show that it's suitable for use with the default params on non-numeric data, but other tools like R seem to prefer numeric-only datasets, so I'm unsure how to handle non-numeric datasets in RM when I need to see the correlation.
Any pointers to help clear the fog?
Answers
As far as I know, squared correlation is equivalent to R² in Excel.
Does this help?
Cheers,
Martin
Dortmund, Germany
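To make the R² comparison concrete, here is a minimal Python sketch (not RapidMiner code, and the numbers are invented for illustration). It assumes "squared correlation" simply means the square of the Pearson correlation coefficient, and checks that this matches the R² of a least-squares line through the same points, which is the quantity Excel's RSQ reports. Note that this is a statistic for numeric attributes and is not a chi-squared test.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def r_squared_of_line_fit(xs, ys):
    """R² of the least-squares line y = a + b*x, computed as 1 - SS_res / SS_tot."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
squared_correlation = r ** 2
print(squared_correlation, r_squared_of_line_fit(x, y))
```

The two printed values agree up to floating-point rounding, which is why the squared entries of a correlation matrix can be read as "share of variance explained" for a pair of numeric attributes.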
Thanks for helping. If you are talking about the RSQ() function in Excel, it "can be interpreted as the proportion of the variance in y attributable to the variance in x", according to the Excel help docs. That Excel function isn't suitable for non-numeric data.
Is RM able to process non-numeric data to see if attributes are related, or do I need to convert them? If so, how do I do that without losing the essence of the relationships between categorical attributes?
I think what you want is not possible with a single operator; you need to use a loop here. Attached is a process that calculates such a matrix (as a list) using the Gini Index. You can use any other Weight by operator if you want to. Comments are inside the process.
~Martin
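The attached process is not reproduced in this thread, so the following is only a rough Python sketch of the idea as described: loop over the attributes, treat each one in turn as the label, and weight every other attribute against it with a Gini-based measure. The function names (gini_weight, pairwise_gini_weights), the exact weighting formula (reduction in Gini impurity) and the toy data are assumptions for illustration, not what the RapidMiner operators necessarily compute internally.

```python
from collections import Counter
from itertools import permutations


def gini_impurity(values):
    """Gini impurity of a list of nominal values: 1 - sum of squared class shares."""
    n = len(values)
    return 1.0 - sum((count / n) ** 2 for count in Counter(values).values())


def gini_weight(predictor, label):
    """Reduction in the label's Gini impurity after grouping rows by the predictor."""
    n = len(label)
    groups = {}
    for p, l in zip(predictor, label):
        groups.setdefault(p, []).append(l)
    weighted = sum(len(g) / n * gini_impurity(g) for g in groups.values())
    return gini_impurity(label) - weighted


def pairwise_gini_weights(columns):
    """Return (label_name, predictor_name, weight) for every ordered attribute pair."""
    return [
        (label_name, pred_name,
         gini_weight(columns[pred_name], columns[label_name]))
        for label_name, pred_name in permutations(columns, 2)
    ]


# Invented nominal data, for illustration only.
data = {
    "colour": ["red", "red", "blue", "blue", "green", "green"],
    "shape": ["round", "round", "square", "square", "round", "round"],
    "size": ["S", "L", "S", "L", "S", "L"],
}

rows = pairwise_gini_weights(data)
for label_name, pred_name, weight in rows:
    print(f"label={label_name:<7} predictor={pred_name:<7} weight={weight:.4f}")
```

The result is a list rather than a square matrix, and it is not symmetric: how well colour predicts shape is not the same number as how well shape predicts colour. No numeric conversion of the nominal attributes is needed.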
so: not really
How do you want to use it? Of course, you can delete columns in an iteration so that they are not tested again in the next iteration. That might make everything faster.
Edit: If you want to use it for feature selection, have a look at this extension: http://sourceforge.net/projects/rm-featselext/
The MRMR operator there might be useful. Sadly, it is not on the RM Marketplace.
The other option would be to use Nominal to Numerical with dummy coding. But I think Pearson correlation is "wrong" for a binominal (numerical) attribute.
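For anyone trying the dummy-coding route, here is a small Python sketch of what it amounts to (plain Python with invented data, not the Nominal to Numerical operator itself). It one-hot encodes two binominal attributes and computes Pearson's r on the resulting 0/1 columns. For two 0/1 columns this r is the phi coefficient, and r² equals the chi-squared statistic of the 2x2 table divided by the number of rows, which is the closest link between "squared correlation" and chi-squared. Whether that is the right measure for a given analysis is a separate question, and for attributes with more than two categories dummy coding yields several columns per attribute rather than a single number per pair.

```python
from math import sqrt


def dummy_code(values):
    """One 0/1 indicator column per category of a nominal attribute."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def chi_squared_2x2(xs, ys):
    """Pearson chi-squared statistic (no continuity correction) for two 0/1 columns."""
    n = len(xs)
    observed = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for a, b in zip(xs, ys):
        observed[(a, b)] += 1
    row = {a: observed[(a, 0)] + observed[(a, 1)] for a in (0, 1)}
    col = {b: observed[(0, b)] + observed[(1, b)] for b in (0, 1)}
    return sum(
        (observed[(a, b)] - row[a] * col[b] / n) ** 2 / (row[a] * col[b] / n)
        for a in (0, 1) for b in (0, 1)
    )


# Invented binominal data, for illustration only.
smoker = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
cough = ["yes", "yes", "no", "no", "no", "yes", "no", "yes"]

smoker_yes = dummy_code(smoker)["yes"]
cough_yes = dummy_code(cough)["yes"]

r = pearson_r(smoker_yes, cough_yes)
chi2 = chi_squared_2x2(smoker_yes, cough_yes)
print(r ** 2, chi2 / len(smoker))
```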
I went crazy trying the dummy variables route. I'll check out the operators you suggest.