"Correlation, weird behavior"

wessel · October 2010

[begin edit] Dear All, [end edit]

The following data has correlation: 0.999

#	sum	prediction(sum)	a1	a2	a3
1	6.0	11.06672979160903	1.0	2.0	3.0
2	9.0	11.066728936515114	2.0	3.0	4.0
3	15.0	11.066735677516975	9.0	2.0	4.0
4	11.0	11.066728936098524	4.0	5.0	2.0
5	16.0	11.06672900369881	6.0	1.0	9.0
6	5.0	11.066728942691093	0.0	3.0	2.0
7	4.0	11.066728979026438	0.0	3.0	1.0
8	9.0	11.066728936099063	3.0	5.0	1.0
9	359.0	349.5374686083969	344.0	8.0	7.0

Is this how correlation is supposed to work?

I never knew that correlation was so strongly affected by outliers.

Best regards,

Wessel

land · October 2010

Hello Wessel,
how about saying hello before blurting out a statement?

Regarding your question: Yes, it is. Correlation is built on the covariance, which is the average of the products of each value's deviation from its attribute's mean, so a single extreme point can dominate the sum.
Or are you suggesting that we have an error in the calculation routine? If so, please describe the process you used and give some comparable results from other software.

Greetings,
Sebastian
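
The description above can be checked against the nine rows posted in the first message. The following is a minimal sketch in plain Python, not the routine used by RapidMiner or WEKA; the function name `pearson` and the variable names are chosen here for illustration. It computes the correlation between `sum` and `prediction(sum)` once with all rows and once without row 9.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Columns "sum" and "prediction(sum)" copied from the table in the first post.
actual = [6.0, 9.0, 15.0, 11.0, 16.0, 5.0, 4.0, 9.0, 359.0]
predicted = [
    11.06672979160903, 11.066728936515114, 11.066735677516975,
    11.066728936098524, 11.06672900369881, 11.066728942691093,
    11.066728979026438, 11.066728936099063, 349.5374686083969,
]

r_all = pearson(actual, predicted)
# Drop row 9, the only row with a large value in both columns.
r_without_outlier = pearson(actual[:-1], predicted[:-1])

print(f"all rows:      {r_all:.4f}")
print(f"without row 9: {r_without_outlier:.4f}")
```

With all nine rows the deviation products are dominated by row 9, which lies far from both means, so the result is close to 1. Without that row the predictions are nearly constant and the coefficient is much lower.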

wessel · October 2010

No, I'm not suggesting an error in calculation.
Just to be sure, I ran the same experiment in both WEKA and RapidMiner.
Both give the same results.
So no, the calculation is fine.
(Chances of RapidMiner being wrong are small :P,
Chances of both WEKA and RapidMiner being wrong are really small)

PerformanceVector
correlation: 0.999 
absolute_error: 22.853 +/- 5.105
PerformanceVector: root_mean_squared_error: 23.416 +/- 0.000 
normalized_absolute_error: 0.331
root_relative_squared_error: 0.213 


=== Summary ===
Correlation coefficient                  0.9994
Mean absolute error                     22.853 
Root mean squared error                 23.4161
Relative absolute error                 33.0906 %
Root relative squared error             21.2978 %
Total Number of Instances                9

It seems undesirable that a performance measure depends so heavily on trivial things, such as outliers in the data.
So when using correlation as a performance measure, it is very important to keep this behavior in mind.
I'm thinking about a modified correlation measure that is more robust with respect to outliers.
Simply rescaling won't do the job, because correlation is independent of scaling.
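
One existing measure in this direction is Spearman's rank correlation, which applies the same correlation formula to the ranks of the values instead of the values themselves. It is named here only as an illustration; it is not necessarily the modified measure meant above, and it measures monotonic rather than linear association. The following is a rough sketch in plain Python, with helper names (`pearson`, `average_ranks`, `spearman`) made up for this example and tied values given their average rank.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Columns "sum" and "prediction(sum)" copied from the table in the first post.
actual = [6.0, 9.0, 15.0, 11.0, 16.0, 5.0, 4.0, 9.0, 359.0]
predicted = [
    11.06672979160903, 11.066728936515114, 11.066735677516975,
    11.066728936098524, 11.06672900369881, 11.066728942691093,
    11.066728979026438, 11.066728936099063, 349.5374686083969,
]

rho_all = spearman(actual, predicted)
print(f"Spearman, all rows: {rho_all:.4f}")
```

Because row 9 only contributes the highest rank in both columns, regardless of how far away it lies, it can no longer dominate the result, and the rank-based value for this data comes out well below 0.999.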

land · October 2010

Hi Wessel,
do you know any literature about that? It seems very likely to me that someone else has already stumbled over this issue.
And you are right. One has to keep that in mind, but if you look at a plot of your values, anyone would assume that there is a linear dependency.

Greetings,
Sebastian
