"Correlation, weird behavior"

wesselwessel Member Posts: 537 Maven
edited May 2019 in Help
[begin edit] Dear All,  [end edit]

The following data has correlation: 0.999
#   sum     prediction(sum)       a1      a2    a3
1   6.0     11.06672979160903     1.0     2.0   3.0
2   9.0     11.066728936515114    2.0     3.0   4.0
3   15.0    11.066735677516975    9.0     2.0   4.0
4   11.0    11.066728936098524    4.0     5.0   2.0
5   16.0    11.06672900369881     6.0     1.0   9.0
6   5.0     11.066728942691093    0.0     3.0   2.0
7   4.0     11.066728979026438    0.0     3.0   1.0
8   9.0     11.066728936099063    3.0     5.0   1.0
9   359.0   349.5374686083969     344.0   8.0   7.0
Is this how correlation is supposed to work?

I never knew that correlation was so strongly affected by outliers.

Best regards,



    landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hello Wessel,
    what about saying hello before bursting out some statement?

    Regarding your question: Yes, it is. Correlation is built upon the covariance, which is the average of the products of each value's difference from its attribute's mean value, so a single extreme point can dominate the result.
    Or do you suggest that we have an error in the calculation routine? Then please specify the process you used and give some comparable results from another software.
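
To make the covariance argument concrete, here is a minimal sketch that computes the Pearson correlation directly from its definition, using the nine rows posted above. The names `pearson`, `actual`, and `predicted` are illustrative only and are not taken from RapidMiner or WEKA. It compares the coefficient with and without row 9.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation from deviations about the mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    return cov / sqrt(var_x * var_y)

# "sum" and "prediction(sum)" columns from the original post.
actual = [6.0, 9.0, 15.0, 11.0, 16.0, 5.0, 4.0, 9.0, 359.0]
predicted = [
    11.06672979160903, 11.066728936515114, 11.066735677516975,
    11.066728936098524, 11.06672900369881, 11.066728942691093,
    11.066728979026438, 11.066728936099063, 349.5374686083969,
]

r_all = pearson(actual, predicted)
# Drop row 9, the single extreme point.
r_without_outlier = pearson(actual[:-1], predicted[:-1])

print(f"all 9 rows:    {r_all:.4f}")
print(f"without row 9: {r_without_outlier:.4f}")
```

With all nine rows, the deviation products of row 9 are several orders of magnitude larger than those of the other rows, so that one point effectively determines the coefficient.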

    wesselwessel Member Posts: 537 Maven
    No, I'm not suggesting an error in calculation.
    Just to be sure, I ran the same experiment in both WEKA and RapidMiner.
    Both give the same results.
    So no, the calculation is fine.
    (Chances of RapidMiner being wrong are small :P,
    chances of both WEKA and RapidMiner being wrong are really small.)
    correlation: 0.999
    absolute_error: 22.853 +/- 5.105
    PerformanceVector: root_mean_squared_error: 23.416 +/- 0.000
    normalized_absolute_error: 0.331
    root_relative_squared_error: 0.213

    === Summary ===
    Correlation coefficient                  0.9994
    Mean absolute error                     22.853
    Root mean squared error                 23.4161
    Relative absolute error                 33.0906 %
    Root relative squared error             21.2978 %
    Total Number of Instances                9    
    It seems undesirable that a performance measure depends so heavily on incidental things, such as outliers in the data.
    So when using correlation as a performance measure, it is very important to keep this behavior in mind.
    I'm thinking about a modified correlation measure that is more robust with respect to outliers.
    Simply rescaling won't do the job, because correlation is independent of scaling.
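
One well-known outlier-resistant alternative is Spearman's rank correlation, which applies the Pearson formula to ranks instead of raw values. The sketch below is only an illustration of that standard technique, not the modified measure the poster has in mind. The helper names (`pearson`, `average_ranks`, `spearman`) are hypothetical, and the data are the nine rows from the first post. It also checks that multiplying one variable by a constant leaves the Pearson coefficient unchanged.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation from deviations about the mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

actual = [6.0, 9.0, 15.0, 11.0, 16.0, 5.0, 4.0, 9.0, 359.0]
predicted = [
    11.06672979160903, 11.066728936515114, 11.066735677516975,
    11.066728936098524, 11.06672900369881, 11.066728942691093,
    11.066728979026438, 11.066728936099063, 349.5374686083969,
]

r = pearson(actual, predicted)
r_scaled = pearson([10.0 * x for x in actual], predicted)  # rescaled input
rho = spearman(actual, predicted)

print(f"Pearson:           {r:.4f}")
print(f"Pearson (x * 10):  {r_scaled:.4f}")
print(f"Spearman:          {rho:.4f}")
```

Because row 9 contributes only its rank (9 in both columns) rather than its magnitude, the rank-based coefficient is far less dominated by that single point.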

    landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Wessel,
    do you know any literature about that? It seems very likely to me that someone else has already stumbled over this issue.
    And you are right. One has to keep that in mind, but when you look at the plot of your values, every human would assume that there's a linear dependency.
