Correlation Matrix

maccten Member Posts: 28 Contributor II
edited June 2019 in Help
Hi,

I have a large data set with many attributes.
I would like to see how closely the attributes are correlated, but because of the sheer number of them I'm only interested in attributes with a correlation of about 40% or more.
Is there a way to do this, for example using a filter of some kind? I know you can remove correlated attributes and select by weights, but neither is what I need, as I'm interested in the high correlations.

Thank you for your time

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    There are options like "top k" and "top p%" in the Select by Weights operator that might help.

    regards

    Andrew
  • maccten Member Posts: 28 Contributor II
    Hi Andrew

    Thanks for the quick reply. I ran it this morning, but I don't think this is what I'm looking for.
    What I need is the pairwise table, so I can say specifically that there is a 50% correlation between Attribute A and B but a negative correlation between A and C.
    Do you know if you can filter the actual matrix?

    Thanks
  • maccten Member Posts: 28 Contributor II
    Hi All

    Is there perhaps a way to export the pairwise table to a CSV file or generate a report based on it?
    Has anyone tried this before?
    If it were in a database, it would be a simple case of selecting the rows where the correlation is above a certain amount.

    Thanks
  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    A Groovy script would be able to do it. I could probably do that in return for beer or money  ;D

    Alternatively, I'm thinking about the possibility of calculating the correlation in a process without using the built-in operators. That would let you build an example set that could be filtered as you like.

    regards

    Andrew
  • maccten Member Posts: 28 Contributor II
    I thought this link provided the answer: http://www.myexperiment.org/workflows/1279.html

    Unfortunately, it doesn't provide a pairwise table, and the matrix in question covers 5000 attributes, so exporting it to Excel means cutting off a good portion of it.

    I'll keep the beer money in mind of course :), as soon as the next paycheck comes around
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Have a look at the configuration of the Report operator: you should be able to configure Pairwise Table as output format.

    Have a look at the process below:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="5.3.008" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30"/>
          <operator activated="true" class="correlation_matrix" compatibility="5.3.008" expanded="true" height="94" name="Correlation Matrix" width="90" x="179" y="30"/>
          <operator activated="true" class="reporting:generate_report" compatibility="5.3.000" expanded="true" height="76" name="Generate Report" width="90" x="313" y="30">
            <parameter key="report_name" value="test"/>
            <parameter key="format" value="Excel"/>
            <parameter key="excel_output_file" value="C:\Users\jdoe\Desktop\test.xls"/>
          </operator>
          <operator activated="true" class="reporting:report" compatibility="5.3.000" expanded="true" height="60" name="Report" width="90" x="447" y="30">
            <parameter key="report_name" value="test"/>
            <parameter key="specified" value="true"/>
            <parameter key="reportable_type" value="Numerical Matrix"/>
            <parameter key="renderer_name" value="Pairwise Table"/>
            <list key="parameters">
              <parameter key="min_row" value="1"/>
              <parameter key="max_row" value="2147483647"/>
              <parameter key="min_column" value="1"/>
              <parameter key="max_column" value="2147483647"/>
            </list>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Correlation Matrix" to_port="example set"/>
          <connect from_op="Correlation Matrix" from_port="matrix" to_op="Generate Report" to_port="through 1"/>
          <connect from_op="Generate Report" from_port="through 1" to_op="Report" to_port="reportable in"/>
          <connect from_op="Report" from_port="reportable out" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • maccten Member Posts: 28 Contributor II
    Hi Marius

    This works :)
    However, I have one last problem related to this.
    My pairwise table is going to have roughly 25 million rows, which cannot be exported using a report.
    Is there any way to filter the matrix/pairwise table so that only attributes with a certain correlation are exported, for example only those with a correlation of 50% or more?

    Thanks
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Unfortunately, this is not possible. To solve the problem once and for all, we have an internal ticket requesting that the matrix be converted into a normal example set, but we don't have a schedule for it yet.
  • maccten Member Posts: 28 Contributor II
    Thanks Marius very much for the feedback
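
For readers who reach this thread with the same problem: the "database-style" selection asked for above (keep only the pairs whose correlation exceeds a threshold) can be sketched outside RapidMiner. The following is a minimal illustration in plain Python, not a RapidMiner feature and not the method of anyone in this thread. It assumes the data has already been exported as one list of numbers per attribute; the function names (`pearson`, `filtered_pairs`, `write_pairs_csv`), the toy data, and the choice to filter on the absolute value (so strong negative correlations are kept too) are all assumptions made for the example.

```python
import csv
import io
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant attribute
    return cov / math.sqrt(var_x * var_y)


def filtered_pairs(columns, threshold=0.5):
    """Yield (first, second, r) for each attribute pair with |r| >= threshold.

    Pairs are generated lazily, so the full pairwise table is never held
    in memory; only the rows that pass the filter are produced.
    """
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if not math.isnan(r) and abs(r) >= threshold:
            yield a, b, r


def write_pairs_csv(pairs, handle):
    """Write the filtered pairs as a three-column CSV table."""
    writer = csv.writer(handle)
    writer.writerow(["first_attribute", "second_attribute", "correlation"])
    for a, b, r in pairs:
        writer.writerow([a, b, f"{r:.4f}"])


# Hypothetical toy data standing in for the real attributes.
data = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 6.0, 8.0, 10.0],   # rises with A
    "C": [5.0, 4.0, 3.0, 2.0, 1.0],    # falls as A rises
    "D": [1.0, -1.0, 0.0, -1.0, 1.0],  # no linear relation to A
}

buffer = io.StringIO()
write_pairs_csv(filtered_pairs(data, threshold=0.5), buffer)
print(buffer.getvalue())
```

With 5000 attributes there are roughly 12.5 million distinct pairs, so a pure-Python loop like this would likely be slow; the same idea with a numerical library would be the practical route, and the sketch is only meant to show the filter-before-export step.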