Applying attribute elimination to original data

christopher_sch Member Posts: 1 Contributor I
edited November 2018 in Help

I have a large dataset that has been tokenized.  Many of the token attributes capture identical information, so I need to eliminate some variables that have 100% correlation.

 

Because the dataset is large, I'd like to perform "Remove Correlated Attributes" on a sample, rather than the original, then apply the results from the sample back to the original (eliminating about 1,000 attributes from the original in the process).

 

What's the best way to do this?  I've been messing around with the "Work on Subset" operator, but it seems to only want to pull the sample back without applying the attribute removal to the original. 

 

Thanks for any insight.

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @christopher_sch - welcome to the community.  It seems to me that you should try optimizing via Feature Selection.  Those operators come with some nice tutorials on how to do this.


    Scott

     

  • earmijo Member Posts: 270 Unicorn

    If I understood your question correctly, you want to:

    1) Take a sample of the entire dataset.

    2) Find the variables that are highly correlated.

    3) Drop one of each correlated pair.

    4) Save the names of the variables that survived step 3.

    5) Load the entire dataset.

    6) Keep only the variables saved in step 4.

     

    I think you can do that with a combination of "Remove Correlated Attributes" and "Data to Weights".

     

    In the example below, I split the sample dataset Sonar in two: 

     

    a) first 50 obs

    b) obs 51 to 208

     

    I use the first 50 observations to find correlated attributes (correlation > 0.7) and drop one of each pair. I then save the weights of the attributes that remain in the dataset (using "Data to Weights") and use these weights to filter the second part of the dataset (using "Select by Weights").
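    For readers who want to see the logic outside RapidMiner, here is a minimal Python sketch of the same idea: pick the surviving attributes on the first 50 rows, then keep only those attributes in the remaining rows. It is an illustration under assumptions, not the operator's implementation. The toy columns (`a`, `a_copy`, `b`, `b_scaled`), the helper names, and the greedy "keep the first attribute of each correlated pair" rule are all made up for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def surviving_attributes(rows, names, threshold=0.7):
    """Keep an attribute only if its absolute correlation with every
    attribute kept so far is at or below the threshold (assumed rule)."""
    columns = {name: [row[name] for row in rows] for name in names}
    kept = []
    for name in names:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def select_attributes(rows, kept):
    """Return copies of the rows restricted to the kept attribute names."""
    return [{name: row[name] for name in kept} for row in rows]


# Hypothetical 208-row dataset with two redundant columns.
names = ["a", "a_copy", "b", "b_scaled"]
data = []
for i in range(208):
    a, b = math.sin(i), math.cos(i)
    data.append({"a": a, "a_copy": a, "b": b, "b_scaled": 2 * b + 1})

sample, rest = data[:50], data[50:]                       # rows 1-50 and 51-208
kept = surviving_attributes(sample, names, threshold=0.7)  # decided on the sample only
reduced = select_attributes(rest, kept)                    # applied to the other rows
print(kept)
```

    The list `kept` plays the role of the weights here: it is computed once on the small sample and is the only thing that has to be carried over to the larger data.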

     

    The program of course could be split into two processes: 

    1) Find the weights and save them

    2) Apply weights to entire dataset.
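    The two-process split can be sketched in Python as well, assuming the surviving attribute names are simply written to a file by the first step and read back by the second. The file name `kept_attributes.json` and both helper functions are hypothetical. In RapidMiner the counterpart would presumably be saving the weights object to the repository and loading it in the second process.

```python
import json
import os
import tempfile


def save_kept_names(kept, path):
    """Process 1: persist the names of the attributes that survived."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(kept, fh)


def apply_kept_names(rows, path):
    """Process 2: load the saved names and keep only those attributes."""
    with open(path, encoding="utf-8") as fh:
        kept = json.load(fh)
    return [{name: row[name] for name in kept} for row in rows]


# Process 1 (run once, on the sample): the names here are placeholders.
path = os.path.join(tempfile.mkdtemp(), "kept_attributes.json")
save_kept_names(["a", "b"], path)

# Process 2 (run on the full data): a tiny stand-in for the entire dataset.
full = [
    {"a": 1.0, "a_copy": 1.0, "b": 2.0},
    {"a": 3.0, "a_copy": 3.0, "b": 4.0},
]
reduced_full = apply_kept_names(full, path)
print(reduced_full)
```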

     

    Program Below

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Sonar" width="90" x="45" y="289">
    <parameter key="repository_entry" value="//Samples/data/Sonar"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply" width="90" x="45" y="34"/>
    <operator activated="true" class="filter_example_range" compatibility="7.5.003" expanded="true" height="82" name="Filter Example Range" width="90" x="246" y="34">
    <parameter key="first_example" value="1"/>
    <parameter key="last_example" value="50"/>
    </operator>
    <operator activated="true" class="remove_correlated_attributes" compatibility="7.5.003" expanded="true" height="82" name="Remove Correlated Attributes" width="90" x="380" y="34">
    <parameter key="correlation" value="0.7"/>
    </operator>
    <operator activated="true" class="data_to_weights" compatibility="7.5.003" expanded="true" height="82" name="Data to Weights" width="90" x="514" y="34"/>
    <operator activated="true" class="filter_example_range" compatibility="7.5.003" expanded="true" height="82" name="Filter Example Range (2)" width="90" x="246" y="187">
    <parameter key="first_example" value="51"/>
    <parameter key="last_example" value="208"/>
    </operator>
    <operator activated="true" class="select_by_weights" compatibility="7.5.003" expanded="true" height="103" name="Select by Weights" width="90" x="447" y="187"/>
    <connect from_op="Retrieve Sonar" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Filter Example Range" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Filter Example Range (2)" to_port="example set input"/>
    <connect from_op="Filter Example Range" from_port="example set output" to_op="Remove Correlated Attributes" to_port="example set input"/>
    <connect from_op="Remove Correlated Attributes" from_port="example set output" to_op="Data to Weights" to_port="example set"/>
    <connect from_op="Data to Weights" from_port="weights" to_op="Select by Weights" to_port="weights"/>
    <connect from_op="Filter Example Range (2)" from_port="example set output" to_op="Select by Weights" to_port="example set input"/>
    <connect from_op="Select by Weights" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

     
