Selecting top 15 values

grandonia Member Posts: 9 Contributor II
edited November 2018 in Help

Dear all, I'm working on a file and would like to work only with the top 15 states (the data at index 1-15). How can I do this in a more automated way? At the moment I'm using Filter Examples, but this is time-consuming if you want to select, say, the top 100 items. Is there a faster way to do this?

 

 

[Image: top states.png]

 

Thanks

Hannes

Best Answer

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Solution Accepted

    Hi,

     

    That's possible.  Here is what I would do:

     

    1. Aggregate: group by State and count the rows for each State
    2. Sort the result by decreasing count
    3. Filter Example Range to keep the top 15 (or whatever number you need)
    4. Select Attributes to keep only the State names of those top 15
    5. Join this data set back to the original data set with an inner join. The result is what you want: all rows belonging to those top 15 states
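
    For readers who want to see the logic outside RapidMiner, here is a minimal sketch of the same five steps in plain Python. It is only an illustration, not part of the process in this thread: the function name `rows_of_top_groups`, the `state` and `value` column names, and the sample rows are all made up for this example.

```python
from collections import Counter


def rows_of_top_groups(rows, key, n):
    """Return every row whose `key` value is among the n most frequent values."""
    # Steps 1-2: count the rows per group; most_common() orders by decreasing count.
    counts = Counter(row[key] for row in rows)
    # Steps 3-4: keep only the names of the n most frequent groups.
    top = {value for value, _ in counts.most_common(n)}
    # Step 5: the inner join back to the original rows reduces to a membership test.
    return [row for row in rows if row[key] in top]


# Hypothetical sample data; the column names and values are illustrative only.
sample = [
    {"state": "SP", "value": 1},
    {"state": "RJ", "value": 2},
    {"state": "SP", "value": 3},
    {"state": "MG", "value": 4},
    {"state": "RJ", "value": 5},
    {"state": "SP", "value": 6},
]

top_two = rows_of_top_groups(sample, "state", 2)
print(len(top_two), "rows kept")
```

    Note that the result contains all rows of the most frequent groups, not just the first n rows of a sorted table. If several groups tie at the cut-off, which of them is kept depends on the order in which they first appear.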

     

    Below is a process showing this on the Titanic data set, using Life Boat instead of State and keeping the top 5 instead of the top 15, but the idea is the same:

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.2.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.2.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.2.001" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="85">
            <parameter key="repository_entry" value="//Samples/data/Titanic"/>
          </operator>
          <operator activated="true" class="aggregate" compatibility="7.2.001" expanded="true" height="82" name="Aggregate" width="90" x="179" y="85">
            <list key="aggregation_attributes">
              <parameter key="Life Boat" value="count"/>
            </list>
            <parameter key="group_by_attributes" value="Life Boat"/>
          </operator>
          <operator activated="true" class="sort" compatibility="7.2.001" expanded="true" height="82" name="Sort" width="90" x="313" y="34">
            <parameter key="attribute_name" value="count(Life Boat)"/>
            <parameter key="sorting_direction" value="decreasing"/>
          </operator>
          <operator activated="true" class="filter_example_range" compatibility="7.2.001" expanded="true" height="82" name="Filter Example Range" width="90" x="447" y="34">
            <parameter key="first_example" value="1"/>
            <parameter key="last_example" value="5"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="7.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Life Boat"/>
          </operator>
          <operator activated="true" class="join" compatibility="7.2.001" expanded="true" height="82" name="Join" width="90" x="715" y="85">
            <parameter key="use_id_attribute_as_key" value="false"/>
            <list key="key_attributes">
              <parameter key="Life Boat" value="Life Boat"/>
            </list>
          </operator>
          <connect from_op="Retrieve Titanic" from_port="output" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_op="Sort" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="original" to_op="Join" to_port="right"/>
          <connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
          <connect from_op="Filter Example Range" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Join" to_port="left"/>
          <connect from_op="Join" from_port="join" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="63"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Cheers,

    Ingo

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    You can do it this way: use a Sort operator to sort in descending order, then use Filter Example Range from 1 to 15.

  • grandonia Member Posts: 9 Contributor II

    Hi Thomas, thanks for the quick reply, but this didn't work. I have millions of rows. What I would like to select is the data behind the descriptive statistics (histogram) of the top 15 states, not just the top 15 rows. If you look at index 1, the nominal value is SP (a state), and this state has 396,995 rows. I would like to select those rows plus the rows of 14 more states (which will probably give me more than 1.5 million rows). Hope I'm clearer now :)

  • grandonia Member Posts: 9 Contributor II

    Wow Ingo, you rock!!

     

    I had over 1 million data points and tried loading them into other tools like Orange and KNIME, but come on... no way. RapidMiner rocks! What a fast tool you developed. Love it :) But still lots to learn :)
