🎉 🎉 RAPIDMINER 9.10 IS OUT!!! 🎉🎉

Download the latest version helping analytics teams accelerate time-to-value for streaming and IIOT use cases.


Selecting samples for attributes whose values contributes the most

VanlalVanlal Member Posts: 12 Contributor II
edited September 2020 in Help
I have a attribute job which is a label and has 15 different values.
Out of 1000 samples, 7 values contributes to 950 samples and remaining 8 values contributes to 50 samples.
I want to use only the 950 samples (i.e 7 values only) and ignore the rest.
How do I select the values of the label which contributes the most to the sample?
This chosen-not chosen combination may change ( 8-7,10-5,12-3 etc) depending on the data.

I tried the following approach
1) Count number of occurrence of the values in the whole table (stuck at this point)
2) Rank the values (have no idea)
3) Filter out the chosen-not chosen values (have no idea)

If a better approach/way can be suggested , I will be very grateful

I have the following table
Name Job
John Painting
Kelly Washing
Diamond Carpentry
Clarice Carpentry
Kennedy Washing
Kevin Painting
Hart Painting
Budsey Painting
David Washing

I tried to count the number of occurrence of the values in the whole table which should look like this
Name Job Total Job
John Painting 4
Kelly Washing 3
Diamond Carpentry 2
Clarice Carpentry 2
Kennedy Washing 3
Kevin Painting 4
Hart Painting 4
Budsey Painting 4
David Washing 3

I tried Generate Aggregation but it is updating it wrong
<div><?xml version="1.0" encoding="UTF-8"?><process version="9.6.000"></div><div>&nbsp; <context></div><div>&nbsp; &nbsp; <input/></div><div>&nbsp; &nbsp; <output/></div><div>&nbsp; &nbsp; <macros/></div><div>&nbsp; </context></div><div>&nbsp; <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process"></div><div>&nbsp; &nbsp; <parameter key="logverbosity" value="init"/></div><div>&nbsp; &nbsp; <parameter key="random_seed" value="2001"/></div><div>&nbsp; &nbsp; <parameter key="send_mail" value="never"/></div><div>&nbsp; &nbsp; <parameter key="notification_email" value=""/></div><div>&nbsp; &nbsp; <parameter key="process_duration_for_mail" value="30"/></div><div>&nbsp; &nbsp; <parameter key="encoding" value="SYSTEM"/></div><div>&nbsp; &nbsp; <process expanded="true"></div><div>&nbsp; &nbsp; &nbsp; <operator activated="true" class="retrieve" compatibility="9.6.000" expanded="true" height="68" name="Retrieve job" width="90" x="45" y="34"></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="repository_entry" value="../data/job"/></div><div>&nbsp; &nbsp; &nbsp; </operator></div><div>&nbsp; &nbsp; &nbsp; <operator activated="true" class="generate_aggregation" compatibility="9.6.000" expanded="true" height="82" name="Generate Aggregation" width="90" x="246" y="34"></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="attribute_name" value="TotalJob"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="attribute_filter_type" value="single"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="attribute" value="Job"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="attributes" value="Job"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="use_except_expression" value="false"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="value_type" value="attribute_value"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="use_value_type_exception" value="false"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="except_value_type" value="time"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="block_type" value="attribute_block"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="use_block_type_exception" value="false"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="except_block_type" value="value_matrix_row_start"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="invert_selection" value="false"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="include_special_attributes" value="true"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="aggregation_function" value="count"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="concatenation_separator" value="|"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="keep_all" value="true"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="ignore_missings" value="true"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="ignore_missing_attributes" value="false"/></div><div>&nbsp; &nbsp; &nbsp; </operator></div><div>&nbsp; &nbsp; &nbsp; <connect from_op="Retrieve job" from_port="output" to_op="Generate Aggregation" to_port="example set input"/></div><div>&nbsp; &nbsp; &nbsp; <connect from_op="Generate Aggregation" from_port="example set output" to_port="result 1"/></div><div>&nbsp; &nbsp; &nbsp; <portSpacing port="source_input 1" spacing="0"/></div><div>&nbsp; &nbsp; &nbsp; <portSpacing port="sink_result 1" spacing="0"/></div><div>&nbsp; &nbsp; &nbsp; <portSpacing port="sink_result 2" spacing="0"/></div><div>&nbsp; &nbsp; &nbsp; <portSpacing port="sink_result 3" spacing="0"/></div><div>&nbsp; &nbsp; </process></div><div>&nbsp; </operator></div><div></process>
The output I am getting is
RowNo Name Job TotalJob
1 John Painting 1.0
2 Kelly Washing 1.0
3 Diamond Carpentry 1.0
4 Clarice Carpentry 1.0
5 Kennedy Washing 1.0
6 Kevin Painting 1.0
7 Hart Painting 1.0
8 Budsey Painting 1.0
9 David Washing 1.0


Best Answer


  • VanlalVanlal Member Posts: 12 Contributor II
    I tried the example that you have given me, it is working but the criteria on the Job count is absolute value.
    <?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
      <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="9.6.000" expanded="true" height="68" name="Read Excel" width="90" x="45" y="340">
            <parameter key="excel_file" value="C:\job.xlsx"/>
            <parameter key="sheet_selection" value="sheet number"/>
            <parameter key="sheet_number" value="1"/>
            <parameter key="imported_cell_range" value="A1"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="date_format" value=""/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="Name.true.polynominal.attribute"/>
              <parameter key="1" value="Job.true.polynominal.attribute"/>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
            <description align="center" color="transparent" colored="false" width="126">Change the file path under 'excel file'</description>
          <operator activated="true" class="multiply" compatibility="9.6.000" expanded="true" height="103" name="Multiply" width="90" x="179" y="340"/>
          <operator activated="true" class="aggregate" compatibility="9.6.000" expanded="true" height="82" name="Aggregate" width="90" x="313" y="85">
            <parameter key="use_default_aggregation" value="false"/>
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="default_aggregation_function" value="average"/>
            <list key="aggregation_attributes">
              <parameter key="Job" value="count"/>
            <parameter key="group_by_attributes" value="Job"/>
            <parameter key="count_all_combinations" value="false"/>
            <parameter key="only_distinct" value="false"/>
            <parameter key="ignore_missings" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">Count(job) group by job</description>
          <operator activated="true" class="multiply" compatibility="9.6.000" expanded="true" height="103" name="Multiply (2)" width="90" x="447" y="85"/>
          <operator activated="true" class="extract_macro" compatibility="9.6.000" expanded="true" height="68" name="Extract Macro" width="90" x="581" y="34">
            <parameter key="macro" value="JobCount"/>
            <parameter key="macro_type" value="statistics"/>
            <parameter key="statistics" value="average"/>
            <parameter key="attribute_name" value="count(Job)"/>
            <list key="additional_macros"/>
          <operator activated="true" class="operator_toolbox:replace_rare" compatibility="2.7.000" expanded="true" height="103" name="Replace Rare Values" width="90" x="380" y="391">
            <parameter key="return_preprocessing_model" value="false"/>
            <parameter key="create_view" value="false"/>
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Job"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="use_relative_threshold" value="true"/>
            <parameter key="relative_threshold_value" value="0.01"/>
            <parameter key="threshold" value="3"/>
            <parameter key="replacement_value" value="Other"/>
            <parameter key="replace_if_unknown" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">Replaces all values less frequent then 3 with 'Other'</description>
          <operator activated="true" class="filter_examples" compatibility="9.6.000" expanded="true" height="103" name="Filter Examples" width="90" x="581" y="187">
            <parameter key="parameter_expression" value=""/>
            <parameter key="condition_class" value="custom_filters"/>
            <parameter key="invert_filter" value="false"/>
            <list key="filters_list">
              <parameter key="filters_entry_key" value="count(Job).ge.%{JobCount}"/>
            <parameter key="filters_logic_and" value="true"/>
            <parameter key="filters_check_metadata" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">Filter Away Less Frequent Jobs</description>
          <operator activated="true" class="concurrency:join" compatibility="9.6.000" expanded="true" height="82" name="Join" width="90" x="782" y="238">
            <parameter key="remove_double_attributes" value="true"/>
            <parameter key="join_type" value="outer"/>
            <parameter key="use_id_attribute_as_key" value="false"/>
            <list key="key_attributes">
              <parameter key="Job" value="Job"/>
            <parameter key="keep_both_join_attributes" value="false"/>
            <description align="center" color="transparent" colored="false" width="126">If you want that this acts as a filter: use inner.&lt;br/&gt;If you want that this adds missings for rare jobs use: outer</description>
          <connect from_op="Read Excel" from_port="output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Replace Rare Values" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
          <connect from_op="Multiply (2)" from_port="output 1" to_op="Extract Macro" to_port="example set"/>
          <connect from_op="Multiply (2)" from_port="output 2" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Replace Rare Values" from_port="example set output" to_op="Join" to_port="right"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Join" to_port="left"/>
          <connect from_op="Join" from_port="join" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="420"/>
          <portSpacing port="sink_result 2" spacing="0"/>

    I want the filter value to be taken dynamically instead of choosing a value for them to be filtered.
    If the count was 218,150,156,90,80,40,30,20,1,21,1,11, if i take mean i.e 68.16.
    I try to take examples above this count only..(i don't know whether this approach is good or not.. Any other approach is welcomed)
    So i extract a macro JobCount to take the mean of count of the job and use this for the Filter Example
    Replace Rare values threshold value cannot be set to this macro value.

  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,159  RM Data Scientist
    Hi @Vanlal ,
    I like your solution and this is what I would go for.

    I think your problem with Replace Rare Values is, that it wants you to add a integer value, but your value is a real value. So you would need to add another Generate Macro to round the value. I think Aggregate, ExtractMacro, Filter Example and Join is the way to go. This is very similar to what you would do in a SQL-Query.

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
Sign In or Register to comment.