Selecting samples for attributes whose values contributes the most

Vanlal · September 2020

I have a attribute job which is a label and has 15 different values.
Out of 1000 samples, 7 values contributes to 950 samples and remaining 8 values contributes to 50 samples.
I want to use only the 950 samples (i.e 7 values only) and ignore the rest.
How do I select the values of the label which contributes the most to the sample?
This chosen-not chosen combination may change ( 8-7,10-5,12-3 etc) depending on the data.

I tried the following approach
1) Count number of occurrence of the values in the whole table (stuck at this point)
2) Rank the values (have no idea)
3) Filter out the chosen-not chosen values (have no idea)

If a better approach/way can be suggested , I will be very grateful

I have the following table

Name	Job
John	Painting
Kelly	Washing
Diamond	Carpentry
Clarice	Carpentry
Kennedy	Washing
Kevin	Painting
Hart	Painting
Budsey	Painting
David	Washing

I tried to count the number of occurrence of the values in the whole table which should look like this

Name	Job	Total Job
John	Painting	4
Kelly	Washing	3
Diamond	Carpentry	2
Clarice	Carpentry	2
Kennedy	Washing	3
Kevin	Painting	4
Hart	Painting	4
Budsey	Painting	4
David	Washing	3

I tried Generate Aggregation but it is updating it wrong

<div><?xml version="1.0" encoding="UTF-8"?><process version="9.6.000"></div><div>&nbsp; <context></div><div>&nbsp; &nbsp; <input/></div><div>&nbsp; &nbsp; <output/></div><div>&nbsp; &nbsp; <macros/></div><div>&nbsp; </context></div><div>&nbsp; <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process"></div><div>&nbsp; &nbsp; <parameter key="logverbosity" value="init"/></div><div>&nbsp; &nbsp; <parameter key="random_seed" value="2001"/></div><div>&nbsp; &nbsp; <parameter key="send_mail" value="never"/></div><div>&nbsp; &nbsp; <parameter key="notification_email" value=""/></div><div>&nbsp; &nbsp; <parameter key="process_duration_for_mail" value="30"/></div><div>&nbsp; &nbsp; <parameter key="encoding" value="SYSTEM"/></div><div>&nbsp; &nbsp; <process expanded="true"></div><div>&nbsp; &nbsp; &nbsp; <operator activated="true" class="retrieve" compatibility="9.6.000" expanded="true" height="68" name="Retrieve job" width="90" x="45" y="34"></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="repository_entry" value="../data/job"/></div><div>&nbsp; &nbsp; &nbsp; </operator></div><div>&nbsp; &nbsp; &nbsp; <operator activated="true" class="generate_aggregation" compatibility="9.6.000" expanded="true" height="82" name="Generate Aggregation" width="90" x="246" y="34"></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="attribute_name" value="TotalJob"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="attribute_filter_type" value="single"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="attribute" value="Job"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="attributes" value="Job"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="use_except_expression" value="false"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="value_type" value="attribute_value"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="use_value_type_exception" value="false"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="except_value_type" value="time"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="block_type" value="attribute_block"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="use_block_type_exception" value="false"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="except_block_type" value="value_matrix_row_start"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="invert_selection" value="false"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="include_special_attributes" value="true"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="aggregation_function" value="count"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="concatenation_separator" value="|"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="keep_all" value="true"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="ignore_missings" value="true"/></div><div>&nbsp; &nbsp; &nbsp; &nbsp; <parameter key="ignore_missing_attributes" value="false"/></div><div>&nbsp; &nbsp; &nbsp; </operator></div><div>&nbsp; &nbsp; &nbsp; <connect from_op="Retrieve job" from_port="output" to_op="Generate Aggregation" to_port="example set input"/></div><div>&nbsp; &nbsp; &nbsp; <connect from_op="Generate Aggregation" from_port="example set output" to_port="result 1"/></div><div>&nbsp; &nbsp; &nbsp; <portSpacing port="source_input 1" spacing="0"/></div><div>&nbsp; &nbsp; &nbsp; <portSpacing port="sink_result 1" spacing="0"/></div><div>&nbsp; &nbsp; &nbsp; <portSpacing port="sink_result 2" spacing="0"/></div><div>&nbsp; &nbsp; &nbsp; <portSpacing port="sink_result 3" spacing="0"/></div><div>&nbsp; &nbsp; </process></div><div>&nbsp; </operator></div><div></process>
</div>

The output I am getting is
RowNo Name Job TotalJob

1	John	Painting	1.0
2	Kelly	Washing	1.0
3	Diamond	Carpentry	1.0
4	Clarice	Carpentry	1.0
5	Kennedy	Washing	1.0
6	Kevin	Painting	1.0
7	Hart	Painting	1.0
8	Budsey	Painting	1.0
9	David	Washing	1.0

MartinLiebig · September 2020

Hi @Vanlal ,

you were on the right track. If you want to you can use Aggregate and calculate count(Job) for every Job. The trick is then to join this with your original table.

Maybe more simple is to use the Replace Rare Values operator, which is part of the operator toolbox extension. It just allows you replace every value which is less frequent than X with a value like Other. You can then just filter on it. The two options look like this:

And here is the process for you to copy (after you installed Operator Toolbox, otherwise you cannot use Replace Rare).

Best,

Martin

<?xml version="1.0" encoding="UTF-8"?><process version="9.7.002">
<context>
    <input/>
    <output/>
    <macros/>
</context>
<operator activated="true" class="process" compatibility="9.7.002" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="9.7.002" expanded="true" height="68" name="Read Excel" width="90" x="112" y="340">
        <parameter key="excel_file" value="C:\Users\MartinSchmitz\Downloads\job.xlsx"/>
        <parameter key="sheet_selection" value="sheet number"/>
        <parameter key="sheet_number" value="1"/>
        <parameter key="imported_cell_range" value="A1"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="first_row_as_names" value="true"/>
        <list key="annotations"/>
        <parameter key="date_format" value=""/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="read_all_values_as_polynominal" value="false"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Name.true.polynominal.attribute"/>
          <parameter key="1" value="Job.true.polynominal.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
        <description align="center" color="transparent" colored="false" width="126">Change the file path under 'excel file'</description>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.7.002" expanded="true" height="103" name="Multiply" width="90" x="246" y="340"/>
      <operator activated="true" class="operator_toolbox:replace_rare" compatibility="2.7.000-SNAPSHOT" expanded="true" height="103" name="Replace Rare Values" width="90" x="380" y="493">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Job"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="use_relative_threshold" value="false"/>
        <parameter key="relative_threshold_value" value="0.01"/>
        <parameter key="threshold" value="3"/>
        <parameter key="replacement_value" value="Other"/>
        <parameter key="replace_if_unknown" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">Replaces all values less frequent then 3 with 'Other'</description>
      </operator>
      <operator activated="true" class="aggregate" compatibility="9.7.002" expanded="true" height="82" name="Aggregate" width="90" x="380" y="238">
        <parameter key="use_default_aggregation" value="false"/>
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="default_aggregation_function" value="average"/>
        <list key="aggregation_attributes">
          <parameter key="Job" value="count"/>
        </list>
        <parameter key="group_by_attributes" value="Job"/>
        <parameter key="count_all_combinations" value="false"/>
        <parameter key="only_distinct" value="false"/>
        <parameter key="ignore_missings" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">Count(job) group by job</description>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.7.002" expanded="true" height="103" name="Filter Examples" width="90" x="514" y="85">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="count(Job).gt.3"/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">Filter Away Less Frequent Jobs</description>
      </operator>
      <operator activated="true" class="concurrency:join" compatibility="9.7.002" expanded="true" height="82" name="Join" width="90" x="648" y="238">
        <parameter key="remove_double_attributes" value="true"/>
        <parameter key="join_type" value="outer"/>
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="Job" value="Job"/>
        </list>
        <parameter key="keep_both_join_attributes" value="false"/>
        <description align="center" color="transparent" colored="false" width="126">If you want that this acts as a filter: use inner.<br/>If you want that this adds missings for rare jobs use: outer</description>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Replace Rare Values" to_port="example set input"/>
      <connect from_op="Replace Rare Values" from_port="example set output" to_port="result 2"/>
      <connect from_op="Aggregate" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="original" to_op="Join" to_port="right"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Join" to_port="left"/>
      <connect from_op="Join" from_port="join" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="420"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
</operator>
</process>

Vanlal · September 2020

I tried the example that you have given me, it is working but the criteria on the Job count is absolute value.

<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="9.6.000" expanded="true" height="68" name="Read Excel" width="90" x="45" y="340">
        <parameter key="excel_file" value="C:\job.xlsx"/>
        <parameter key="sheet_selection" value="sheet number"/>
        <parameter key="sheet_number" value="1"/>
        <parameter key="imported_cell_range" value="A1"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="first_row_as_names" value="true"/>
        <list key="annotations"/>
        <parameter key="date_format" value=""/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="read_all_values_as_polynominal" value="false"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Name.true.polynominal.attribute"/>
          <parameter key="1" value="Job.true.polynominal.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
        <description align="center" color="transparent" colored="false" width="126">Change the file path under 'excel file'</description>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.6.000" expanded="true" height="103" name="Multiply" width="90" x="179" y="340"/>
      <operator activated="true" class="aggregate" compatibility="9.6.000" expanded="true" height="82" name="Aggregate" width="90" x="313" y="85">
        <parameter key="use_default_aggregation" value="false"/>
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="default_aggregation_function" value="average"/>
        <list key="aggregation_attributes">
          <parameter key="Job" value="count"/>
        </list>
        <parameter key="group_by_attributes" value="Job"/>
        <parameter key="count_all_combinations" value="false"/>
        <parameter key="only_distinct" value="false"/>
        <parameter key="ignore_missings" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">Count(job) group by job</description>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.6.000" expanded="true" height="103" name="Multiply (2)" width="90" x="447" y="85"/>
      <operator activated="true" class="extract_macro" compatibility="9.6.000" expanded="true" height="68" name="Extract Macro" width="90" x="581" y="34">
        <parameter key="macro" value="JobCount"/>
        <parameter key="macro_type" value="statistics"/>
        <parameter key="statistics" value="average"/>
        <parameter key="attribute_name" value="count(Job)"/>
        <list key="additional_macros"/>
      </operator>
      <operator activated="true" class="operator_toolbox:replace_rare" compatibility="2.7.000" expanded="true" height="103" name="Replace Rare Values" width="90" x="380" y="391">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Job"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="use_relative_threshold" value="true"/>
        <parameter key="relative_threshold_value" value="0.01"/>
        <parameter key="threshold" value="3"/>
        <parameter key="replacement_value" value="Other"/>
        <parameter key="replace_if_unknown" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">Replaces all values less frequent then 3 with 'Other'</description>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.6.000" expanded="true" height="103" name="Filter Examples" width="90" x="581" y="187">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="count(Job).ge.%{JobCount}"/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
        <description align="center" color="transparent" colored="false" width="126">Filter Away Less Frequent Jobs</description>
      </operator>
      <operator activated="true" class="concurrency:join" compatibility="9.6.000" expanded="true" height="82" name="Join" width="90" x="782" y="238">
        <parameter key="remove_double_attributes" value="true"/>
        <parameter key="join_type" value="outer"/>
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="Job" value="Job"/>
        </list>
        <parameter key="keep_both_join_attributes" value="false"/>
        <description align="center" color="transparent" colored="false" width="126">If you want that this acts as a filter: use inner.&lt;br/&gt;If you want that this adds missings for rare jobs use: outer</description>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Replace Rare Values" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
      <connect from_op="Multiply (2)" from_port="output 1" to_op="Extract Macro" to_port="example set"/>
      <connect from_op="Multiply (2)" from_port="output 2" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Replace Rare Values" from_port="example set output" to_op="Join" to_port="right"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Join" to_port="left"/>
      <connect from_op="Join" from_port="join" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="420"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I want the filter value to be taken dynamically instead of choosing a value for them to be filtered.
If the count was 218,150,156,90,80,40,30,20,1,21,1,11, if i take mean i.e 68.16.
I try to take examples above this count only..(i don't know whether this approach is good or not.. Any other approach is welcomed)
So i extract a macro JobCount to take the mean of count of the job and use this for the Filter Example
Replace Rare values threshold value cannot be set to this macro value.

MartinLiebig · September 2020

Hi @Vanlal ,

I like your solution and this is what I would go for.

I think your problem with Replace Rare Values is, that it wants you to add a integer value, but your value is a real value. So you would need to add another Generate Macro to round the value. I think Aggregate, ExtractMacro, Filter Example and Join is the way to go. This is very similar to what you would do in a SQL-Query.

Best,

Martin

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Selecting samples for attributes whose values contributes the most

Best Answer

Answers