Remove attributes with missing values exceeding a given threshold (percentage)

f_lapernaf_laperna Member Posts: 13 Contributor II
edited September 2019 in Help

Hi, I'm new to Rapid Miner. I'm trying to do something very simple but I'm stuck with it. Given my data collection with many attributes I want to remove columns in which there are more than a given percentage of missing values (because I would not be able to use fixed values or infer their values). I tried the Remove Useless Attributes node but still I have columns with almost 90% of missing values so it didn't work as I wanted. Can you help me achieve what I want? It should be something trivial, I remember in Knime there was a specific option in the filter node to specify the percentage threshold.

 

Thank you!

Answers

  • FBTFBT Member Posts: 106 Unicorn

    There are probably a few different ways of doing it, but the easiest I can come up with is using the "Remove Useless Attributes" operator. Please take a look at the example process below (just copy it and paste it into your XML panel, then click the green checkmark):

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Golf" width="90" x="45" y="136">
    <parameter key="repository_entry" value="//Samples/data/Golf"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.6.001" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="136">
    <list key="function_descriptions">
    <parameter key="Missing Value" value="if([Outlook] == &quot;sunny&quot;,1, MISSING_NOMINAL)"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="136">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Outlook|Wind"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" breakpoints="after" class="filter_example_range" compatibility="7.6.001" expanded="true" height="82" name="first 10 examples" width="90" x="447" y="136">
    <parameter key="first_example" value="1"/>
    <parameter key="last_example" value="10"/>
    </operator>
    <operator activated="true" class="remove_useless_attributes" compatibility="7.6.001" expanded="true" height="82" name="Remove Useless Attributes" width="90" x="581" y="136">
    <parameter key="nominal_useless_above" value="0.6"/>
    <parameter key="nominal_useless_below" value="0.5"/>
    </operator>
    <connect from_op="Golf" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="first 10 examples" to_port="example set input"/>
    <connect from_op="first 10 examples" from_port="example set output" to_op="Remove Useless Attributes" to_port="example set input"/>
    <connect from_op="Remove Useless Attributes" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="90"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • mortizmortiz Member Posts: 20 Maven
    I have the same question. If I have 100 attributes and 20 of them are missing 60% of the values, how can I easily scrub them out? The remove useless attribute operator doesn't seem to help with this.
  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @mortiz,

    It is very easy with TURBO PREP : 

     - Open your dataset with Turbo Prep
     - Click on CLEANSE
     - Click on REMOVE LOW QUALITY
     - Set the Max missing (%)
     - Click on COMMIT CLEANSE 

    Hope this helps,

    Regards,

    Lionel
  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi again @mortiz,

    If you don't have access to TURBO PREP, your task can be easily performed by a very simple Python script.
    To execute this process, you will need to : 

     - Install Python on your computer.
     - Install the Python Scripting extension from the MarketPlace.
     - Set the  Max Missing (%) values in a attribute (for this set the threshold called thr in the Set Macros operator).

    The Process : 

    <?xml version="1.0" encoding="UTF-8"?><process version="9.4.000-BETA">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.4.000-BETA" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.4.000-BETA" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="85">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="Att_1,Att_2&#10;1,1&#10;2,2&#10;3,3&#10;4,4&#10;5,5&#10;6,&#10;7,&#10;8,&#10;9,&#10;10,"/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="set_macros" compatibility="9.4.000-BETA" expanded="true" height="82" name="Set Macros" width="90" x="246" y="85">
            <list key="macros">
              <parameter key="thr" value="0.6"/>
            </list>
          </operator>
          <operator activated="true" class="python_scripting:execute_python" compatibility="9.3.001" expanded="true" height="103" name="Execute Python" width="90" x="380" y="85">
            <parameter key="script" value="import pandas as pd&#10;&#10;threshold = %{thr}&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10;  data = data.dropna(thresh=threshold*len(data), axis=1)&#10;    &#10;&#10;    # connect 2 output ports to see the results&#10;  return data"/>
            <parameter key="notebook_cell_tag_filter" value=""/>
            <parameter key="use_default_python" value="true"/>
            <parameter key="package_manager" value="conda (anaconda)"/>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Set Macros" to_port="through 1"/>
          <connect from_op="Set Macros" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
          <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    
    Regards,

    Lionel



  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi @Moritz, @lionelderkrikor,

    there is a operator in toolbox called Select Attributes (Missings) or something like that which does the trick.

    BR
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mortizmortiz Member Posts: 20 Maven
    Thank you for the help. The Turbo Prep option seemed to help
Sign In or Register to comment.