"[SOLVED] Filter attributes against whitelist"

mataio Member Posts: 6 Contributor I
edited June 2019 in Help
Hello everybody,

I have an interesting problem which I could not solve on my own and hope someone can provide some help.

I have a table of data with several attributes and a whitelist of attribute names. Is there any possibility in RapidMiner to filter the attributes based on that list?

Thanks for your help in advance

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,367 RM Data Scientist
    Hello mataio,

    You can do this by keeping the whitelist in your repository or in a CSV/Excel file.

    Basically, you read the list and use Loop Values on it. I've created an example process on random data, using a CSV file with two entries:

    one
    two

    Take care with the execution order: each Remember operator needs to be executed before its associated Recall operator.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.1.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.1.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="6.1.000" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
            <parameter key="target_function" value="multi classification"/>
          </operator>
          <operator activated="true" class="remember" compatibility="6.1.000" expanded="true" height="60" name="Remember" width="90" x="179" y="30">
            <parameter key="name" value="DataSet"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="6.1.000" expanded="true" height="76" name="Create Empty" width="90" x="313" y="30">
            <process expanded="true">
              <operator activated="true" class="filter_examples" compatibility="6.1.000" expanded="true" height="94" name="Filter Examples (2)" width="90" x="45" y="30">
                <parameter key="condition_class" value="all"/>
                <parameter key="invert_filter" value="true"/>
                <list key="filters_list"/>
              </operator>
              <operator activated="true" class="remember" compatibility="6.1.000" expanded="true" height="60" name="Remember (3)" width="90" x="179" y="30">
                <parameter key="name" value="ResultingSample"/>
              </operator>
              <connect from_port="in 1" to_op="Filter Examples (2)" to_port="example set input"/>
              <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Remember (3)" to_port="store"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="read_csv" compatibility="6.1.000" expanded="true" height="60" name="Read CSV" width="90" x="447" y="120">
            <parameter key="csv_file" value="C:\Users\Martin\Rapidforum\List"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations"/>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="att1.true.polynominal.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="loop_values" compatibility="6.1.000" expanded="true" height="76" name="Loop Values" width="90" x="581" y="120">
            <parameter key="attribute" value="att1"/>
            <process expanded="true">
              <operator activated="true" class="recall" compatibility="6.1.000" expanded="true" height="60" name="Recall" width="90" x="313" y="120">
                <parameter key="name" value="DataSet"/>
                <parameter key="remove_from_store" value="false"/>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="6.1.000" expanded="true" height="94" name="Filter Examples" width="90" x="447" y="120">
                <parameter key="parameter_string" value="label=%{loop_value}"/>
                <parameter key="condition_class" value="attribute_value_filter"/>
                <list key="filters_list"/>
              </operator>
              <operator activated="true" class="remember" compatibility="6.1.000" expanded="true" height="60" name="Remember (2)" width="90" x="581" y="120">
                <parameter key="name" value="ResultingSample"/>
              </operator>
              <connect from_port="example set" to_port="out 1"/>
              <connect from_op="Recall" from_port="result" to_op="Filter Examples" to_port="example set input"/>
              <connect from_op="Filter Examples" from_port="example set output" to_op="Remember (2)" to_port="store"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="recall" compatibility="6.1.000" expanded="true" height="60" name="Recall (2)" width="90" x="715" y="120">
            <parameter key="name" value="DataSet"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Remember" to_port="store"/>
          <connect from_op="Remember" from_port="stored" to_op="Create Empty" to_port="in 1"/>
          <connect from_op="Create Empty" from_port="out 1" to_port="result 1"/>
          <connect from_op="Read CSV" from_port="output" to_op="Loop Values" to_port="example set"/>
          <connect from_op="Recall (2)" from_port="result" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • mataio Member Posts: 6 Contributor I
    Thank you for your reply, but I'm looking for something else: my whitelist contains the names of the attributes I want to keep, and the rest should be removed. I don't have a specific attribute of type name.

    Basically, is it possible to use the operator Select Attributes instead of Filter Examples in the loop with the following parameters?
    - filter type: regular expression (?)
    - regular expression: something like attribute_name=%{loop_value}
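
    To make the regular-expression idea concrete, here is a minimal sketch in Python, outside RapidMiner. It joins the whitelist names into one alternation pattern and keeps only the column names that match it completely. The function names and sample column names are hypothetical, and whether the resulting pattern can be pasted unchanged into the regular expression parameter of Select Attributes is an assumption, since Python's escaping may differ from what RapidMiner expects.

```python
import re


def whitelist_regex(names):
    """Join the whitelist names into one alternation pattern.

    Each name is escaped so that characters such as '.' or '(' are
    matched literally instead of being treated as regex syntax.
    """
    return "|".join(re.escape(name) for name in names)


def keep_whitelisted(columns, whitelist):
    """Return the columns whose full name matches one whitelist entry."""
    pattern = re.compile(whitelist_regex(whitelist))
    return [c for c in columns if pattern.fullmatch(c)]


# Hypothetical column names and whitelist, for illustration only.
columns = ["att1", "att2", "att3", "att4", "att5", "label"]
whitelist = ["att1", "att3", "label"]

pattern = whitelist_regex(whitelist)
kept = keep_whitelisted(columns, whitelist)
print(pattern)
print(kept)
```

    Matching the whole name matters here: a name such as att11 should not be kept just because att1 is on the list.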
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,367 RM Data Scientist
    Hi,

    Yes, this is basically one way to go. If you have a pattern for what to filter, e.g. everything that starts with "att", you can use a simple regex for filtering. There are several tutorials around.
    Otherwise you can simply use "single" in Select Attributes and invert the selection. Attached is a process which should help you:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.1.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.1.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="6.1.000" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
            <parameter key="target_function" value="multi classification"/>
          </operator>
          <operator activated="true" class="remember" compatibility="6.1.000" expanded="true" height="60" name="Remember" width="90" x="179" y="30">
            <parameter key="name" value="DataSet"/>
          </operator>
          <operator activated="true" class="read_csv" compatibility="6.1.000" expanded="true" height="60" name="Read CSV" width="90" x="447" y="120">
            <parameter key="csv_file" value="C:\Users\Martin\Rapidforum\List"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations"/>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="att1.true.polynominal.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="loop_values" compatibility="6.1.000" expanded="true" height="76" name="Loop Values" width="90" x="581" y="120">
            <parameter key="attribute" value="att1"/>
            <process expanded="true">
              <operator activated="true" class="recall" compatibility="6.1.000" expanded="true" height="60" name="Recall" width="90" x="313" y="120">
                <parameter key="name" value="DataSet"/>
                <parameter key="remove_from_store" value="false"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="6.1.000" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="120">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="%{loop_value}"/>
                <parameter key="invert_selection" value="true"/>
              </operator>
              <operator activated="false" class="filter_examples" compatibility="6.1.000" expanded="true" height="94" name="Filter Examples" width="90" x="514" y="390">
                <parameter key="parameter_string" value="label=%{loop_value}"/>
                <parameter key="condition_class" value="attribute_value_filter"/>
                <list key="filters_list"/>
              </operator>
              <operator activated="true" class="remember" compatibility="6.1.000" expanded="true" height="60" name="Remember (2)" width="90" x="581" y="120">
                <parameter key="name" value="DataSet"/>
              </operator>
              <connect from_port="example set" to_port="out 1"/>
              <connect from_op="Recall" from_port="result" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Remember (2)" to_port="store"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="recall" compatibility="6.1.000" expanded="true" height="60" name="Recall (2)" width="90" x="715" y="120">
            <parameter key="name" value="DataSet"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Remember" to_port="store"/>
          <connect from_op="Remember" from_port="stored" to_port="result 1"/>
          <connect from_op="Read CSV" from_port="output" to_op="Loop Values" to_port="example set"/>
          <connect from_op="Recall (2)" from_port="result" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
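
    For readers who want to trace the loop in the process above, here is a rough Python simulation of its data flow, not RapidMiner code. A dictionary stands in for the Remember/Recall store, and a list of column names stands in for the example set; the column names and the list contents are assumed for illustration. Because invert_selection is set to true with the "single" filter, each iteration drops the attribute named by loop_value and stores the reduced set back under DataSet.

```python
def run_loop(columns, listed_names):
    """Simulate Remember -> Loop Values -> Recall (2) from the process."""
    store = {"DataSet": list(columns)}           # Remember
    for loop_value in listed_names:              # Loop Values over the CSV entries
        current = store["DataSet"]               # Recall (kept in the store)
        # Select Attributes: filter type "single" plus inverted selection
        # keeps every attribute except the one named by loop_value.
        reduced = [c for c in current if c != loop_value]
        store["DataSet"] = reduced               # Remember (2) overwrites DataSet
    return store["DataSet"]                      # Recall (2)


# Assumed column names and list entries, for illustration only.
columns = ["att1", "att2", "att3", "att4", "att5", "label"]
result = run_loop(columns, ["att1", "att2"])
print(result)
```

    As written, the names in the list are therefore the ones that end up removed from DataSet.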

  • mataio Member Posts: 6 Contributor I
    Thank you so much, worked perfectly :)