Outlier detection: How to change an outlier value to the mean value of the attribute

Ecclesiastes Member Posts: 1 Contributor I
edited November 2018 in Help
Hi,

I'm working on high-dimensional data (>250 attributes) to compare different outlier detection methods.
I have already tested CoF and the distance-based method. They produce totally different results, but that was expected.

However, for further comparison I would like to treat the detected outliers in a simple workflow like this:

1) run outlier detection
2) replace each detected outlier value with the mean value of the attribute
3) run a classifier on the preprocessed data.

Both CoF and density-based outlier detection create a new boolean variable outlier = (true / false).
That means I just need something like a filter that selects the affected value of the attribute and
a simple replacement with the mean value of the attribute.

I have only found a "replace missing value" function, which offers mean replacement,
but not for outliers.

Is there a way to do this sort of value replacement in RapidMiner?
I used RapidMiner today for the first time, so I'm no expert.

Any comments are appreciated.

marvin

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Marvin,
    this is possible, but unfortunately a little bit complicated :) I have appended a process that first performs an outlier detection on artificial data and then selects only the examples where outlier = true. The process then iterates over each of these examples and sets the value of the attribute att1 to unknown, so that you can use the Replace Missing Values operator to assign a new value.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="646" width="714">
          <operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="45" y="75">
            <parameter key="target_function" value="spiral cluster"/>
            <parameter key="number_of_attributes" value="2"/>
          </operator>
          <operator activated="true" class="detect_outlier_cof" expanded="true" height="76" name="Detect Outlier (COF)" width="90" x="179" y="75">
            <parameter key="number_of_neighbors" value="1"/>
          </operator>
          <operator activated="true" class="set_role" expanded="true" height="76" name="Set Role" width="90" x="313" y="75">
            <parameter key="name" value="outlier"/>
          </operator>
          <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="447" y="75">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="Outlier =true"/>
          </operator>
          <operator activated="true" class="loop_examples" expanded="true" height="76" name="Loop Examples" width="90" x="581" y="75">
            <parameter key="iteration_macro" value="exampleIndex"/>
            <process expanded="true" height="646" width="714">
              <operator activated="true" class="set_data" expanded="true" height="76" name="Set Data" width="90" x="45" y="30">
                <parameter key="attribute_name" value="att1"/>
                <parameter key="example_index" value="%{exampleIndex}"/>
                <parameter key="value" value="NaN"/>
              </operator>
              <connect from_port="example set" to_op="Set Data" to_port="example set input"/>
              <connect from_op="Set Data" from_port="example set output" to_port="example set"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Detect Outlier (COF)" to_port="example set input"/>
          <connect from_op="Detect Outlier (COF)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Loop Examples" to_port="example set"/>
          <connect from_op="Filter Examples" from_port="original" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Another, probably more elegant solution would be as follows:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="646" width="714">
          <operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="45" y="75">
            <parameter key="target_function" value="spiral cluster"/>
            <parameter key="number_of_attributes" value="2"/>
          </operator>
          <operator activated="true" class="detect_outlier_cof" expanded="true" height="76" name="Detect Outlier (COF)" width="90" x="179" y="75">
            <parameter key="number_of_neighbors" value="1"/>
          </operator>
          <operator activated="true" class="generate_attributes" expanded="true" height="76" name="Generate Attributes" width="90" x="313" y="75">
            <list key="function_descriptions">
              <parameter key="att1_replaced" value="if(Outlier == &quot;true&quot;, 0/0, att1)"/>
            </list>
          </operator>
          <operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="75">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="att1"/>
            <parameter key="invert_selection" value="true"/>
          </operator>
          <operator activated="true" class="rename" expanded="true" height="76" name="Rename" width="90" x="581" y="75">
            <parameter key="old_name" value="att1_replaced"/>
            <parameter key="new_name" value="att1"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Detect Outlier (COF)" to_port="example set input"/>
          <connect from_op="Detect Outlier (COF)" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Rename" to_port="example set input"/>
          <connect from_op="Rename" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    I hope that one of these processes will suit your needs.


    Greetings,
      Sebastian
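
The idea behind both processes (mark the flagged value as missing, then let a mean replacement fill it in) can be sketched outside RapidMiner in a few lines of Python. This is only an illustration under assumptions: the function names `mask_outliers` and `replace_missing_with_mean` and the sample data are made up for this sketch and are not part of RapidMiner or of the processes above.

```python
import math


def mask_outliers(values, is_outlier):
    """Replace every flagged value with NaN (the 'set to unknown' step)."""
    return [math.nan if flag else v for v, flag in zip(values, is_outlier)]


def replace_missing_with_mean(values):
    """Fill NaN entries with the mean of the remaining, non-missing values."""
    present = [v for v in values if not math.isnan(v)]
    if not present:
        # Nothing left to average over; return the column unchanged.
        return list(values)
    mean = sum(present) / len(present)
    return [mean if math.isnan(v) else v for v in values]


# Hypothetical attribute column and outlier flags, for illustration only.
att1 = [1.0, 2.0, 3.0, 100.0]
outlier = [False, False, False, True]

cleaned = replace_missing_with_mean(mask_outliers(att1, outlier))
print(cleaned)  # [1.0, 2.0, 3.0, 2.0]
```

Note that in this sketch the mean is taken over the non-flagged values only, because the outlier has already been masked before the average is computed. If the mean over all values, outliers included, is wanted instead, it would have to be computed before the masking step.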
  • mksaad Member Posts: 42 Maven
    Hi Sebastian,

    For future reference, outlier detection operators based on neighbors should not be run with number of neighbors = 1, because the nearest neighbor of a given example is then the example itself. This would cause distance-based outlier detection methods to detect outliers improperly.

    Please correct me if I am wrong.

    Regards,
    --Motaz 
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Motaz,
    I think you are right, depending on the definition of neighbor :) If it excludes the actual point, 1 is a good value. But one should at least make sure that it is meant in a reasonable way. I will try to remember to look in the code.

    Greetings,
      Sebastian
  • mksaad Member Posts: 42 Maven
    Yes, you are right, 1 is good and it is quite a simple rule. But I think the code does not exclude the point itself from the example search (at least in the COF and distance-based outlier methods).

    Warm Greetings
    --Motaz
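
The point discussed above, whether a neighbor search counts the query example as its own neighbor, can be illustrated with a small Python sketch. It is a generic brute-force search written for this thread, not RapidMiner's implementation; the function `kth_neighbor_distance`, its `exclude_self` flag, and the sample points are all assumptions made for illustration. Whether the COF and distance-based operators actually exclude the point is not settled here.

```python
import math


def kth_neighbor_distance(points, index, k, exclude_self):
    """Return the distance from points[index] to its k-th nearest neighbor.

    With exclude_self=False the query point itself takes part in the
    search, so for k=1 the result is always 0.0.
    """
    query = points[index]
    dists = sorted(
        math.dist(query, p)
        for i, p in enumerate(points)
        if not (exclude_self and i == index)
    )
    return dists[k - 1]


# Three points near the origin and one obvious outlier (hypothetical data).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
far = 3  # index of the outlying point

with_self = kth_neighbor_distance(points, far, 1, exclude_self=False)
without_self = kth_neighbor_distance(points, far, 1, exclude_self=True)

print(with_self)     # 0.0: the point found itself, so it looks perfectly normal
print(without_self)  # a large distance: the point stands out
```

If the search includes the query point, k = 1 gives every example a neighbor distance of zero and no example can stand out; in that case k = 2 would be the smallest setting that looks at a real neighbor.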