Handling Outliers

mnorthmnorth Member Posts: 7 Contributor II
edited November 2018 in Help
Hi All,

I have a question about outlier handling in Rapid Miner.  Consider the following values of an attribute:

2
3
3
2
4
17
5
3
28
4
5
5
2
5
26
1
3
3
4
2

I have shortened the actual list of attribute values simply for the purposes of this question.  As you can see, most of the observations are between 1 and 5.  With a mean of just over 6, and a standard deviation of about 7.5, three values stand out as outliers, with two of them being more than two standard deviations away from the mean.  With a much larger number of observations, these outliers become even more pronounced.

I've analyzed this variable using the Outlier operators in RapidMiner, and I'm confident that values above 8 are outliers, so my question is not about how to detect the outliers.  My question is about *handling* them.  I can use a filter to remove the entire observation associated with each outlier from my data set, but these observations have other attributes with accurate data that I want to keep and use in my model.--what I really want to do is set the outliers in this attribute to missing, and then use a missing value handler to set them to a normed value -- something like the mean, median or mode.  This way I can keep the good attributes, "fix" the bad ones, and model without losing any observations.  (I'm aware of the risks/issues associated with replacing data using descriptive statistics, and I'm choosing to accept those in this particular instance.)

I've tried doing this six ways from Sunday, but I'm not coming up with good ideas.  I can add a series of "Declare Missing Values" operators in rapid succession and use each one to set each distinct outlier value to missing, but this is a real pain, especially if I have ten million observations with 25 distinct outlier values.  I end up with 25 DMV operators in a row, cluttering my model and making it difficult to tell if I've treated all the outliers or not.  I tried using a DMV numeric_value_filter with an expression of >8, but this parameter must not work the way I think it's supposed to, because I always get an 'attribute name undefined' error when I try that.

So, to boil it down: I would like to be able to set a range or expression on a numeric attribute that would set all values to missing in that attribute if they fall outside the range, or meet the expression parameters, and I'd like to accomplish it with a single operator.  Any suggestions or solutions would be most appreciated.

Matt

Best Answer

  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Solution Accepted
    Hi Matt,
    if the process is cool, just add it to MyExperiment using the Community Extension!

    Greetings,
      Sebastian

Answers

  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Matt,
    what about the following approach:
    1. Copy your dataset using Multiplier.
    2. Filter the original examples so that you have only examples with the attribute values being outliers.
    3. Generate a new Attribute containing missings
    4. Filter the original examples so that you have the examples that are no outliers
    5. Generate a new Attribute with the same name as a copy of the original one
    6. Append both sets.
    7. Remove the original attribute / Rename the new one to the old name.

    Greetings,
      Sebastian
  • mnorthmnorth Member Posts: 7 Contributor II
    Thank you Sebastian.  This is an idea I had not thought of before, and it did work, although it was still more steps than I had hoped for.  Perhaps the ability to set all values in a certain range to missing in a single expression would be a good feature addition, I will add this to the feature request forum.

    Matt
Sign In or Register to comment.