Removing Univariate Outliers (IQR)

dragoljub Member Posts: 241 Contributor II
edited November 2018 in Help
Hi Everyone,

I would like to quickly and easily remove univariate outliers using the interquartile range (IQR). I have looked for an easy way to do this, but I seem to be stuck with the available RM outlier operators. I know RM computes the IQR for the box plots, but is there an operator that can simply do this and drop everything outside, say, 1.5*IQR?

Also, removing these outliers is essential to avoid trouble with z-transform normalization, since the standard deviation can be significantly skewed by a gross outlier. Things you start encountering with real data...

Maybe there is an easy way to do this with the R extension (coming soon)?

Any suggestions?
-Gagi
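
For readers working outside RapidMiner, the filter asked about above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the common Tukey fences (Q1 - 1.5*IQR and Q3 + 1.5*IQR) with linearly interpolated quartiles, and the names `iqr_bounds`, `remove_outliers` and the sample data are made up for the example, not taken from RapidMiner.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    # quantiles() sorts internally; n=4 yields the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical attribute column with one gross outlier (55.0).
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 55.0]
cleaned = remove_outliers(data)
print(cleaned)
```

With this sample the value 55.0 falls outside the upper fence and is dropped, while the other seven values are kept. The multiplier `k` is a parameter so that a looser or stricter fence can be tried.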

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Gagi,
    well, what about generating an attribute that flags whether a value is within 1.5 IQR? You can extract the mean and standard deviation with the Extract Macro operator and then use these values inside the Generate Attributes operator.
    If you are going to make this more usable by implementing an operator, it would be very kind if you would contribute it.


    Greetings,
      Sebastian
  • dragoljub Member Posts: 241 Contributor II
    Hi Sebastian,

    The problem with mean and standard deviation is that they are not robust. For example, if I have a 10 sigma outlier in one of my attribute columns, the mean of that column is severely skewed and the variance is inflated as well. This can be a significant problem when trying to z-transform data for processing.

    IQR is based on the quartiles, which are robust in the same way the median is. I know you can extract the median for a column, but then you also need the upper and lower quartiles. See below:

    image

    I know this can easily be done in R (http://stat.ethz.ch/R-manual/R-devel/library/stats/html/IQR.html). So I might just wait until your R extension is out.

    In any case, having the option to normalize data by zero-mean centering and standard deviation scaling is great; however, it is essential to also have median centering and normalizing by IQR / 1.349. See below:

    For normally N(m,1) distributed X, the expected value of IQR(X) is 2*qnorm(3/4) = 1.3490, i.e., for a normal-consistent estimate of the standard deviation, use IQR(x) / 1.349. 

    This would be a great addition to RM.

    -Gagi  ;D
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    well, if you have a piece of code for this that would fit into the com.rapidminer.operator.preprocessing.normalization.Normalization operator, I would just include this option in the next release. Unfortunately we are currently too busy to add it ourselves; too many things in progress at once...
    Anyway, I think this is a good idea, and if you don't send code, please send in a feature request that is as detailed as possible :)


    Greetings,
      Sebastian
  • dragoljub Member Posts: 241 Contributor II
    Hi Sebastian,

    I will try to get an operator made once R is integrated. Once I get RM building from source I will take a look at modifying the code.

    Thanks,
    -Gagi
  • dragoljub Member Posts: 241 Contributor II
    Just realized IQR made it into the normalization operator! ;D Thanks for integrating this, guys!

    You should have a checklist of things added so we can truly appreciate the good work you do!  ;)

    -Gagi
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    actually we forgot to mention this; so much has been added...

    And actually you have to thank Brendon, who contributed this!

    Greetings,
     Sebastian
  • dragoljub Member Posts: 241 Contributor II
    Yeah, I asked Brendon to implement it since he had more experience building RM from source. Thanks again for taking the time to include it in the latest RM release.

    -Gagi
  • Jeroen8 Member Posts: 1 Contributor I
    @land any update on this? I am also interested in an operator to remove univariate outliers using the interquartile range (IQR).
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    edited September 2021
    That's an old thread :). The Detect Outliers (Univariate) operator in the Operator Toolbox extension allows you to do this.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
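
The robust normalization discussed earlier in the thread (median centering, scaling by IQR / 1.349) can be compared with a plain z-transform in a short Python sketch. It is an illustration only, not the code that went into RapidMiner's normalization operator: the function names `z_scores` and `robust_scores`, the constant name `IQR_TO_SIGMA`, the quartile interpolation method and the sample data are all assumptions made for the example.

```python
from statistics import NormalDist, mean, median, quantiles, stdev

# For a normal distribution the IQR spans 2 * qnorm(3/4) standard
# deviations (about 1.349), so IQR / 1.349 estimates the standard deviation.
IQR_TO_SIGMA = 2 * NormalDist().inv_cdf(0.75)


def z_scores(values):
    """Classic z-transform: centre on the mean, scale by the sample std dev."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def robust_scores(values):
    """Robust variant: centre on the median, scale by IQR / 1.349."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    centre = median(values)
    scale = (q3 - q1) / IQR_TO_SIGMA  # assumes the IQR is not zero
    return [(v - centre) / scale for v in values]


# Hypothetical attribute column with one gross outlier (55.0) in last place.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 55.0]
print("z-score of outlier:     ", z_scores(data)[-1])
print("robust score of outlier:", robust_scores(data)[-1])
```

On this sample the outlier inflates the standard deviation so much that its own z-score stays below 3, whereas the robust score is far above 100. That masking effect is the reason given in the thread for preferring the median and IQR when gross outliers are present.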