# Removing Univariate Outliers (IQR)

Hi Everyone,

I would like to quickly and easily remove univariate outliers using the interquartile range (IQR). I have looked for an easy way to do this, but I seem to be stuck with the outlier operators RM already provides. I know RM computes the IQR for the box plots, but is there an operator that can simply do this and drop everything more than, say, 1.5*IQR beyond the quartiles?

Also, removing these outliers is essential to avoid trouble with z-transform normalization, since the standard deviation can be significantly inflated by a single gross outlier. These are the things you start encountering with real data...
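To make the z-transform concern concrete, here is a minimal sketch in plain Python (not RapidMiner code). The sample values and the helper name `z_scores` are made up for illustration. It shows how a single gross outlier inflates the standard deviation so much that the remaining values collapse into a tiny range after the transform.

```python
import statistics

def z_scores(values):
    """Classic z-transform: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical attribute column: ten well-behaved values around 10.
clean = [9.0, 10.0, 11.0, 10.0, 9.5, 10.5, 10.0, 9.8, 10.2, 10.0]
# The same column with one gross outlier appended (made-up value).
dirty = clean + [1000.0]

std_clean = statistics.pstdev(clean)
std_dirty = statistics.pstdev(dirty)

# After the z-transform of the contaminated column, the ten inliers are
# squeezed into a very narrow band because the outlier dominates the scale.
z_dirty = z_scores(dirty)
spread_of_inliers = max(z_dirty[:-1]) - min(z_dirty[:-1])

print(f"std without outlier: {std_clean:.3f}")
print(f"std with outlier:    {std_dirty:.3f}")
print(f"z-score spread of the inliers: {spread_of_inliers:.5f}")
```

In this toy column the inliers span two units before the transform but only a small fraction of a unit afterwards, which is the kind of trouble described above.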

Maybe there is an easy way to do this with the R extension (coming soon)?

Any suggestions?

-Gagi

## Answers

Well, what about generating an attribute that flags whether a value is within 1.5 IQR? You can extract the mean and standard deviation with the Extract Macro operator and then use these values inside the Generate Attributes operator.
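As a rough illustration of the flagging idea, the sketch below works in plain Python rather than as a RapidMiner process, and it uses quartiles instead of mean and standard deviation. The function names, the sample column, and the choice of the "inclusive" quartile method are assumptions made for this example; other quartile definitions give slightly different fences.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_within_fences(values, k=1.5):
    """One boolean per value: True if it lies inside the fences."""
    lower, upper = iqr_fences(values, k)
    return [lower <= v <= upper for v in values]

def drop_outliers(values, k=1.5):
    """Keep only the values flagged as inside the fences."""
    flags = flag_within_fences(values, k)
    return [v for v, keep in zip(values, flags) if keep]

# Hypothetical attribute column with one gross outlier.
column = [4.0, 7.0, 1.0, 9.0, 100.0, 3.0, 6.0, 2.0, 8.0, 5.0]

flags = flag_within_fences(column)   # the "generated attribute"
cleaned = drop_outliers(column)      # the filtered column
print(iqr_fences(column))
print(cleaned)
```

The flag list plays the role of the generated attribute, and filtering on it corresponds to dropping the flagged examples in a later step.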

If you are going to make this more usable by implementing an operator, it would be very kind if you would contribute it.

Greetings,

Sebastian

The problem with mean and standard deviation is that they are not robust. For example, if I have a 10 sigma outlier in one of my attribute columns, the mean of that column is severely skewed and the variance is inflated as well. This can be a significant problem when trying to z-transform data for processing.

The IQR, like the median, is based on quantiles. I know you can extract the median for a column, but then you also need the upper and lower quartiles. I know this can easily be done in R (http://stat.ethz.ch/R-manual/R-devel/library/stats/html/IQR.html), so I might just wait until your R extension is out.

In any case, having the option to normalize data based on standard deviation and zero mean centering is great; however, it is essential to also have median centering and normalizing by IQR / 1.349. From the R documentation linked above:

For normally N(m,1) distributed X, the expected value of IQR(X) is 2*qnorm(3/4) = 1.3490, i.e., for a normal-consistent estimate of the standard deviation, use IQR(x) / 1.349.

This would be a great addition to RM.
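A minimal sketch of the robust normalization described here, again in plain Python and not as a RapidMiner operator. The function name `robust_z_scores`, the constant name, the sample column, and the "inclusive" quartile method are assumptions for illustration. The constant is computed from the standard normal quantile function, mirroring the `2*qnorm(3/4)` expression quoted above.

```python
import statistics

# 2 * qnorm(3/4): expected IQR of a standard normal distribution (about 1.349).
IQR_TO_SIGMA = 2 * statistics.NormalDist().inv_cdf(0.75)

def robust_z_scores(values):
    """Center on the median and scale by IQR / 1.349 instead of mean and std."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined for this column")
    robust_sigma = iqr / IQR_TO_SIGMA
    return [(v - median) / robust_sigma for v in values]

# Hypothetical attribute column; the last value is a gross outlier.
column = [4.0, 7.0, 1.0, 9.0, 3.0, 6.0, 2.0, 8.0, 5.0, 100.0]
scores = robust_z_scores(column)

print(f"IQR-to-sigma constant: {IQR_TO_SIGMA:.4f}")
print(f"largest |score| among inliers: {max(abs(s) for s in scores[:-1]):.3f}")
print(f"score of the outlier: {scores[-1]:.1f}")
```

Because the median and the quartiles barely move when the outlier is added, the inliers keep a sensible scale and the outlier stands out clearly, unlike with the ordinary z-transform.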

-Gagi ;D

Well, if you have a piece of code for this that would fit into the com.rapidminer.operator.preprocessing.normalization.Normalization operator, I would just include this option in the next release. Unfortunately we are currently too busy to add it ourselves, too many things being worked on at once...

Anyway, I find this a good idea, and if you don't send code, please send in a feature request that is as detailed as possible.

Greetings,

Sebastian

I will try to get an operator made once R is integrated. Once I get RM building from source I will take a look at modifying the code.

Thanks,

-Gagi

You should have a checklist of things added so we can truly appreciate the good work you do!

-Gagi

Actually, we forgot to mention this; so much has been added...

And actually you have to thank brendon who contributed this!

Greetings,

Sebastian

-Gagi
