
# Removing Univariate Outliers (IQR)

Member Posts: 241 Contributor II
edited November 2018 in Help
Hi Everyone,

I would like to quickly and easily remove univariate outliers using the interquartile range (IQR). I have looked for an easy way to do this, but I seem to be stuck with the outlier operators available in RM. I know RM computes the IQR for the box plots, but is there an operator that can simply do this and drop everything outside, say, 1.5*IQR?

Also removing these outliers is essential to avoid trouble with z-transform normalization since the standard deviation can be significantly skewed by a gross outlier. Things you start encountering with real data...
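The 1.5*IQR rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not RapidMiner's implementation: the function name `remove_iqr_outliers`, the parameter `k`, and the sample values are made up, and the quartiles use the linear-interpolation ("inclusive") convention, which may differ slightly from what a given tool computes.

```python
import statistics


def remove_iqr_outliers(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    # 'inclusive' interpolates linearly between data points; other quartile
    # conventions can move the fences slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Made-up sample with one gross outlier (100).
sample = [10, 12, 11, 13, 12, 11, 14, 13, 12, 100]
filtered = remove_iqr_outliers(sample)
print(filtered)
```

For this sample the quartiles are 11.25 and 13.0, so the fences sit at 8.625 and 15.625 and only the value 100 is dropped.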

Maybe there is an easy way to do this with the R extension (coming soon)?

Any suggestions?
-Gagi

## Answers

• RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
Hi Gagi,
well, what about generating an attribute that defines whether a value is within 1.5 IQR? You can extract the mean and standard deviation with the Extract Macro operator and then use these values inside the Generate Attributes operator.
If you are going to make this more usable by implementing an operator, it would be very kind if you would contribute it.

Greetings,
Sebastian
• Member Posts: 241 Contributor II
Hi Sebastian,

The problem with the mean and standard deviation is that they are not robust. For example, if I have a 10-sigma outlier in one of my attribute columns, the mean of that column is severely skewed and the variance is inflated as well. This can be a significant problem when trying to z-transform data for processing.
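A small illustration of this point, using made-up numbers and only the Python standard library: adding a single gross outlier moves the mean and blows up the standard deviation, while the median stays put.

```python
import statistics

# Made-up column values; 'dirty' adds a single gross outlier.
clean = [10, 12, 11, 13, 12, 11, 14, 13, 12]
dirty = clean + [100]

# How far each location estimate moves when the outlier is added.
mean_shift = statistics.mean(dirty) - statistics.mean(clean)
median_shift = statistics.median(dirty) - statistics.median(clean)

# How much the sample standard deviation grows.
std_ratio = statistics.stdev(dirty) / statistics.stdev(clean)

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
print(f"std ratio:    {std_ratio:.1f}")
```

A z-transform computed from the `dirty` column would therefore squeeze all of the regular values into a narrow band, which is the problem described above.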

IQR is based on the median. I know you can extract the median for a column, but then you need the upper and lower quartiles. See below:

I know this can easily be done in R (http://stat.ethz.ch/R-manual/R-devel/library/stats/html/IQR.html). So I might just wait until your R extension is out.

In any case, having the option to normalize data by zero-mean centering and scaling by the standard deviation is great; however, it is essential to also have median centering and normalizing by IQR / 1.349. See below:

For normally N(m,1) distributed X, the expected value of IQR(X) is 2*qnorm(3/4) = 1.3490, i.e., for a normal-consistent estimate of the standard deviation, use IQR(x) / 1.349.
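As a rough sketch of what such a normalization could look like (not the code that later went into RapidMiner), the snippet below centers on the median and divides by IQR / 1.349. The name `robust_scale` and the sample data are hypothetical, and the quartiles again assume the linear-interpolation convention.

```python
import statistics

# Expected IQR of a standard normal distribution: 2 * qnorm(3/4).
NORMAL_IQR = 2 * statistics.NormalDist().inv_cdf(0.75)


def robust_scale(values):
    """Center on the median and scale by IQR / 1.349 (hypothetical helper)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    scale = (q3 - q1) / NORMAL_IQR  # normal-consistent estimate of sigma
    if scale == 0:
        raise ValueError("IQR is zero; cannot scale")
    return [(v - median) / scale for v in values]


# Made-up sample with one gross outlier (100).
sample = [10, 12, 11, 13, 12, 11, 14, 13, 12, 100]
scaled = robust_scale(sample)
print(round(NORMAL_IQR, 4))
print([round(v, 2) for v in scaled])
```

Because neither the median nor the IQR is affected much by the single extreme value, the regular values keep a sensible spread after scaling and the outlier stands out clearly.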

This would be a great addition to RM.

-Gagi  ;D
• RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
Hi,
well, if you have a piece of code for this that would fit into the com.rapidminer.operator.preprocessing.normalization.Normalization operator, I would just include this option in the next release. Unfortunately, we are currently too busy to add it ourselves; there are too many things to work on at once...
Anyway, I find this a good idea, and if you don't send code, please send in a feature request that is as detailed as possible.

Greetings,
Sebastian
• Member Posts: 241 Contributor II
Hi Sebastian,

I will try to get an operator made once R is integrated. Once I get RM building from source I will take a look at modifying the code.

Thanks,
-Gagi
• Member Posts: 241 Contributor II
Just realized IQR made it into the normalization operator! ;D Thanks for integrating this guys!

You should have a checklist of things added so we can truly appreciate the good work you do!

-Gagi
• RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
Hi,
actually we forgot to mention this; so much has been added...

And actually you have to thank Brendon, who contributed this!

Greetings,
Sebastian
• Member Posts: 241 Contributor II
Yes, I asked Brendon to implement it since he had more experience building RM from source. Thanks again for taking the time to include it in the latest RM release.

-Gagi
• Member Posts: 1 Learner II
@land any update on this? I am interested in an operator to remove univariate outliers using the interquartile range (IQR) as well.
• Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,527 RM Data Scientist
edited September 2021
That's an old thread. The Detect Outliers (Univariate) operator in the Operator Toolbox extension allows you to do this.

Best,
Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany