Removing Univariate Outliers (IQR)
Hi Everyone,
I would like to quickly and easily remove univariate outliers using the interquartile range (IQR). I have looked for an easy way to do this, but I seem to be stuck with the available RM outlier operators. I know RM computes the IQR for the box plots, but is there an operator that can simply do this and drop everything that lies, say, more than 1.5*IQR outside the quartiles?
Also, removing these outliers is essential to avoid trouble with z-transform normalization, since the standard deviation can be significantly skewed by a gross outlier. Things you start encountering with real data...
Maybe there is an easy way to do this with the R extension (coming soon)?
Any suggestions?
-Gagi
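For reference, the filter being asked for can be sketched outside RapidMiner in a few lines of Python. This is only an illustration, not an RM operator: the function names `iqr_fences` and `remove_outliers` and the sample data are made up for the example, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`, which is one of several common quartile conventions.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the IQR fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical attribute column with one gross outlier.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 100.0]
cleaned = remove_outliers(data)
print(cleaned)
```

The factor `k` is left as a parameter so that a stricter or looser fence than 1.5 can be tried on the same column.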
Answers
Well, what about generating an attribute that flags whether an example is within 1.5 IQR? You can extract the mean and standard deviation with the Extract Macro operator and then use these values inside the Generate Attributes operator.
If you are going to make this more usable by implementing an operator, it would be very kind if you would contribute it.
Greetings,
Sebastian
The problem with the mean and standard deviation is that they are not robust. For example, if I have a 10 sigma outlier in one of my attribute columns, the mean of that column is severely skewed and the variance is inflated as well. This can be a significant problem when trying to z-transform data for processing.
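A small numeric check of this point, as a sketch with made-up numbers: one gross outlier is appended to an otherwise tight column, and the classical statistics are compared with the median and IQR. The helper name `summary` and the data are assumptions for illustration only.

```python
from statistics import mean, median, quantiles, stdev


def summary(values):
    """Classical and robust location/scale estimates for one column."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "stdev": stdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }


# Hypothetical column, without and with a single gross outlier.
clean = [9.7, 9.8, 9.9, 10.0, 10.1, 10.2, 10.3]
dirty = clean + [100.0]

before = summary(clean)
after = summary(dirty)
for key in before:
    print(f"{key:>6}: {before[key]:8.3f} -> {after[key]:8.3f}")
```

The mean and standard deviation move by a large amount, while the median and IQR barely change.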
IQR is based on the median. I know you can extract the median for a column, but then you need the upper and lower quartiles. See below:
I know this can easily be done in R (http://stat.ethz.ch/R-manual/R-devel/library/stats/html/IQR.html). So I might just wait until your R extension is out.
In any case, having the option to normalize data based on standard deviation and zero mean centering is great. However, it is essential to also have median centering and normalizing by IQR / 1.349. See below:
For normally N(m,1) distributed X, the expected value of IQR(X) is 2*qnorm(3/4) = 1.3490, i.e., for a normal-consistent estimate of the standard deviation, use IQR(x) / 1.349.
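A minimal sketch of that robust normalization in Python, assuming median centering and a scale of IQR / 1.349 as described above. The function name `robust_z` and the sample data are hypothetical, and this is not the RapidMiner Normalization operator.

```python
from statistics import NormalDist, median, quantiles

# 2 * qnorm(3/4): the expected IQR of a standard normal, about 1.349.
IQR_TO_SIGMA = 2 * NormalDist().inv_cdf(0.75)


def robust_z(values):
    """Center on the median and scale by IQR / 1.349."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    scale = (q3 - q1) / IQR_TO_SIGMA
    if scale == 0:
        raise ValueError("IQR is zero; cannot scale this column")
    center = median(values)
    return [(v - center) / scale for v in values]


# Hypothetical attribute column with one gross outlier.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 100.0]
scores = robust_z(data)
print([round(s, 2) for s in scores])
```

Because the center and scale come from the median and quartiles, the outlier gets a very large score while the scores of the remaining values stay on a sensible scale.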
This would be a great addition to RM.
-Gagi ;D
Well, if you have a piece of code for this that would fit into the com.rapidminer.operator.preprocessing.normalization.Normalization operator, I would just include this option in the next release. Unfortunately we are currently too busy to add it ourselves; we are working on too many things at once...
Anyway, I think this is a good idea, and if you don't send code, please send in a feature request that is as detailed as possible.
Greetings,
Sebastian
I will try to get an operator made once R is integrated. Once I get RM building from source I will take a look at modifying the code.
Thanks,
-Gagi
You should have a check list of things added so we can truly appreciate the good work you do!
-Gagi
Actually, we forgot to mention this because so much has been added...
And actually you have to thank brendon, who contributed this!
Greetings,
Sebastian