Options

# capping values above 95 percentile and below 5 percentile for all columns.

vishruth_muthya
Member Posts:

**4**Contributor II have a big data set with 1800+ columns and 125000 rows of data of which 90% are numerical. Before i perform any other opetations first i want to cap my data at 95percentile and 5percentile for all numerical columns.

Can you point me to a solution of suggest me with steps to perform this.

Using RM 9.

0

## Answers

955UnicornHi!

How would you cap your data? Remove every example that is below or above the percentiles in at least one attribute? Or change the values to the minimum and maximum?

Regards,

Balázs

4Contributor Icap all values above 95percentile to 95th percentile value.

1,635UnicornThere may be a more efficient way to do this, but here is one method. This uses a set of macros to extract the values of the 5th and 95th percentile and then cap/floor each numerical attribute accordingly (using a loop).

In theory you would be able to do this a bit more cleanly by using Discretize by Frequency to create your percentile version of the numerics (you need 20 equal size bins of 5% each) and then use that to extract the values you need. However, if there are too many ties and it can't create 20 equal size bins then this operator throws an error.

If someone else has a more elegant solution I would love to see it.

Lindon Ventures

Data Science Consulting from Certified RapidMiner Experts

1,635UnicornActually this reminded me that there is also an "Extract Quantiles" operator in the Statistics extension (available in the Marketplace) that would allow the first part of my solution to be done more efficiently---no need to sort the entire dataset (which could take a while with large datasets) and extract those values via Generate Macros.

Lindon Ventures

Data Science Consulting from Certified RapidMiner Experts

3,520RM Data ScientistHi,

Aggregate will almost for sure support percentiles as an aggregation function with the next RM release. That makes our life easier.

BR,

Martin

Dortmund, Germany

291Unicorneasy and fast. Something resembling REPLACE RARE VALUES operator which I believe was first introduced by my humble requestI am talking about situations where you actually have outliers or erroneous data, but applying outlier detection algorithms or manually constructing the process for capping those is just too time and resource consuming. Honestly, I haven't succeeded at all so far in effective use of any outlier detection operators for such problems.

I am working a lot with user entered data, consequently this is much prone to some erroneous values. For example, here are two variables which represent person's income. Income is something which users enter themselves on a web form.

1) This looks quite normal, the low average is because most values are zeros, but at least the right tail of the distribution is sane:

2) This one has normal (expected) average for an income figure, but some crazy deviation, and obviously the maximum value shouldn't represent an income figure at all:

What would be really nice to have here is an operator which is capable of capping such distributions based on some chosen statistical criteria (min/max values, sigma, etc), and is capable of doing it automatically. For example, for variables like income or age we may assume either a distribution is close to normal or somehow skewed, but still bell shaped distribution with a long (but finite) tail on one end.

I totally agree that such transformation is achievable thru some subprocess, but it is still a matter of designing it, and it's not always trivial; and at the end, the final process would be overcomplicated.

@mschmitz your thoughts?

Vladimir

http://whatthefraud.wtf

3,520RM Data ScientistDortmund, Germany

291UnicornI played around with Tukey Test a bit, but haven't figured a stable way for handling its results.

I just tried to use Normalize to [0-1] also, [un]surprisingly, normalizing the variable values itself OR the Tukey test confidences gives

exactlythe same values! But could you advise how to handle the following output for example? Fact is, that I want to cut the super long right tail of the distribution and values like 450000 are most likely also to be caped, but I have hard time deciding on exactly how to automatically define normalized confidence / normalized income cut-off boundary in that case?UPDATE: performed some brain activity here and the desired solution might be 2-pass: 1st cut finds and removes farthest outlier, 2nd pass caps remaining values over 0.95,

if any.Vladimir

http://whatthefraud.wtf

1,635UnicornLindon Ventures

Data Science Consulting from Certified RapidMiner Experts

291UnicornVladimir

http://whatthefraud.wtf

164RM ResearchI am the author of the TukeyTest operator. I tried to reproduce your results, but don't get exactly the same values. First I am sure that you also normalized the Confidence attribute of the TukeyTest (it would assign Top Outlier only to confidences above 1.0).

Could you also check if your values are really exactly the same, or if they may differ in later digits. If you still have problems or kind of strange results, could you provide example process file and data (PM if it is confidential data)?

For the problem stated: In general the Tukey Test calculates a "confidence" how much the actual value differs from the median of the distribution in terms of the distance between the first quartile (25%) and the third quartile (75%). (Hence it takes into account if the distribution itself is wide or small). Unfortunately the income values will have several such outliers (as you already can see in your examples), which are completely fine (as the income distribution has this large tail). To remove real outliers you could apply a filter on the confidence but with an higher threshold than 1.0. But you cannot use it to exactly remove a specific percentage of the data.

I think using Normalization does also not work in this case, cause the normalized values does not give you any information about the percentiles. Just look at the example from @kypexin, when you cut on 0.05 - 0.95 you would remove all entries.

Unfortunately we don't have any operator to remove a specific top/bottom percentage of the data. I will put it on the roadmap for the operator toolbox.

Best regards,

Fabian

291UnicornFirst, thanks for providing detailed explanation!

I

alwaysforgot about that default rounding in RM! I rechecked once again, the actual numbers differ in other digits, but it is still surprising how perfectly they match when rounded even to the 3rd digit! (and yes, the confidence value is normalized here as well).I was trying to get meaningful results from Tukey test before, but you are right, it really depends on the certain variable distribution (my certain example made me struggle for quite some time to fix), and also more flexible settings control would come handy.

I think above mentioned WINSORIZE operator is really really close to what I really was seeking; sad it's a part of an unsupported and abandoned extension so its future is questionable. It would be great to have its analog in Operator Toolbox, because it utilizes pretty straightforward and useful technique for numeric data preparation.

Vladimir

http://whatthefraud.wtf