Capping values above the 95th percentile and below the 5th percentile for all columns

vishruth_muthya Member Posts: 4 Contributor I
edited December 2018 in Help

I have a big data set with 1800+ columns and 125,000 rows, of which 90% of the columns are numerical. Before I perform any other operations, I first want to cap my data at the 95th and 5th percentiles for all numerical columns.

Can you point me to a solution or suggest steps to perform this?

 

Using RM 9.

Answers

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    Hi!

     

    How would you cap your data? Remove every example that is below or above the percentiles in at least one attribute? Or change the values to the minimum and maximum?

     

    Regards,

    Balázs

  • vishruth_muthya Member Posts: 4 Contributor I

    Cap all values above the 95th percentile at the 95th percentile value.
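
    For reference, here is the requested capping expressed outside RapidMiner: a minimal pandas sketch (the toy DataFrame and column names are made up for illustration):

    import numpy as np
    import pandas as pd

    # toy example set standing in for the real data (1800+ numeric columns)
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"income": rng.lognormal(10, 1, 1000),
                       "age": rng.normal(40, 12, 1000)})

    num_cols = df.select_dtypes(include="number").columns
    lower = df[num_cols].quantile(0.05)  # per-column 5th-percentile values
    upper = df[num_cols].quantile(0.95)  # per-column 95th-percentile values
    # clip() caps every value at its column's own percentile limits
    df[num_cols] = df[num_cols].clip(lower=lower, upper=upper, axis=1)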

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    There may be a more efficient way to do this, but here is one method.  It uses a set of macros to extract the values at the 5th and 95th percentiles and then caps/floors each numerical attribute accordingly (using a loop).

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="9.0.002" expanded="true" height="68" name="Retrieve Iris" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Samples/data/Iris"/>
    </operator>
    <operator activated="true" class="concurrency:loop_attributes" compatibility="9.0.002" expanded="true" height="82" name="Loop Attributes" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="value_type"/>
    <parameter key="value_type" value="numeric"/>
    <parameter key="reuse_results" value="true"/>
    <process expanded="true">
    <operator activated="true" class="sort" compatibility="9.0.002" expanded="true" height="82" name="Sort" width="90" x="45" y="34">
    <parameter key="attribute_name" value="%{loop_attribute}"/>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="9.0.002" expanded="true" height="68" name="Extract Macro" width="90" x="179" y="34">
    <parameter key="macro" value="num_examples"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="generate_macro" compatibility="9.0.002" expanded="true" height="82" name="Generate Macro" width="90" x="313" y="34">
    <list key="function_descriptions">
    <parameter key="5perc" value="floor(0.05*eval(%{num_examples}))"/>
    <parameter key="95perc" value="floor(0.95*eval(%{num_examples}))"/>
    </list>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="9.0.002" expanded="true" height="68" name="Extract Macro (2)" width="90" x="447" y="34">
    <parameter key="macro" value="floor"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="%{loop_attribute}"/>
    <parameter key="example_index" value="%{5perc}"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="9.0.002" expanded="true" height="68" name="Extract Macro (3)" width="90" x="581" y="34">
    <parameter key="macro" value="cap"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="%{loop_attribute}"/>
    <parameter key="example_index" value="%{95perc}"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="9.0.002" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="34">
    <list key="function_descriptions">
    <parameter key="%{loop_attribute}" value="if(eval(%{loop_attribute})&lt;eval(%{floor}),eval(%{floor}),eval(%{loop_attribute}))"/>
    <parameter key="%{loop_attribute}" value="if(eval(%{loop_attribute})&gt;eval(%{cap}),eval(%{cap}),eval(%{loop_attribute}))"/>
    </list>
    </operator>
    <connect from_port="input 1" to_op="Sort" to_port="example set input"/>
    <connect from_op="Sort" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Generate Macro" to_port="through 1"/>
    <connect from_op="Generate Macro" from_port="through 1" to_op="Extract Macro (2)" to_port="example set"/>
    <connect from_op="Extract Macro (2)" from_port="example set" to_op="Extract Macro (3)" to_port="example set"/>
    <connect from_op="Extract Macro (3)" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve Iris" from_port="output" to_op="Loop Attributes" to_port="input 1"/>
    <connect from_op="Loop Attributes" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
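
    Per attribute, the loop above boils down to: sort, read the values at rows floor(0.05*n) and floor(0.95*n), and clip to that range. A rough Python transcription of the same logic (note that RapidMiner's example index is 1-based while Python's is 0-based):

    import numpy as np

    def cap_by_sorted_index(values: np.ndarray) -> np.ndarray:
        """Mimic the macro approach: take the sorted values at the
        5%/95% positions as floor/cap thresholds, then clip."""
        s = np.sort(values)
        n = len(s)
        floor_val = s[int(np.floor(0.05 * n))]  # 5th-percentile value
        cap_val = s[int(np.floor(0.95 * n))]    # 95th-percentile value
        return np.clip(values, floor_val, cap_val)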

    In theory you could do this a bit more cleanly by using Discretize by Frequency to create a percentile version of the numerics (you need 20 equal-size bins of 5% each) and then use that to extract the values you need.  However, if there are too many ties and it can't create 20 equal-size bins, this operator throws an error.
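
    The ties problem is easy to reproduce outside RapidMiner too; as a loose analogy to Discretize by Frequency, pandas' qcut fails the same way when the 5% bin edges are not unique:

    import pandas as pd

    x = pd.Series([0] * 900 + list(range(100)))  # 90% zeros: heavy ties
    try:
        pd.qcut(x, q=20)  # 20 equal-frequency bins of 5% each
    except ValueError as err:
        print(err)  # "Bin edges must be unique" -- the same failure mode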

    If someone else has a more elegant solution I would love to see it.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Actually, this reminded me that there is also an "Extract Quantiles" operator in the Statistics extension (available in the Marketplace) that would let the first part of my solution be done more efficiently: no need to sort the entire dataset (which could take a while with large datasets) and extract those values via Generate Macro.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    Aggregate will almost certainly support percentiles as an aggregation function in the next RM release.  That will make our lives easier.

     

    BR,

    Martin

     

     

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn
    As I found this thread while looking for a solution to more or less the same problem, I'd like to affirm that there's a real need for an operator that can perform this sort of capping on numerical data easily and quickly. Something resembling the Replace Rare Values operator, which I believe was first introduced at my humble request :)

    I am talking about situations where you actually have outliers or erroneous data, but applying outlier detection algorithms or manually constructing a capping process is just too time- and resource-consuming. Honestly, I haven't yet succeeded in using any of the outlier detection operators effectively for such problems.

    I work a lot with user-entered data, which is consequently prone to erroneous values. For example, here are two variables that represent a person's income. Income is something users enter themselves on a web form.

    1) This one looks quite normal; the low average is because most values are zeros, but at least the right tail of the distribution is sane:

    [screenshot: distribution statistics for variable 1]

    2) This one has a normal (expected) average for an income figure, but some crazy deviation, and obviously the maximum value shouldn't represent an income figure at all:

    [screenshot: distribution statistics for variable 2]

    What would be really nice to have here is an operator capable of capping such distributions based on some chosen statistical criteria (min/max values, sigma, etc.), and of doing it automatically. For example, for variables like income or age we may assume the distribution is either close to normal or somewhat skewed, but still bell-shaped with a long (but finite) tail on one end.

    I totally agree that such a transformation is achievable through some subprocess, but it is still a matter of designing it, which is not always trivial; and in the end, the final process would be overcomplicated.

    @mschmitz your thoughts? :)
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Yep, you were the trigger for writing the Replace Rare Values operator :)

    On your problem: did you try the Tukey Test? It's similar to what you want to do. It wouldn't surprise me if it were exactly the same for distributions where the assumption applies (exponential family).

    Another idea: Normalize to [0-1], cut at 0.05-0.95, then de-normalize?
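
    A rough sketch of that idea (a paraphrase, not an existing operator; see the caveat in tftemme's reply further down):

    import numpy as np

    def normalize_cut_denormalize(x: np.ndarray) -> np.ndarray:
        lo, hi = x.min(), x.max()
        z = (x - lo) / (hi - lo)    # normalize to [0, 1]
        z = np.clip(z, 0.05, 0.95)  # cut at 0.05 / 0.95
        return z * (hi - lo) + lo   # de-normalize
    # Note: this cuts at 5%/95% of the value *range*, not at the 5th/95th
    # *percentiles*, so a single extreme outlier dominates the result.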

    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn
    edited February 2019
    Hey @mschmitz

    I played around with the Tukey Test a bit, but haven't figured out a stable way of handling its results.

    I also just tried Normalize to [0-1] and, [un]surprisingly, normalizing the variable values themselves OR the Tukey test confidences gives exactly the same values! But could you advise how to handle the following output, for example? The fact is, I want to cut the super-long right tail of the distribution, and values like 450000 most likely also need to be capped, but I have a hard time deciding exactly how to automatically define the normalized confidence / normalized income cut-off boundary in that case.

    UPDATE: after some brain activity here, the desired solution might be two-pass: the first pass finds and removes the farthest outlier; the second pass caps remaining values above 0.95, if any.

    [screenshot: Tukey Test output with normalized values]
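
    A rough sketch of that two-pass idea (a hypothetical helper, reading the 0.95 cut as the 95th percentile):

    import numpy as np

    def two_pass_cap(x: np.ndarray) -> np.ndarray:
        # pass 1: find and remove the farthest outlier
        x = np.delete(x, np.argmax(np.abs(x - np.median(x))))
        # pass 2: cap whatever still exceeds the 95th percentile
        return np.minimum(x, np.quantile(x, 0.95))
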
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    There is actually an operator that does this very easily, called Winsorize, which is part of the Finance and Economics extension.  As many of the folks on this thread know, that extension has some problems, so beware, but this operator seems to work pretty well.  Martin, you might also take a look at it in case you want to program your own version for the next release of the Operator Toolbox.
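
    For comparison outside RapidMiner, SciPy ships the same technique as scipy.stats.mstats.winsorize:

    import numpy as np
    from scipy.stats import mstats

    x = np.array(list(range(1, 20)) + [1000])  # 20 values, one extreme outlier
    # limits=(0.05, 0.05): replace the bottom and top 5% (one value each
    # here) with the nearest remaining values instead of dropping them
    x_wins = mstats.winsorize(x, limits=(0.05, 0.05))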

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn
    Yay @Telcontar120! Thanks for pointing this out :) I will definitely give it a try.
  • tftemme Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 164 RM Research
    Hi @kypexin

    I am the author of the Tukey Test operator. I tried to reproduce your results, but I don't get exactly the same values. First, I want to make sure that you also normalized the Confidence attribute of the Tukey Test (it assigns Top Outlier only to confidences above 1.0).
    Could you also check whether your values are really exactly the same, or whether they differ in later digits? If you still have problems or strange results, could you provide an example process file and data (PM me if the data is confidential)?

    For the problem stated: in general, the Tukey Test calculates a "confidence" for how much the actual value differs from the median of the distribution, in terms of the distance between the first quartile (25%) and the third quartile (75%). (Hence it takes into account whether the distribution itself is wide or narrow.) Unfortunately, the income values will have several such outliers (as you can already see in your examples) which are completely fine, since the income distribution has this large tail. To remove real outliers you could apply a filter on the confidence, but with a higher threshold than 1.0. You cannot use it, however, to remove exactly a specific percentage of the data.
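
    Based on that description, the confidence can be sketched roughly as follows (a paraphrase of the explanation above, not the operator's actual source code):

    import numpy as np

    def tukey_confidence(x: np.ndarray) -> np.ndarray:
        q1, med, q3 = np.percentile(x, [25, 50, 75])
        # distance from the median, in units of the interquartile range
        return np.abs(x - med) / (q3 - q1)

    # filter with a higher threshold than 1.0, as suggested:
    # x_kept = x[tukey_confidence(x) <= 2.0]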

    I think using Normalization also does not work in this case, because the normalized values do not give you any information about the percentiles. Just look at the example from @kypexin: when you cut at 0.05 - 0.95, you would remove virtually all entries.

    Unfortunately, we don't have any operator to remove a specific top/bottom percentage of the data. I will put it on the roadmap for the Operator Toolbox.

    Best regards,
    Fabian
  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn
    Hi @tftemme

    First, thanks for providing the detailed explanation!

    I always forget about that default rounding in RM! I rechecked once again: the actual numbers do differ in later digits, but it is still surprising how perfectly they match when rounded even to the 3rd digit! (And yes, the confidence value is normalized here as well.)

    I was trying to get meaningful results from the Tukey test before, but you are right, it really depends on the particular variable's distribution (this particular example made me struggle for quite some time), and more flexible settings control would also come in handy.

    I think the above-mentioned Winsorize operator is really close to what I was seeking; it's sad that it's part of an unsupported and abandoned extension, so its future is questionable. It would be great to have an analog in the Operator Toolbox, because it uses a pretty straightforward and useful technique for numeric data preparation.