add ability to calculate arbitrary percentile values (easily)

Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

As any novice analyst knows, summarizing data with percentiles is part of basic exploratory data analysis.  So I was very surprised that RapidMiner doesn't appear to have this functionality built in: I don't see any easy way to calculate the percentile values of a given numerical attribute.  For example, in the quartile graph, the box is based on the 25th percentile, the median, and the 75th percentile, and the whiskers show the 5th and 95th percentiles (I believe).  But there doesn't appear to be a straightforward way to generate that same information numerically from the dataset.  Ideally it would be done via an operator with an arbitrary percentile parameter (like in Excel), where you can simply enter the percentile from 1 to 100 that you want to see.
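To make the request concrete, here is a minimal Python sketch of the kind of calculation being asked for. It is not RapidMiner code. The function name `percentile` and the sample data are made up for illustration, and it assumes the linear-interpolation convention used by Excel's PERCENTILE.INC; other percentile conventions give slightly different values.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Return the p-th percentile (0 <= p <= 100) using linear interpolation.

    Hypothetical helper for illustration; assumes the PERCENTILE.INC-style
    convention where the rank is p/100 * (n - 1) in the sorted data.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted data.
    rank = (p / 100) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return float(ordered[lower])
    # Interpolate between the two neighbouring observations.
    weight = rank - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


# Illustrative data only.
data = [2, 4, 4, 5, 7, 9, 10, 12, 15, 20]
box_plot_values = {p: percentile(data, p) for p in (5, 25, 50, 75, 95)}
print(box_plot_values)
```

The dictionary at the end collects the same five values the post associates with the quartile graph (5%, 25%, median, 75%, 95%).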

 

It should be set up so you can also access this percentile function from the aggregate menu, so you would have those values to compare to the average and median, which are available there now.

 

P.S.  I know you can try to get at this by using the binning operator, but this is quite cumbersome and doesn't give you the output in a way that is easy to use.  So I don't regard that as an adequate substitute.

Brian T.
Lindon Ventures 
Data Science Consulting from Certified RapidMiner Experts

Fixed and Released · Last Updated

Comments

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    Isn't this functionality in the Statistics Extension from Old World Computing?  

     

    I would like to see an operator that extracts statistics like this across the dataset in a summary table in a similar way to R's describe functions. 

    In addition, an 'advanced statistics' tab would be very handy.  @land, do you think you'll be able to include any of this in a future update?
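As a rough illustration of the summary table described above, here is a Python sketch using only the standard library. The helper name `describe`, the column names, and the sample values are hypothetical; this is not the output format of R or of any RapidMiner operator, just one way such a per-attribute table could look.

```python
import statistics
from typing import Dict, List


def describe(columns: Dict[str, List[float]]) -> List[Dict[str, object]]:
    """Build one summary row per numeric column (hypothetical helper)."""
    rows = []
    for name, values in columns.items():
        # method="inclusive" treats the data as the full population,
        # so the minimum and maximum are the 0th and 100th percentiles.
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        rows.append({
            "attribute": name,
            "count": len(values),
            "mean": statistics.fmean(values),
            "min": min(values),
            "p25": q1,
            "median": median,
            "p75": q3,
            "max": max(values),
        })
    return rows


# Illustrative data only.
table = describe({
    "age": [23, 35, 41, 52, 67],
    "income": [18.0, 22.5, 30.0, 41.0, 55.5, 90.0],
})
for row in table:
    print(row)
```

Each row holds the statistics for one attribute, so the percentiles can be compared side by side with the mean and median.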

     

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn

    Hi,

     

    John is right, that's part of the functionality offered by our Statistics Extension. Unfortunately it is not yet on the marketplace, but we plan to move it there as soon as possible.

    You can get more information and a download link on our website, oldworldcomputing.com

    About the advanced statistics tab: I'm not sure that it is so easy to add, or that the additional benefit would outweigh the effort. I almost never use percentiles and use histograms instead (if I look at the data myself at all).

     

    Greetings,

      Sebastian

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Oh, this is good news.  Thanks, Sebastian.  I have had to do percentiles in quite an archaic way.  Can it do normal distribution probabilities and inverses?  I keep hoping that it appears in the "calculator" for Generate Attributes one of these days...

     

    Scott

     

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn

    Do you want to calculate the density of a normal distribution at a given point x? Or what are you referring to? I don't think that this is possible right now. You can check with a t-test whether a population matches a given normal distribution, but not against a single point. But it shouldn't be difficult to compute. I think we can add this as a feature request for the next version.

     

    Greetings,

      Sebastian

     

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Yes, exactly.  The Excel equivalent would be the NORM.DIST and NORM.INV functions.

     

    Scott
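For readers unfamiliar with the two Excel functions mentioned above, here is a small Python sketch of the equivalent calculations using the standard library's `statistics.NormalDist`. It is not RapidMiner or Statistics Extension code, and the mean of 100 and standard deviation of 15 are arbitrary values chosen for illustration.

```python
from statistics import NormalDist

# Illustrative parameters only: mean 100, standard deviation 15.
dist = NormalDist(mu=100.0, sigma=15.0)

# Density at a single point x, like NORM.DIST(115, 100, 15, FALSE).
density = dist.pdf(115.0)

# Cumulative probability P(X <= x), like NORM.DIST(115, 100, 15, TRUE).
cumulative = dist.cdf(115.0)

# Inverse of the cumulative distribution, like NORM.INV(0.95, 100, 15).
x_at_95 = dist.inv_cdf(0.95)

print(density, cumulative, x_at_95)
```

The density call corresponds to the "density of a normal distribution at a given point x" question, while the cumulative and inverse calls cover the probabilities and inverses asked about earlier in the thread.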

     

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn

    Hi Scott,

    yes, that should be quite easy. We could include it in the next release of the extension. It would have been handy for me at times, too.

     

    Greetings,

       Sebastian

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    see Statistics Extension

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    FYI, this has now been resolved with the addition of percentile calculations to the base Aggregate operator.
    Brian T.