Add a native Rank operator to RapidMiner Studio

Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
There have been several recent threads asking how to calculate ranks in RapidMiner. Currently there is a Rank operator in the old, unsupported (and somewhat buggy) Finance & Economics extension, but it is hard to recommend that solution, especially to newer users. The alternative using native RapidMiner operators is cumbersome and complex for something as conceptually simple as a rank calculation. It would be much easier if RapidMiner simply added a native Rank operator to the basic data ETL toolkit.
Brian T.
Lindon Ventures 
Data Science Consulting from Certified RapidMiner Experts
4 votes

Open for Voting

PROD-176

Comments

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    What would the Rank operator do?
    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    @mschmitz It would calculate the numerical rank of each example based on the values of one or more specified numerical attributes. For a single attribute, it is equivalent to sorting the examples by that attribute and then assigning a sequential numerical ID. Take a look at the Rank operator in the Finance & Economics extension for a working example that can be applied to any arbitrary set of numerical attributes at once.
    A more sophisticated version would also provide options for sorting ascending vs. descending, for handling tied values (assign lowest rank, assign highest rank, or assign midpoint rank), and for either replacing the original attribute or adding a new attribute with the rank value.
    This is conceptually similar to assigning a percentile value to every example. There are many contexts in which this is a useful transformation, including many non-parametric calculations, or using rank values rather than raw values as predictors in models to eliminate scale effects (e.g., of outliers) while preserving ordinality.
    This can all be done manually in RapidMiner today, but it requires a daisy chain of operators (e.g., Generate Copy, Sort, Generate ID, etc.) that it would be nice to combine into one simple operator.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
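
The tie-handling and sort-direction options described in the post above can be sketched in plain Python. This is only an illustration of the requested behaviour, not RapidMiner code; the function name `rank` and the option names `ties` and `ascending` are hypothetical.

```python
from typing import List, Sequence

def rank(values: Sequence[float], ties: str = "average",
         ascending: bool = True) -> List[float]:
    """Return the 1-based rank of each value, in the original order."""
    # Indices of the values in sorted order (the "Sort" step).
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not ascending)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Find the run of tied values that begins at position `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        lowest, highest = start + 1, end + 1  # 1-based positions of the run
        if ties == "min":          # every tied example gets the lowest rank
            shared = float(lowest)
        elif ties == "max":        # every tied example gets the highest rank
            shared = float(highest)
        elif ties == "average":    # every tied example gets the midpoint rank
            shared = (lowest + highest) / 2
        else:
            raise ValueError(f"unknown tie rule: {ties}")
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

scores = [10, 20, 20, 30]
print(rank(scores, ties="min"))      # [1.0, 2.0, 2.0, 4.0]
print(rank(scores, ties="max"))      # [1.0, 3.0, 3.0, 4.0]
print(rank(scores, ties="average"))  # [1.0, 2.5, 2.5, 4.0]
print(rank(scores, ties="average", ascending=False))  # [4.0, 2.5, 2.5, 1.0]
```

The three tie rules differ only for the two examples that share the value 20; untied examples get the same rank under every rule.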
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    We always face the same trade-off: number of operators vs. ease of use. If it's just Sort + Generate ID, I would oppose a new operator. It only makes sense if more is involved than "just this", i.e., your percentiles.

    @tftemme thoughts?

    BR,
    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    @mschmitz But the reality is that this is a very commonly required transformation. And you often want to do it on a whole set of attributes at once, which means four operators (Sort, Generate ID, Set Role, and Rename) inside a Loop Attributes. In my mind that's enough of a hassle to justify a separate operator.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Also, the method above doesn't handle ties well, and addressing them properly in the rank values requires even more complexity.
    P.S. I'd like there to be a Percentile operator for exactly the same reason! Once again, it can be done manually using a Loop and operators similar to the ones above, with the additional complexity of calculating the percentile value from the raw rank value.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
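
To make the tie problem concrete, here is a minimal Python sketch that mimics the manual Sort followed by Generate ID chain and then converts the raw rank to a percentile. The helper names are hypothetical, and `100 * rank / n` is just one common percentile-rank convention, assumed here for illustration.

```python
def sort_and_number(values):
    """Mimic Sort followed by Generate ID: 1-based position in sorted order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ids = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ids[index] = position
    return ids

def to_percentile(raw_ranks):
    """Convert 1-based ranks to percentile ranks as 100 * rank / n (assumed convention)."""
    n = len(raw_ranks)
    return [100.0 * r / n for r in raw_ranks]

values = [30, 20, 20, 10]
ids = sort_and_number(values)
print(ids)                 # [4, 2, 3, 1]
print(to_percentile(ids))  # [100.0, 50.0, 75.0, 25.0]
```

The two examples with the value 20 receive different IDs (2 and 3) and therefore different percentiles (50 and 75), purely because of their position in the sort. Fixing that is the extra complexity the post refers to.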
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    You are aware that Aggregate can now calculate percentiles?

    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    @mschmitz Of course, but it calculates specific requested percentile values; it does not easily provide percentile rankings for all examples. Those are two related but different operations.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
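
The distinction drawn above can be illustrated with a small Python sketch. It is not how Aggregate is implemented: the nearest-rank definition of a percentile and the "share of examples at or below" definition of a percentile rank are assumptions chosen for simplicity, and both function names are hypothetical.

```python
import math

def percentile_value(values, p):
    """Nearest-rank percentile: one value summarising the whole attribute."""
    ordered = sorted(values)
    k = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[k - 1]

def percentile_ranks(values):
    """Percentile rank of every example: share of examples at or below its value."""
    n = len(values)
    return [100.0 * sum(1 for other in values if other <= v) / n for v in values]

data = [12, 7, 3, 7, 25]
print(percentile_value(data, 50))  # 7
print(percentile_ranks(data))      # [80.0, 60.0, 20.0, 60.0, 100.0]
```

The first call answers "which value sits at the 50th percentile?" with a single number; the second answers "at which percentile does each example sit?" with one number per example.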
  • tftemme Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 164 RM Research
    Hi @Telcontar120 , @mschmitz

    I think four or more operators for one frequent transformation is enough to justify combining them into one operator. I will create a ticket for the Operator Toolbox. We will have to see how best to fit it in. If you have further details on how the operator should work or what options it should provide, feel free to post them. The more detail, the better.

    Best regards,
    Fabian

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    @tftemme Feel free to reach out via PM if you want me to explain the specifications I listed above in further detail. Automatic attribute copying/renaming, tie handling, and multi-attribute selection would probably be the most important time-saving options to include.
    I realized you could also have a single operator handle both raw ranks and percentile ranks, with another option to control the output format (rank vs. percentile rank).

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
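
As a rough sketch of how the options listed above might fit together in a single operator, the Python below ranks several attributes at once and either adds new attributes or replaces the originals. Everything here is hypothetical: the table is a plain dict of columns, and the names `rank_column`, `add_ranks`, `suffix`, `replace`, `ties` and `as_percentile` are invented for illustration and do not correspond to any RapidMiner API.

```python
def rank_column(values, ties="average", ascending=True, as_percentile=False):
    """Rank one attribute; optionally report 100 * rank / n instead of the raw rank."""
    n = len(values)
    out = []
    for v in values:
        # Examples that sort strictly before this one, and examples tied with it.
        ahead = sum(1 for o in values if (o < v if ascending else o > v))
        equal = sum(1 for o in values if o == v)
        if ties == "min":
            r = ahead + 1
        elif ties == "max":
            r = ahead + equal
        elif ties == "average":
            r = ahead + (equal + 1) / 2
        else:
            raise ValueError(f"unknown tie rule: {ties}")
        out.append(100.0 * r / n if as_percentile else float(r))
    return out

def add_ranks(table, attributes, suffix="_rank", replace=False, **options):
    """Rank each selected attribute; add a new column or overwrite the original."""
    result = dict(table)  # shallow copy, so the input table is left untouched
    for name in attributes:
        result[name if replace else name + suffix] = rank_column(table[name], **options)
    return result

table = {"income": [50, 70, 70, 90], "age": [41, 35, 29, 35]}
ranked = add_ranks(table, ["income", "age"], ties="min")
print(ranked["income_rank"])  # [1.0, 2.0, 2.0, 4.0]
print(ranked["age_rank"])     # [4.0, 2.0, 1.0, 2.0]

pct = add_ranks(table, ["income"], ties="max", as_percentile=True)
print(pct["income_rank"])     # [25.0, 75.0, 75.0, 100.0]
```

The counting approach is quadratic in the number of examples and is chosen only for readability; a real operator would presumably sort once per attribute.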