Simple operator or method for combining nominal categories?

Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
Is there some easy way to combine nominal categories together based on frequency?  For example, if I have a nominal attribute with 10 different possible values, but I only want to keep the top 5 (by frequency) and then put the rest into an "Other" category.
This is obviously possible using some manual recoding logic, but I feel like there is a better way that is slipping my mind.  Is there some operator for this that I am forgetting?  The Discretize operators aren't ideal because they only work on numerical attributes, so using them would require recoding and would lose the underlying nominal values.
I have to do this with a large number of attributes/categories so I am looking for a solution that doesn't require manual recoding of the categories.
Thanks in advance!
Brian T.
Lindon Ventures 
Data Science Consulting from Certified RapidMiner Experts
Best Answer

  • Options
    MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Solution Accepted
    Hi,
    Replace Rare Values in Operator Toolbox is your friend :)
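
For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of "keep the top k values by frequency, relabel the rest". It is not the operator's implementation, and the names `keep_top_k`, `lump_columns`, `k` and `other` are made up for illustration. It assumes each attribute is a plain list of values, with `None` standing in for a missing value.

```python
from collections import Counter


def keep_top_k(values, k=5, other="Other"):
    """Keep the k most frequent values; relabel everything else as `other`.

    Missing values (None) are not counted and are left untouched.
    """
    counts = Counter(v for v in values if v is not None)
    # most_common(k) returns the k (value, count) pairs with the highest counts
    top = {value for value, _ in counts.most_common(k)}
    return [v if (v is None or v in top) else other for v in values]


def lump_columns(table, columns, k=5, other="Other"):
    """Apply keep_top_k to the selected columns of a {name: list} table."""
    return {
        name: keep_top_k(col, k, other) if name in columns else list(col)
        for name, col in table.items()
    }


if __name__ == "__main__":
    colours = ["red"] * 5 + ["blue"] * 4 + ["green"] * 3 + ["pink"] * 2 + [None]
    table = {"colour": colours, "id": list(range(len(colours)))}
    print(lump_columns(table, columns={"colour"}, k=2)["colour"])
```

Because the columns are passed as a set of names, the same call covers many attributes at once without any per-category mapping.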

    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany

Answers

  • Options
    rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Hi @Telcontar120

    If I understood your problem well, I would do something like this:
    • Generate a new field containing the frequency, alongside your category.
    • Generate a second field doing some discretization on the frequency, not the params.
    • Generate a third field with some code: if(frequency > 50;[Category];"Other").
    • Use the third field with the "combined" target.
    But now I'm wondering if there is anything I missed about the whole question, as my solution sounds too simplistic to me at least.
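
The steps above can be sketched roughly as follows in Python. This is only an illustration of the frequency-threshold idea, not RapidMiner code: the function name `lump_by_frequency` and the `min_count` parameter are hypothetical, the threshold of 50 is taken from the expression above, and the intermediate discretization step is skipped.

```python
from collections import Counter


def lump_by_frequency(categories, min_count=50, other="Other"):
    """Relabel every category that occurs min_count times or fewer."""
    counts = Counter(categories)
    # Step 1: a frequency "field" alongside each category value
    frequency = [counts[c] for c in categories]
    # Step 3: the equivalent of if(frequency > 50; [Category]; "Other")
    return [c if f > min_count else other for c, f in zip(categories, frequency)]


if __name__ == "__main__":
    data = ["x"] * 60 + ["y"] * 10
    combined = lump_by_frequency(data)
    print(Counter(combined))
```

Note the difference from a "top 5" rule: a fixed threshold can keep any number of categories, depending on how the counts are distributed.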

    All the best, sensei!

    Rodrigo.
  • Options
    Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Thanks guys for the fast replies.  Both approaches would work, but @rfuentealba you should check out the single operator that @mschmitz mentions because that is exactly what I wanted!  
    Brian T.
  • Options
    rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Awesome! I didn't know about it. Thank you both.
  • Options
    MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    What did you guys search for in the first place? The current pseudonyms (tags) for this operator are:
    <tags>
    <tag>Missing</tag>
    <tag>Map</tag>
    </tags>
    Which is apparently not enough. Since I am the author of the operator, I would love to know what we need to add so that it is easier to find.

    BR,
    Martin
  • Options
    Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Great question @mschmitz !  I searched for "nominal" with various combinations of "categories values discretize map replace".  It's probably my fault that I didn't think to search using only "replace" or "map" since I am aware of other RapidMiner operators with these names that are similar, but I was thinking they would require manual mapping which I wanted to avoid.  I would say "nominal" is a key term because in this use case there are other similar operators (the "discretize" ones) that only work on numericals so I was trying to focus on those operators that would work with nominals.  I realize your operator will work with any data type but with numericals I think you are much less likely to be searching for specific values to replace (since a continuous numerical attribute may have many individual values that are very infrequent).  
    Brian T.
  • Options
    rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    I'm not a native English speaker. I would have searched for "(discretize, summarize, replace, map, remap, regroup) weird values" (because "raro" is a common word in Spanish for both "weird" and "rare"). To be honest, I am not the kind of person who uses the search to discover new things, because of the language gap.

    On a slightly humorous note: Yes, I have to think before reacting when someone says I am "as rare as a Unicorn", because my first instinct usually tells me that I am "as weird as a Unicorn".
  • Options
    MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Thanks guys, I'll add a bunch of these!