One-Hot Encoding Top 10 Items (Fractional) Rest Other

Zarrok Member Posts: 3 Newbie
Hello everyone,

I am looking for a smart way to one-hot encode only the top 10 items (by fraction of rows), with the rest grouped as "Other".
Currently I solve the problem by manually creating a new attribute for each of the top 10 values. For example,
  for each value I generate a new column with an expression like:
if(contains([Attri], "Example Data"), 1, 0)

Does anybody have a smarter solution for this kind of problem?

Kind regards,
ZaRRoK

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,282 RM Data Scientist
    Hi,
    likely just use Remove Rare Values first and then One Hot Encoding?
    BR,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • Zarrok Member Posts: 3 Newbie
    edited July 22
    I understand what you mean. The problem is rather that I have a large dataset with about 4000 groups, of which I would like to look at the top 100; the others should be labeled "Other". That would give me 101 columns.
    The top 100 groups account for about 70% of the total.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,282 RM Data Scientist
    Yes, that's why I would propose using the Remove Rare Values operator to replace all strings that are not in the top 100 with "Other".
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
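For readers outside RapidMiner, the idea proposed above (lump everything outside the top N into "Other", then one-hot encode) can be sketched in plain Python. This is only an illustration of the logic, not the operator's actual implementation; the helper names `lump_rare` and `one_hot` and the toy data are hypothetical.

```python
from collections import Counter


def lump_rare(values, top_n, other="Other"):
    """Keep the top_n most frequent values; replace everything else with `other`."""
    keep = {value for value, _ in Counter(values).most_common(top_n)}
    return [value if value in keep else other for value in values]


def one_hot(values):
    """Return (column names, rows of 0/1 indicators) for a list of nominal values."""
    columns = sorted(set(values))
    rows = [[1 if value == column else 0 for column in columns] for value in values]
    return columns, rows


# Toy data (made up): "a" and "b" are frequent, "c" and "d" are rare.
groups = ["a", "b", "a", "c", "a", "b", "d"]

lumped = lump_rare(groups, top_n=2)
columns, rows = one_hot(lumped)
```

With `top_n=100` on the real data, this would give the 101 columns mentioned earlier in the thread: one per kept group plus one for "Other".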
  • Zarrok Member Posts: 3 Newbie
    I have found a solution, but I am not happy with it. I created an aggregation (fractional) and joined it back to the table. Then I generate a new attribute which, depending on the group's share, either takes over the original value or is set to "Other".
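The aggregate-then-join workaround described above might look roughly like this in plain Python. It is a sketch under assumptions: the cutoff is taken to be a minimum per-group share of all rows, and the names `group_shares`, `keep_or_other`, and `min_share` are hypothetical, as is the toy data.

```python
from collections import Counter


def group_shares(values):
    """Aggregation step: fraction of all rows that falls into each group."""
    total = len(values)
    return {group: count / total for group, count in Counter(values).items()}


def keep_or_other(values, min_share, other="Other"):
    """Look up each row's group share and keep the value or recode it as `other`."""
    shares = group_shares(values)
    # The dictionary lookup plays the role of joining the aggregate back to the rows.
    return [value if shares[value] >= min_share else other for value in values]


# Toy data (made up): shares are a = 0.5, b = 0.3, c = 0.1, d = 0.1.
groups = ["a"] * 5 + ["b"] * 3 + ["c", "d"]

recoded = keep_or_other(groups, min_share=0.2)
```

The recoded attribute can then go into a regular one-hot encoding step, so only the groups above the share cutoff get their own column.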
