
Creating model to categorize data

MarlaBot Employee, Member Posts: 57 Community Manager
edited February 2019 in Help
A RapidMiner user wants to know the answer to this question: "I have a list of about 120 values that serve as categories. I have to be able to predict what category a value belongs to based on its other attributes. The values that I am training on are each associated with one of these categories. I need to create a model that will take the combination of values from the other columns and predict what category it belongs in. I have tried to use a decision tree and it does not seem to be doing very well. There are too many categories and it keeps making poor predictions. Any suggestions? Thank you."

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist
    Hi,
    Is there any way to define a taxonomy over the 120 classes?

    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    There are probably too many categories and not enough cases in many of them for the algorithm to detect patterns all at once.  You have a couple of options:
    • Create groupings of these categories (this is the taxonomy that Martin mentioned above) so you end up with a much smaller number of super-categories and try to build a model to predict those.  Ideally you would have pretty robust counts in each of the super-categories and not too many of them (e.g., 12 would be much better than 120!).
    • Find the dominant categories (once again by count) and create a series of "one vs all other" models.  This would require you to build multiple models but will give you more control over the specific categories selected.
    • Or you could do a hybrid of the two methods above.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
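
The label preparation behind the options above can be sketched in a few lines of plain Python, outside RapidMiner. This is only an illustration under stated assumptions: the function names, the `taxonomy` mapping, the `"OTHER"` fallback, and the toy fruit/vegetable labels are all hypothetical, and the thread does not say how the poster's 120 categories should actually be grouped. The sketch covers the relabeling step only; a classifier would still have to be trained on the resulting targets.

```python
from collections import Counter


def group_labels(labels, taxonomy, default="OTHER"):
    """Option 1: map each fine-grained label to a super-category.

    `taxonomy` is a hypothetical dict {category: super_category};
    labels missing from it fall back to `default`.
    """
    return [taxonomy.get(label, default) for label in labels]


def dominant_categories(labels, min_count):
    """Return categories with at least `min_count` cases, most frequent first."""
    counts = Counter(labels)
    return [cat for cat, n in counts.most_common() if n >= min_count]


def one_vs_rest_targets(labels, categories):
    """Option 2: build one binary target per selected category.

    Each target is 1 where the row belongs to the category and 0 otherwise,
    so a separate "one vs all other" model can be trained on each.
    """
    return {cat: [int(label == cat) for label in labels] for cat in categories}


def hybrid_labels(labels, taxonomy, dominant, default="OTHER"):
    """Option 3: keep dominant categories as-is, group the rest."""
    keep = set(dominant)
    return [
        label if label in keep else taxonomy.get(label, default)
        for label in labels
    ]


# Toy data standing in for the real label column (an assumption).
labels = ["apple", "pear", "apple", "kale", "apple", "pear", "leek"]
taxonomy = {
    "apple": "fruit",
    "pear": "fruit",
    "kale": "vegetable",
    "leek": "vegetable",
}

super_labels = group_labels(labels, taxonomy)
dominant = dominant_categories(labels, min_count=2)
targets = one_vs_rest_targets(labels, dominant)
mixed = hybrid_labels(labels, taxonomy, dominant)
```

With real data, `min_count` would be chosen so that each retained category has enough cases to learn from. In the hybrid variant the remaining rare categories share a super-category label instead of each being modeled on its own.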