
How to set up a model to categorize texts

gstar Member Posts: 3 Contributor I
edited November 2018 in Help
Hi folks, being relatively new to RapidMiner, I would like to achieve the following task:

To set up a process that
1) does text mining* to find out the most common words within a category of text (e.g. recipes for beef, vegetables, etc.)
2) feeds the different results for each category into a model to teach the model the text category
3) takes an unknown text (e.g. a recipe for beef stock) and compares it to the model to find out the corresponding category.

*the documents are relatively short and contain between 50 and 200 words
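
The three steps above can be sketched outside RapidMiner in plain Python. This is only an illustration under assumptions: the toy recipes, the function names `tokenize`, `train` and `classify`, and the choice of a small multinomial Naive Bayes with Laplace smoothing as the model are all made up for this example and are not taken from the original process.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled_docs):
    """Steps 1 and 2: count words per category and collect what the model needs."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of documents
    for text, category in labelled_docs:
        word_counts[category].update(tokenize(text))
        doc_counts[category] += 1
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Step 3: return the category with the highest log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    words = [w for w in tokenize(text) if w in vocabulary]  # skip unseen words
    best_category, best_score = None, -math.inf
    for category in sorted(doc_counts):
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)  # class prior
        for word in words:
            # Laplace smoothing keeps unseen word/category pairs above zero.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy training data, two short "recipes" per category.
training_docs = [
    ("sear the beef steak then roast the beef in the oven", "beef"),
    ("simmer beef bones with onion for a rich beef broth", "beef"),
    ("chop carrots and zucchini then steam the vegetables", "vegetables"),
    ("roast the vegetables with olive oil carrots and peppers", "vegetables"),
]

model = train(training_docs)
top_beef_words = model[0]["beef"].most_common(3)  # step 1: frequent words
predicted = classify("a recipe for beef stock with bones", model)
```

With documents of only 50 to 200 words, word counts per category are sparse, which is why the sketch smooths the counts instead of relying on raw frequencies.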

So far, the text mining step works quite well.
Choosing the right model seems challenging.
A decision tree produces a plausible model. However, the branches do not show yes/no splits (word exists / does not exist). Instead, I am only presented with statistics for decision making, which I cannot use for step 3.  :-[

Thanks for any input!
Gstar

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Gstar,

    For text mining, Naive Bayes or a linear SVM usually does a good job.
    Don't forget to optimize the C parameter of the SVM using Optimize Parameters (Grid). Usually a range between 1e-4 and 1 on a logarithmic scale is a good starting point. Expand the range if the detected optimum is near the limits of the range.

    Best regards,
    Marius
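
The advice on the C range can be illustrated with a small Python sketch. It builds a logarithmic grid from 1e-4 to 1 and widens the range when the best value lands on a boundary. The helper names `log_grid` and `expand_if_at_edge`, the widening factor of 100, and the `toy_score` function (a hypothetical stand-in for a cross-validated accuracy) are assumptions for illustration; in RapidMiner the search itself is done by Optimize Parameters (Grid).

```python
import math


def log_grid(low, high, steps):
    """Return `steps` values spaced evenly on a log10 scale from low to high."""
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / (steps - 1)
    return [10 ** (log_low + i * step) for i in range(steps)]


def expand_if_at_edge(grid, best, factor=100.0):
    """Widen the search range when the best value sits on a boundary."""
    low, high = grid[0], grid[-1]
    if math.isclose(best, low):
        low /= factor
    elif math.isclose(best, high):
        high *= factor
    return low, high


def toy_score(c):
    """Hypothetical performance curve that peaks at C = 10, outside the grid."""
    return -abs(math.log10(c) - 1.0)


grid = log_grid(1e-4, 1.0, 5)              # one candidate per decade
best = max(grid, key=toy_score)            # best C found on the first grid
new_range = expand_if_at_edge(grid, best)  # range for a second search
```

Because the toy optimum lies above the initial range, the best grid value is the upper limit, so the second search would cover a wider range on the high side.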
  • gstar Member Posts: 3 Contributor I
    Great. Thanks! I'll try it and report back later!
  • gstar Member Posts: 3 Contributor I
    Working with 5 categories, so far I got the best results with a k-NN model using overlap similarity and k=5.
    Naive Bayes performs worse.
    I cannot get the linear SVM to work, since it does not support polynominal labels (i.e. 5 different labels in my case).

    Is there a workaround?
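
For readers who want to see what a k-NN vote with an overlap measure looks like, here is a minimal Python sketch. It assumes the set-based overlap coefficient (shared words divided by the size of the smaller word set), which may differ from the overlap similarity RapidMiner computes. The toy documents and the names `word_set`, `overlap` and `knn_predict` are hypothetical.

```python
import re
from collections import Counter


def word_set(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(a, b):
    """Overlap coefficient: shared words divided by the smaller set size."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def knn_predict(text, labelled_docs, k=5):
    """Vote among the k training documents most similar to `text`."""
    query = word_set(text)
    ranked = sorted(
        labelled_docs,
        key=lambda doc: overlap(query, word_set(doc[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus with three categories.
docs = [
    ("beef stock with bones", "beef"),
    ("roast beef with onion", "beef"),
    ("beef steak with pepper", "beef"),
    ("steamed carrots and peas", "vegetables"),
    ("grilled zucchini and peppers", "vegetables"),
    ("roasted vegetables and herbs", "vegetables"),
    ("chocolate cake", "dessert"),
    ("vanilla ice cream", "dessert"),
    ("apple pie", "dessert"),
]

prediction = knn_predict("beef stock with bones and onion", docs, k=5)
```

The measure only looks at which words occur, not how often, which suits short documents where most words appear once.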
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    The operator Polynominal by Binominal Classification is your friend in this case :)

    Best regards,
    Marius
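
The idea behind wrapping a two-class learner for a multi-class label can be sketched in Python as a one-versus-rest scheme: train one binary model per category, then pick the category whose model scores highest. This is an illustration under assumptions, not a description of how the RapidMiner operator is implemented. A simple perceptron stands in for the linear SVM, and the data and the names `train_binary`, `train_one_vs_rest` and `predict` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_binary(docs, positive_label, epochs=10):
    """Train a perceptron (linear stand-in for an SVM): one label vs. the rest."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in docs:
            target = 1 if label == positive_label else -1
            words = tokenize(text)
            score = bias + sum(weights[w] for w in words)
            if target * score <= 0:  # wrong side or on the boundary: update
                for w in words:
                    weights[w] += target
                bias += target
    return weights, bias


def train_one_vs_rest(docs):
    """Build one binary model per distinct label."""
    labels = sorted({label for _, label in docs})
    return {label: train_binary(docs, label) for label in labels}


def predict(text, models):
    """Return the label whose binary model gives the highest score."""
    words = tokenize(text)

    def score(label):
        weights, bias = models[label]
        return bias + sum(weights.get(w, 0.0) for w in words)

    return max(models, key=score)


# Hypothetical toy data with three labels (the thread uses five).
docs = [
    ("beef steak", "beef"),
    ("carrot salad", "vegetables"),
    ("chocolate cake", "dessert"),
    ("beef stew", "beef"),
    ("carrot soup", "vegetables"),
    ("chocolate mousse", "dessert"),
]

models = train_one_vs_rest(docs)
prediction = predict("beef stock", models)
```

Each binary model only ever sees two classes (its own label and "everything else"), so a learner limited to binominal labels can be reused unchanged for any number of categories.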