How to set up a model to categorize texts

gstar · January 2014

Hi folks, being a relative newbie to RapidMiner, I would like to achieve the following task:

To set up a process that
1) does text mining* to find out the most common words within a category of text (e.g. recipes for beef, vegetables, etc.)
2) feeds the different results for each category into a model to teach the model the text category
3) takes an unknown text (e.g. a recipe for beef stock) and compares it to the model to find out the corresponding category.

*the documents are relatively short and contain between 50 and 200 words
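
For readers who want to see the three steps outside of RapidMiner, here is a minimal sketch in plain Python. It assumes a multinomial Naive Bayes with Laplace smoothing as the model, which is only one possible choice; the toy recipes and the function names `tokenize`, `train` and `classify` are made up for illustration and there is no stop-word removal.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_texts):
    """Steps 1 and 2: count the words per category and keep the counts as the model."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of documents
    for category, text in labeled_texts:
        word_counts[category].update(tokenize(text))
        doc_counts[category] += 1
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary


def classify(model, text):
    """Step 3: score an unseen text against every category and return the best one."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    tokens = [word for word in tokenize(text) if word in vocabulary]
    scores = {}
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)  # category prior
        for word in tokens:
            # Laplace smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[category] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, not taken from the thread.
training = [
    ("beef", "Sear the beef roast, then braise the beef with onions and stock."),
    ("beef", "Simmer beef bones for hours to make a rich beef stock."),
    ("vegetables", "Roast carrots, zucchini and peppers with olive oil."),
    ("vegetables", "Steam the broccoli and toss the vegetables with lemon."),
]
model = train(training)
top_beef_words = model[0]["beef"].most_common(3)
prediction = classify(model, "A recipe for beef stock made from roasted bones.")
print(top_beef_words, prediction)
```

With real data you would train on many documents per category and evaluate on held-out texts rather than on a single example.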

So far, the text mining part works quite well.
Choosing the right model seems challenging.
A decision tree comes up with a plausible model. However, the branches do not show a yes/no split (word exists / does not exist). Instead, I am only presented with statistics for decision making, which I cannot use for step 3. :-[

Thanks for any input!
Gstar

MariusHelf · January 2014

Hi Gstar,

For text mining, Naive Bayes or a linear SVM usually does a good job.
Don't forget to optimize the C parameter of the SVM using Optimize Parameters (Grid). Usually a range between 1e-4 and 1 on a logarithmic scale is a good starting point. Expand the range if the detected optimum is near the limits of the range.

Best regards,
Marius
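
The tuning advice above (a logarithmic grid from 1e-4 to 1, widened when the optimum lands on a boundary) can be sketched in plain Python as follows. This is only an illustration of the search logic, not of what Optimize Parameters (Grid) does internally. `log_grid`, `search_c` and `toy_score` are hypothetical names, and `toy_score` stands in for a cross-validated performance measure of the SVM.

```python
import math


def log_grid(low_exp, high_exp):
    """Return one value per decade from 10**low_exp to 10**high_exp."""
    return [10.0 ** exp for exp in range(low_exp, high_exp + 1)]


def search_c(score, low_exp=-4, high_exp=0, max_expansions=5):
    """Grid-search C and widen the range by one decade while the optimum sits on an edge."""
    best = None
    for _ in range(max_expansions + 1):
        grid = log_grid(low_exp, high_exp)
        best = max(grid, key=score)
        if best == grid[0]:
            low_exp -= 1    # optimum at the lower limit: extend downwards
        elif best == grid[-1]:
            high_exp += 1   # optimum at the upper limit: extend upwards
        else:
            break           # optimum lies strictly inside the range
    return best


def toy_score(c):
    """Made-up score that peaks at C = 100, used only to exercise the search."""
    return -abs(math.log10(c) - 2)


best_c = search_c(toy_score)
print(best_c)
```

In this toy run the initial range ends at 1, so the search has to extend the upper limit a few times before the optimum is no longer on the boundary.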

gstar · January 2014

Great. Thanks! I'll try it and report back later!

gstar · January 2014

Working with 5 categories, so far I got the best results with a k-NN model using overlap similarity and k=5.
Naive Bayes performs worse.
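
To illustrate the k-NN setup, here is a rough sketch in plain Python. It assumes that overlap similarity means the overlap coefficient on word sets, |A ∩ B| / min(|A|, |B|), which may differ in detail from RapidMiner's measure. The data and the names `overlap_similarity` and `knn_predict` are hypothetical.

```python
from collections import Counter


def overlap_similarity(a, b):
    """Overlap coefficient of two word sets: |A & B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def knn_predict(train_docs, words, k=5):
    """Majority vote among the k training documents most similar to `words`."""
    ranked = sorted(
        train_docs,
        key=lambda item: overlap_similarity(item[1], words),
        reverse=True,
    )
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (category, set of words in the document).
train_docs = [
    ("beef", {"beef", "stock", "bones", "simmer"}),
    ("beef", {"beef", "roast", "onions", "braise"}),
    ("beef", {"beef", "steak", "pepper", "sear"}),
    ("vegetables", {"carrots", "zucchini", "roast", "oil"}),
    ("vegetables", {"broccoli", "steam", "lemon"}),
    ("vegetables", {"spinach", "garlic", "oil"}),
]
query = {"beef", "stock", "recipe", "bones"}
knn_label = knn_predict(train_docs, query, k=5)
print(knn_label)
```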
I cannot get the linear SVM to work, since it does not support polynominal labels (i.e. 5 different labels in my case).

Is there a workaround?

MariusHelf · January 2014

The operator Polynominal by Binominal Classification is your friend in this case.

Best regards,
Marius
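
The general idea of wrapping a two-class learner so that it handles several labels can be sketched as one-vs-rest: train one binary model per category and pick the category whose model is most confident. The sketch below is an assumption-laden illustration in plain Python, not a description of the operator's internals. A tiny perceptron stands in for the linear SVM, and all names and data are hypothetical.

```python
def train_binary(examples, epochs=20):
    """Train a tiny perceptron on (word_set, is_positive) pairs.

    Stand-in for any binary learner, such as a linear SVM.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for words, is_positive in examples:
            target = 1 if is_positive else -1
            activation = bias + sum(weights.get(w, 0.0) for w in words)
            if target * activation <= 0:  # misclassified: nudge the weights
                for w in words:
                    weights[w] = weights.get(w, 0.0) + target
                bias += target
    return weights, bias


def decision_value(model, words):
    """Signed score of a document; larger means more likely positive."""
    weights, bias = model
    return bias + sum(weights.get(w, 0.0) for w in words)


def train_one_vs_rest(labeled_docs):
    """Train one binary model per label: that label against all the others."""
    labels = sorted({label for label, _ in labeled_docs})
    return {
        label: train_binary(
            [(words, doc_label == label) for doc_label, words in labeled_docs]
        )
        for label in labels
    }


def predict_one_vs_rest(models, words):
    """Return the label whose binary model gives the highest score."""
    return max(models, key=lambda label: decision_value(models[label], words))


# Hypothetical toy data with three categories.
docs = [
    ("beef", {"beef", "stock", "bones"}),
    ("beef", {"beef", "roast", "onions"}),
    ("fish", {"salmon", "lemon", "dill"}),
    ("fish", {"cod", "lemon", "butter"}),
    ("vegetables", {"carrots", "zucchini", "oil"}),
    ("vegetables", {"broccoli", "garlic", "oil"}),
]
models = train_one_vs_rest(docs)
ovr_label = predict_one_vs_rest(models, {"beef", "stock", "thyme"})
print(ovr_label)
```

With 5 categories this trains 5 binary models, one per label.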
