"Text Mining Classification Problem"

bx01z · April 2010

Hello,

I am able to create a model using RM5, but I do not believe the algorithm I chose is working well. I have tried several algorithms, including SVM, NaiveBayes, and W-SMO.

For each document, I Tokenize, Filter Stopwords (English), and then Filter Tokens (by length); the result is sent to the classification algorithm.
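For readers who want to see this chain outside RapidMiner, here is a minimal plain-Python sketch of the same three steps. It is not how RM5 implements its operators; the tiny stopword set, the length bounds (3 to 25 characters), and the function name `preprocess` are all assumptions made for illustration.

```python
import re

# Tiny illustrative stopword set (an assumption; real English lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it"}

def preprocess(text, min_len=3, max_len=25):
    """Tokenize, drop stopwords, then drop tokens outside the length range."""
    # Step 1: tokenize on runs of letters after lowercasing.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Step 2: filter stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Step 3: filter tokens by length (bounds are hypothetical defaults).
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(preprocess("The model is unable to classify an unlabeled document."))
```

Whatever the tool, the unlabeled documents need to pass through exactly the same steps as the training documents, otherwise the model sees terms it was never trained on.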

I then take unlabeled data, process it, and apply the model, but every document is classified as the same value.

I have 4 classes, with 500 labeled documents each for training.

Please provide guidance.

Thanks,
Bob

IngoRM · April 2010

Hi Bob,

Phew, this question definitely reaches the limit of the support we are able to provide for free in this forum, sorry. Questions like this are exactly what we work on in consulting projects for our customers, and they often need much more time than a few minutes of thinking and writing a forum reply.

However, here are some hints for optimizing:

- You could try further preprocessing techniques such as stemming or character or term n-grams.
- If the texts come from a specific domain, a dictionary for mapping terms can sometimes help.
- You could try pruning or other (mild!) feature selection techniques.
- Try different modeling schemes and optimize their parameters.
- ...
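Two of these hints, term n-grams and pruning, can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not RapidMiner's implementation: the function names, the `_` separator, and the thresholds `min_df=2` and `max_df_ratio=0.9` are hypothetical choices.

```python
from collections import Counter

def term_ngrams(tokens, n=2):
    """Return the tokens plus all term n-grams up to length n, joined with '_'."""
    grams = list(tokens)
    for size in range(2, n + 1):
        grams += ["_".join(tokens[i:i + size])
                  for i in range(len(tokens) - size + 1)]
    return grams

def prune_by_doc_frequency(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms found in at least min_df documents and at most max_df_ratio of them."""
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    limit = max_df_ratio * len(docs)
    keep = {t for t, c in df.items() if min_df <= c <= limit}
    return [[t for t in doc if t in keep] for doc in docs]

docs = [
    term_ngrams(["text", "mining", "rocks"]),
    term_ngrams(["text", "mining", "tools"]),
    term_ngrams(["data", "mining", "tools"]),
]
for doc in prune_by_doc_frequency(docs):
    print(doc)
```

In this toy corpus, "mining" appears in every document and is pruned as uninformative, while terms seen only once are pruned as too rare. Keeping the pruning mild matters: aggressive thresholds can remove the very terms that separate the classes.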

There are millions of options, and it often takes a lot of experience to come up with a good idea for a concrete case.

Cheers,
Ingo

bx01z · April 2010

Hi Ingo,

Thanks for the reply and the suggestions. It's the stuff I've been trying, but I will press on. Perhaps I can ask a few simpler questions about how RM handles things.

1. When doing the data processing, is the label retained for the resulting dataset for each of the terms individually?
2. Is there a place to view accuracy levels of created models applied to the data used to create them?
3. Does RM use LingPipe at all?
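On question 2, one way to sanity-check a model outside the tool is to compare its predictions on the training data with the known labels. The sketch below is plain Python, not a RapidMiner feature; the label names A to D and the all-"A" predictions are made up to mirror a setup with 4 classes of 500 examples each and a model that predicts a single value.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of examples whose predicted label equals the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical data: 4 balanced classes, 500 examples each.
actual = ["A", "B", "C", "D"] * 500
# Hypothetical degenerate model that predicts "A" for everything.
predicted = ["A"] * len(actual)

print(accuracy(actual, predicted))                      # 0.25
print(Counter(predicted))                               # distribution of predictions
print(confusion_counts(actual, predicted)[("B", "A")])  # class B examples predicted as A
```

With four balanced classes, a model that always predicts one class scores 0.25, so training accuracy near that value, together with a single predicted label, suggests the model has not learned anything useful from the features.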

Thanks again,
Bob

IngoRM · April 2010

Hi,

ad 1)
Sorry, I didn't understand this.

ad 2)
Again, I am not sure I understood you. However, maybe you mean something like the result history in the result perspective, which shows the latest results (just click on the colored bars to open the individual results).

ad 3)
At least here I can be clear: LingPipe cannot be supported in the free community edition of RapidMiner due to license issues. Although there is a royalty-free license for LingPipe, it is not compatible with 100% open source licenses and strategies like RapidMiner's, sorry. Of course, it would be possible to build a custom connector to LingPipe within a customer project.

Cheers,
Ingo
