"Text Mining Classification Problem"
Hello,
I am able to create a model using RM5, but I do not believe the algorithm I chose is working well. I have tried a number of algorithms, including SVM, NaiveBayes, and W-SMO.
For the documents, I Tokenize, Filter Stopwords (english), then Filter Tokens (by length), and the result is sent to the classification algorithm.
I then take unlabeled data, process it, and apply the model, but it classifies every example as the same value.
I have 4 classes, with 500 labeled examples each for training.
Please provide guidance.
Thanks,
Bob
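For readers following along outside RapidMiner, the preprocessing chain described above (tokenize, filter stopwords, filter tokens by length) can be sketched in plain Python. This is only an illustration, not how RapidMiner implements these operators: the stopword list, the length limits, and the function names `preprocess`, `build_vocabulary`, and `vectorize` are all assumptions made for the example. One thing the sketch makes visible is that unlabeled text has to be mapped onto the vocabulary built from the training documents; whether a mismatch there explains the symptom in this thread is a guess, but it is cheap to check.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to", "in", "it"}


def preprocess(text, min_len=3, max_len=25):
    """Tokenize, drop stopwords, then drop tokens outside the length range."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t for t in tokens if min_len <= len(t) <= max_len]


def build_vocabulary(training_documents):
    """Map every term seen in the training documents to a column index."""
    terms = sorted({t for doc in training_documents for t in preprocess(doc)})
    return {term: index for index, term in enumerate(terms)}


def vectorize(text, vocabulary):
    """Term-count vector over the training vocabulary; unseen terms are ignored."""
    counts = Counter(preprocess(text))
    vector = [0] * len(vocabulary)
    for term, count in counts.items():
        if term in vocabulary:
            vector[vocabulary[term]] = count
    return vector


if __name__ == "__main__":
    vocabulary = build_vocabulary(["The cat sat", "A dog barked loudly"])
    print(vectorize("cat cat bird", vocabulary))
```

If most unlabeled documents come out as all-zero vectors under the training vocabulary, a classifier has nothing to distinguish them by and may well assign them all to one class.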
Answers
Phew, this question definitely reaches the limit of the support we are able to provide for free in this forum, sorry. Questions like this are exactly what we work on in consulting projects for our customers, and they often need much more time than a few minutes of thinking and writing an answer in a forum.
However, here are some hints for optimizing:
- You could also try different preprocessing techniques like stemming or character or term n-grams
- If the texts are derived from specific domains, sometimes a dictionary for mapping terms can also help
- You could try to use pruning or other (mild!) feature selection techniques
- Try different modeling schemes and optimize their parameters
- ...
There are millions of options, and it often takes a lot of experience to come up with a good idea for a concrete case.
Cheers,
Ingo
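Two of the hints above, term n-grams and mild pruning, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not RapidMiner's implementation: the function names `term_ngrams` and `prune_by_document_frequency` and the thresholds `min_docs` and `max_fraction` are hypothetical, and the pruning rule shown (keep terms that occur in at least a few documents but not in nearly all of them) is just one common way to do it.

```python
from collections import Counter


def term_ngrams(tokens, max_n=2):
    """Return all term n-grams of length 1..max_n, joined with underscores."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append("_".join(tokens[start:start + n]))
    return grams


def prune_by_document_frequency(tokenized_docs, min_docs=2, max_fraction=0.9):
    """Keep terms found in at least min_docs documents and in at most
    max_fraction of all documents (thresholds are illustrative)."""
    document_frequency = Counter()
    for tokens in tokenized_docs:
        document_frequency.update(set(tokens))  # count each term once per document
    upper_limit = max_fraction * len(tokenized_docs)
    return {
        term
        for term, count in document_frequency.items()
        if min_docs <= count <= upper_limit
    }


if __name__ == "__main__":
    print(term_ngrams(["text", "mining", "rocks"]))
    docs = [["a", "b"], ["a", "c"], ["a", "b"], ["a", "d"]]
    print(prune_by_document_frequency(docs))
```

Keeping the thresholds loose matches the "mild" in the hint: aggressive pruning can throw away exactly the rare terms that separate the classes.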
Thanks for the reply. Thanks for the suggestions. It's the stuff I've been trying, but I will press on. Perhaps I can ask a few simpler questions about how RM handles things.
1. When doing the data processing, is the label retained for the resulting dataset for each of the terms individually?
2. Is there a place to view accuracy levels of created models applied to the data used to create them?
3. Does RM use LingPipe at all?
Thanks again,
Bob
ad 1)
Sorry, I didn't understand this.
ad 2)
Again I am not sure if I got you. However, maybe you mean something like the result history in the result perspective which can show the latest results (just click on the colored bars to open the single results).
ad 3)
At least here I can be clear: LingPipe cannot be supported in the free community edition of RapidMiner due to license issues. Although there is a royalty-free license for LingPipe, it is not compatible with 100% open source licenses and strategies like RapidMiner's, sorry. Of course, it would be possible to build a custom connector to LingPipe within a customer project.
Cheers,
Ingo
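On question 2, if it is read as "how well does the model do on the data it was trained on", the check itself is simple once the true and predicted labels are side by side. The sketch below is plain Python with hypothetical helper names (`accuracy`, `prediction_distribution`), not a RapidMiner feature. It also shows why the symptom from the first post is easy to spot: with 4 balanced classes, a model that predicts a single class for everything scores exactly 25%.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of examples whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def prediction_distribution(predicted_labels):
    """How many examples were assigned to each class."""
    return dict(Counter(predicted_labels))


if __name__ == "__main__":
    # 4 classes with 500 labeled examples each, as in the original post.
    true_labels = ["A"] * 500 + ["B"] * 500 + ["C"] * 500 + ["D"] * 500
    # A degenerate model that assigns every example to class "A".
    predicted = ["A"] * len(true_labels)
    print(accuracy(true_labels, predicted))
    print(prediction_distribution(predicted))
```

Accuracy on the training data tends to be optimistic, so a held-out or cross-validated estimate is usually the more informative number; the same two helpers apply to either.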