Text classification problem with non-mutually-exclusive classes
Hi everyone!
I have a bunch of documents, each with probabilities of belonging to specific classes.
E.g.:
I want to train a model which can predict these probabilities from a given text.
As you can see, the classes are not mutually exclusive; each document only has a probability of belonging to each class. Note also that these probabilities do not add up to 100!
To get familiar with RapidMiner, I preprocessed the documents (tokenize, filter, ...) and gave them (mutually exclusive) labels.
E.g.:
Then I weighted the attributes with the SVM weighter and kept only those above a specific threshold (other feature selection methods, like forward or backward selection, did not finish after several hours).
Afterwards I trained an SVM model and ran a 10-fold cross-validation,
which performed pretty well, with an accuracy of 93%.
But in the end, I still have no solution to my initial problem and no clue how to proceed:
- Should I try to derive these probabilities from the confidence values of the SVM somehow? Is this possible? And how?
- Or train 7 linear regression models to predict these probabilities? But how do I find a proper feature selection with over 2000 terms?
- Or try a Bayesian model, which should give the probability of a class?
Thank you in advance for your hints and suggestions!
Answers
First of all: did you use pruning in the Process Document operator? That way you can get rid of some attributes that are not useful. Furthermore, you should filter out stopwords etc.
If you use X-Prediction instead of X-Validation, you get an example set that includes the confidences, which might help.
I don't think a linear regression model works well on text data. You could, however, try the SVM in regression "mode": simply use a numerical label with the standard SVM of RapidMiner, and it then does a regression instead of a classification.
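To make the pruning idea above concrete outside of RapidMiner, here is a minimal sketch in plain Python. It assumes pruning means dropping terms that occur in too few or too many documents; the function name `prune_vocabulary`, the parameters `min_docs` and `max_fraction`, and the toy documents are made up for illustration and are not RapidMiner's own names.

```python
from collections import Counter

def prune_vocabulary(docs, min_docs=2, max_fraction=0.9):
    """Keep terms that occur in at least min_docs documents
    and in at most max_fraction of all documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    max_docs = max_fraction * len(docs)
    return {term for term, df in doc_freq.items() if min_docs <= df <= max_docs}

# Hypothetical, already tokenized documents.
docs = [
    ["the", "engine", "failed"],
    ["the", "engine", "started"],
    ["the", "match", "started"],
    ["the", "team", "won"],
]
kept = prune_vocabulary(docs, min_docs=2, max_fraction=0.9)
print(sorted(kept))
```

Terms that appear in only one document, or in nearly all of them, carry little signal for a classifier, so removing them can shrink a 2000-term vocabulary considerably before any weighting or selection step.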
Proper feature selection is tricky. Of course, a Forward Selection will not work on 2000 attributes: the first two rounds alone require 2000 + 1999 model evaluations, each with its own validation. I like the idea of weighting by SVM.
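As a rough illustration of why this does not finish, the sketch below counts model trainings under the usual greedy scheme, where each round tries every attribute that has not been selected yet. The function name and the `n_folds` parameter are assumptions for the example, not RapidMiner settings.

```python
def forward_selection_evaluations(n_attributes, n_rounds, n_folds=1):
    """Count model trainings needed for the first n_rounds of greedy
    forward selection, with n_folds trainings per candidate."""
    # Round k (0-based) tests every attribute not yet selected.
    candidates = sum(n_attributes - k for k in range(n_rounds))
    return candidates * n_folds

two_rounds = forward_selection_evaluations(2000, 2)
two_rounds_cv = forward_selection_evaluations(2000, 2, n_folds=10)
fifty_rounds_cv = forward_selection_evaluations(2000, 50, n_folds=10)
print(two_rounds, two_rounds_cv, fifty_rounds_cv)
```

With a 10-fold cross-validation inside each evaluation, selecting even a few dozen attributes already means hundreds of thousands of SVM trainings, which is why a cheap weighting scheme is the more practical choice here.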
A Bayesian model might work. Additionally, you could try a k-NN with cosine similarity, but applying the model might take a while.
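A minimal sketch of the k-NN idea for this kind of data, in plain Python rather than RapidMiner: each training document carries one probability per class, and the prediction is simply the average over the k most similar documents by cosine similarity. The averaging rule, the function names, and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(train, query_tokens, k=2):
    """Average the per-class probabilities of the k nearest documents."""
    query = Counter(query_tokens)
    ranked = sorted(train, key=lambda doc: cosine(Counter(doc[0]), query),
                    reverse=True)
    neighbours = ranked[:k]
    n_classes = len(neighbours[0][1])
    return [sum(doc[1][c] for doc in neighbours) / len(neighbours)
            for c in range(n_classes)]

# Made-up data: (tokens, [probability per class]); rows need not sum to 1.
train = [
    (["engine", "car", "road"], [0.9, 0.1]),
    (["car", "race", "engine"], [0.8, 0.3]),
    (["goal", "match", "team"], [0.1, 0.9]),
]
prediction = knn_predict(train, ["car", "engine"], k=2)
print(prediction)
```

Because every query is compared against all training documents, the cost of applying the model grows with the size of the training set, which matches the warning above.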
Dortmund, Germany