# Text classification problem with non-mutually-exclusive classes

Hi everyone!

I have a bunch of documents and, for each document, the probability of it belonging to each class.

E.g.:

| text | C1 | C2 | C3 | C4 | C5 | C6 | C7 |
|---|---|---|---|---|---|---|---|
| bla bla... | 10% | 20% | 60% | 80% | 0% | 5% | 30% |

I want to train a model which can predict these probabilities from a given text.

As you can see, the classes are not mutually exclusive; each document only has a probability of belonging to each class. One can also see that these probabilities do not add up to 100%.

To get familiar with RapidMiner, I preprocessed the documents (tokenize, filter, ...) and gave them (mutually exclusive) labels.

E.g.:

| text | label |
|---|---|
| bla.. | C1 |
| lorem.. | C2 |
| ipsum | C7 |

Then I weighted the attributes of these documents with the SVM weighter and kept only those above a specific threshold (other feature selection methods, like forward or backward selection, did not finish after several hours).

Afterwards I trained an SVM model and ran a 10-fold cross-validation, which performed pretty well, with an accuracy of 93%.

But in the end, I still have no solution to my initial problem and no clue how to proceed:

- Should I try to derive these probabilities from the confidence values of the SVM somehow? Is this possible? And how?
- Or train 7 linear regression models to predict these probabilities? But how do I find a proper feature selection with over 2000 terms?
- Or try a Bayesian model, which should give the probability of a class?

Thank you in advance for your hints and suggestions!

## Answers

First of all: did you use pruning in the Process Document operator? That way you might get rid of some useless attributes. Furthermore, you should filter for stopwords etc. If you use X-Prediction instead of X-Validation, you get an example set including the confidences, which might help.

I don't think a linear regression model works well on text data. You could, however, try to use the SVM in regression "mode". Simply use a numerical label with the standard SVM of RapidMiner; then it does a regression instead of a classification.
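To make the pruning and stopword idea concrete outside of RapidMiner, here is a minimal plain-Python sketch. It is not what the RapidMiner operator does internally; the toy corpus, the `STOPWORDS` set, the function name `prune_vocabulary` and the thresholds `min_docs` and `max_fraction` are all made up for illustration. It drops terms that occur in too few or too many documents, plus stopwords.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized and lower-cased.
docs = [
    ["the", "svm", "model", "predicts", "classes"],
    ["the", "bayesian", "model", "gives", "probabilities"],
    ["the", "knn", "model", "uses", "cosine", "similarity"],
    ["the", "regression", "svm", "predicts", "probabilities"],
]

STOPWORDS = {"the", "a", "an", "of", "and"}  # illustrative, not a full list


def prune_vocabulary(docs, min_docs=2, max_fraction=0.9, stopwords=STOPWORDS):
    """Keep terms that occur in at least `min_docs` documents and in at most
    `max_fraction` of all documents, and that are not stopwords."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    max_docs = max_fraction * len(docs)
    return {
        term
        for term, df in doc_freq.items()
        if min_docs <= df <= max_docs and term not in stopwords
    }


vocabulary = prune_vocabulary(docs)
pruned_docs = [[t for t in tokens if t in vocabulary] for tokens in docs]
print(sorted(vocabulary))  # ['model', 'predicts', 'probabilities', 'svm']
```

On a real corpus with 2000+ terms, thresholds like these usually remove a large share of the attributes before any weighting or model training starts, but the right values depend on the data.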

The proper feature selection is tricky. Of course a Forward Selection will not work well on 2000 attributes. The first two rounds alone need 2000 + 1999 model evaluations, and each of those is a complete cross-validation. I like the idea of the weight by SVM.
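To get a feeling for why this takes hours, the cost can be counted directly. The sketch below assumes the usual greedy scheme, where each round tries every still-unused attribute once and keeps the best one, so the count is additive per round. The function name and the parameter `cv_folds` are only for illustration.

```python
def forward_selection_trainings(n_attributes, n_rounds, cv_folds=1):
    """Number of model trainings greedy forward selection needs.

    Round r (starting at 0) tries each of the n_attributes - r unused
    attributes once; each try costs `cv_folds` trainings.
    """
    candidates = sum(n_attributes - r for r in range(n_rounds))
    return candidates * cv_folds


print(forward_selection_trainings(2000, 2))       # 3999
print(forward_selection_trainings(2000, 2, 10))   # 39990
print(forward_selection_trainings(2000, 50, 10))  # 987750
```

So selecting just 50 attributes with a 10-fold cross-validation inside already means close to a million SVM trainings, which is why a weighting scheme with a threshold is the more practical choice here.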

A Bayesian model might work. Additionally, you could try a k-NN with cosine similarity. But applying the model might take a while.
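As a rough illustration of the k-NN idea for this kind of target, here is a plain-Python sketch that predicts the whole probability vector of a new text by averaging the vectors of its most similar training documents. The training data, the three classes (instead of your seven) and the names `cosine` and `knn_predict` are invented for the example; this is not how RapidMiner implements k-NN.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(query_tokens, training, k=2):
    """Average the class-probability vectors of the k most similar documents.

    Every training document is compared with the query, which is why
    applying the model gets slow on large corpora.
    """
    query = Counter(query_tokens)
    ranked = sorted(
        training,
        key=lambda item: cosine(query, item[0]),
        reverse=True,
    )
    neighbours = ranked[:k]
    n_classes = len(neighbours[0][1])
    return [
        sum(probs[c] for _, probs in neighbours) / len(neighbours)
        for c in range(n_classes)
    ]


# Hypothetical training data: (term counts, probabilities for C1..C3).
training = [
    (Counter("cheap flights hotel deals".split()), [0.9, 0.1, 0.0]),
    (Counter("hotel booking cheap rooms".split()), [0.8, 0.2, 0.1]),
    (Counter("football match result goals".split()), [0.0, 0.1, 0.9]),
]

prediction = knn_predict("cheap hotel rooms".split(), training, k=2)
print([round(p, 2) for p in prediction])
```

Because each class probability is averaged separately, the predicted values do not have to add up to 100%, which matches your non-mutually-exclusive setting.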

Dortmund, Germany