"How to improve Classification in Text Mining"

mdc · November 2009

I'm doing classification (15 classes) of technical papers using their abstract.

My processes are simple.

Learning:
+ TextInput
+ String Tokenizer
+ English StopwordFilter
+ TokenLengthFilter
+ Binary2MultiClassLearner
+ LibSVMLearner
+ ModelWriter

Applying:
+ TextInput
+ String Tokenizer
+ English StopwordFilter
+ TokenLengthFilter
+ ModelLoader
+ ModelApplier
+ ExcelExampleSetWriter
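
For readers outside RapidMiner, the preprocessing shared by both processes above (tokenizing, English stopword filtering, token length filtering) can be sketched in plain Python. This is only an illustration: the tiny stopword list, the `min_length` / `max_length` defaults and the function names are assumptions, not the actual settings of the operators.

```python
import re

# Tiny illustrative stopword list (assumption; a real filter uses a much larger one).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "for", "to", "we", "on"}


def preprocess(text, min_length=3, max_length=25):
    """Tokenize, drop English stopwords, then drop tokens outside the length range."""
    # Tokenize: lowercase and split on anything that is not a letter.
    tokens = [t for t in re.split(r"[^a-z]+", text.lower()) if t]
    # Stopword filter.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Token length filter.
    return [t for t in tokens if min_length <= len(t) <= max_length]


def to_bag_of_words(tokens):
    """Count term occurrences; this is the kind of vector a learner would see."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


if __name__ == "__main__":
    abstract = "We propose a routing protocol for the analysis of ad hoc networks."
    print(to_bag_of_words(preprocess(abstract)))
```

The same function has to be used for learning and for applying, otherwise the model sees attributes it was not trained on.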

I get results but I'm not satisfied with them. How do I improve them?

I've been searching the forum and have seen that feature selection is one way. There are lots of examples that use the FeatureSelection operator, but I couldn't find one that writes to a model file. One example from the installer is shown below, but I couldn't figure out where to add the ModelWriter. Or am I thinking about this the wrong way?
....
+ FeatureSelection
+ XValidation
+ NearestNeighbors
+ OperatorChain
+ ModelApplier
+ Performance
+ ProcessLog

I'm also thinking of forcing some attributes with bigger weights. Is this a good thing to do and how do I do this?

thanks,
Matthew

fischer · November 2009

Hi,

Regarding the feature selection: What you want to do is probably not to use a ModelApplier, but rather save the attribute weights (AttributeWeightsWriter) and apply them (AttributeWeightsApplier).

Regarding the optimization of the setup: There is no general answer. Try optimizing parameters of the SVM and of the text input, try adding term n-grams, etc., maybe add a dictionary for synonyms. It very much depends on your texts.
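
As an illustration of the "term n-grams" idea, here is a minimal Python sketch that turns a token list into unigrams plus bigrams. The function name and the underscore separator are assumptions made for this example, not what the text plugin produces.

```python
def term_ngrams(tokens, max_n=2):
    """Return all term n-grams of length 1..max_n, joined with underscores."""
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive tokens over the document.
        for i in range(len(tokens) - n + 1):
            grams.append("_".join(tokens[i:i + n]))
    return grams


if __name__ == "__main__":
    print(term_ngrams(["support", "vector", "machine"]))
```

Bigrams such as "support_vector" can separate classes that share the single terms, at the price of many more attributes.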

Cheers,
Simon

haddock · November 2009

Hi,

Sometimes it is tempting to tweak the answer, and to forget about whether the question makes any sense. Fifteen classes? Think how many examples would be necessary to represent the problem space.

fischer · November 2009

Umh. Yes. Actually, I missed that part in the original post. I agree with haddock. If you have 15 classes it is not particularly surprising that you are not satisfied with the results :-)

mdc · November 2009

Thanks guys for the answers. I was actually thinking of adding more classes.

Then what is the ideal number of classes for text classification? And how do you solve the problem of classifying technical documents into many categories? Is data mining not the solution?

Matthew

mdc · November 2009

Also, where would you add the AttributeWeightsWriter operator in this example?

+ FeatureSelection
+ XValidation
+ NearestNeighbors
+ OperatorChain
+ ModelApplier
+ Performance
+ ProcessLog

thanks,
Matthew

steffen · November 2009

Jumping in ...

Of course, Data Mining is the solution ;D

Regarding the number of classes: What haddock meant was that you need a lot of examples / documents per category to a) have enough information to distinguish the classes and b) make any statistically reliable performance estimates. So... how many do you have?

Low performance values are an indication that the classes cannot be easily distinguished. Here are some rough ideas:

If the classes are the leaves of a hierarchy, try to go up the hierarchy and merge classes (e.g. merge "network administration" and "software engineering" into "computer science") to see whether the results get better. Performing feature selection on different "levels" and comparing the results manually may give you a better feeling for where the problem is located.
Merge classes iteratively and perform a one-vs-all classification. During scoring, aggregate the confidence values from the different models (e.g. by taking the maximum; use the operator AttributeConstruction for that strategy).
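
The one-vs-all idea with maximum aggregation can be sketched outside RapidMiner like this. Note that `make_keyword_model` is only a hypothetical stand-in for a trained one-vs-all model (in the real process that would be e.g. an SVM per class); the part that matters is `predict_one_vs_all`.

```python
def make_keyword_model(keywords):
    """Stand-in for a trained one-vs-all model (assumption): the confidence is
    the fraction of the document's tokens that appear in the class keywords."""
    keywords = set(keywords)

    def confidence(tokens):
        if not tokens:
            return 0.0
        return sum(1 for t in tokens if t in keywords) / len(tokens)

    return confidence


def predict_one_vs_all(models, tokens):
    """Score a document with every one-vs-all model and pick the class
    whose model reports the maximum confidence."""
    confidences = {label: model(tokens) for label, model in models.items()}
    best = max(confidences, key=confidences.get)
    return best, confidences


if __name__ == "__main__":
    models = {
        "networks": make_keyword_model(["routing", "protocol", "packet"]),
        "software": make_keyword_model(["compiler", "testing", "refactoring"]),
    }
    print(predict_one_vs_all(models, ["routing", "protocol", "testing", "latency"]))
```

Taking the maximum only makes sense if the confidences of the different models are comparable, which is not guaranteed for independently trained models.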

Regarding the posted process: add the AttributeWeightsWriter after the FeatureSelection operator.

regards,

Steffen

mdc · November 2009

Thanks for clearing that up. I almost lost hope there.

For each category I have close to 100 examples. BTW, what is the ideal number of examples? I'm only working on the abstract section of the documents.

You're right. One reason my classification did not give good results was overlap between the categories. There are categories that I should have combined. But is it possible to do hierarchical categorization in RapidMiner? Sort of a superclass for some group of classes, so that when the program cannot decide between two classes, it chooses their superclass.
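
To make the question concrete, the behaviour I have in mind looks roughly like the Python sketch below. The `margin` threshold, the function name and the "organic chemistry" class are made up for illustration; the confidences would come from whatever model is applied.

```python
def predict_with_fallback(confidences, parent_of, margin=0.1):
    """Return the top class, or the shared superclass when the two best
    classes are closer than `margin` and have the same parent."""
    # Needs at least two classes in `confidences`.
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    undecided = confidences[best] - confidences[runner_up] < margin
    parent = parent_of.get(best)
    if undecided and parent is not None and parent == parent_of.get(runner_up):
        return parent
    return best


if __name__ == "__main__":
    parent_of = {
        "network administration": "computer science",
        "software engineering": "computer science",
        "organic chemistry": "chemistry",  # hypothetical extra class
    }
    scores = {
        "network administration": 0.41,
        "software engineering": 0.38,
        "organic chemistry": 0.21,
    }
    print(predict_with_fallback(scores, parent_of))
```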

steffen wrote:

Merge classes iteratively and perform a one-vs-all classification. During scoring, aggregate the confidence values from the different models (e.g. by taking the maximum; use the operator AttributeConstruction for that strategy).

Do you have an example for this?

Last question: What exactly does the "attribute weight" do? From what I understand, you apply the attribute weights to an example set to change the values of the attributes. What else is it used for?

thanks a lot.

Matthew

steffen · November 2009

Hello again

mdc wrote:

For each category I have close to 100 examples. BTW, what is the ideal number of examples? I'm only working on the abstract section of the documents.

We are talking about a statistical problem here. Here is another example: you are given a six-sided die and have to decide whether it is fair or not. How often do you have to throw the die to tell? (Wikipedia - Statistical Test). In your case of 15 classes, the interesting question is what performance you have to achieve to be better than random guessing (1/15). I cannot cover this topic here, but there is a lot of statistical literature out there on how to calculate all these numbers (e.g. number of examples per category, minimum performance, etc.).
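
As a rough sketch of the kind of calculation meant here: the exact one-sided binomial test below gives the probability that pure guessing reaches at least a given number of correct predictions. It assumes that guessing picks each of the 15 classes with equal probability and that the documents are independent test examples; the function name is made up for this example.

```python
from math import comb


def p_value_vs_random(correct, total, n_classes=15):
    """Probability that uniform random guessing gets at least `correct`
    of `total` documents right (one-sided exact binomial test)."""
    p = 1.0 / n_classes
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct, total + 1)
    )


if __name__ == "__main__":
    # 150 test documents: random guessing gets about 10 right on average.
    for correct in (10, 15, 20, 30):
        print(correct, p_value_vs_random(correct, 150))
```

A small value means the observed accuracy is unlikely to be luck; it says nothing about whether the accuracy is good enough for the application.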

RapidMiner offers the standard t-test ... but before we start testing, let's see if we can achieve some improvements at all.

mdc wrote:

But is it possible to do hierarchical categorization in RapidMiner?

Like Haddock once said (oh, I should add this one to my signature), "RapidMiner is like Lego". You can achieve nearly anything with the right combination of operators. I will give you some hints:

AttributeConstruction in combination with ChangeAttributeRole or ExchangeAttributeRoles to aggregate labels
ProcessBranch to realize an if-else-statement
ValueIterator allows you to iterate over the values of your label attribute
ProcessLog to log the performance

It is quite hard to create an automatic process that finds the optimal merge of categories for your problem. Indeed, it would take an experienced user more than an hour, so I suggest that you try manual combinations (using the domain knowledge you have) to get a better feeling for which classes to merge. Please understand that I cannot provide a complete process here. Play around, and I guarantee that you will appreciate RapidMiner more and more.

mdc wrote:

Last question: What exactly does the "attribute weight" do? From what I understand, you apply the attribute weights to an example set to change the values of the attributes. What else is it used for?

The AttributeWeight is an indication of how important the attribute is for distinguishing the classes. In the case of FeatureSelection it is always 1 or 0 (use it or don't); other operators (like InformationGainWeighting) provide a less crisp evaluation. Use the operator AttributeWeightSelection to filter the attributes and remove redundant or (worse) disturbing information.
As I said above, the optimal featureset may / will depend on the current "merge situation" of your categories.
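
To show what a "less crisp" weight looks like, here is a small Python sketch of the textbook information gain for a binary "term present / absent" attribute, followed by a threshold filter. It is only an illustration of the idea; the exact values computed by InformationGainWeighting (e.g. any normalization) may differ, and all function names are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(documents, labels, term):
    """Entropy reduction obtained by splitting on 'term present / absent'."""
    with_term = [y for doc, y in zip(documents, labels) if term in doc]
    without = [y for doc, y in zip(documents, labels) if term not in doc]
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_term, without)
        if part
    )
    return entropy(labels) - remainder


def select_by_weight(weights, threshold):
    """Keep only attributes whose weight reaches the threshold."""
    return sorted(term for term, w in weights.items() if w >= threshold)


if __name__ == "__main__":
    docs = [{"routing", "the"}, {"routing", "model"},
            {"compiler", "the"}, {"compiler", "model"}]
    labels = ["networks", "networks", "software", "software"]
    weights = {t: information_gain(docs, labels, t)
               for t in ("routing", "compiler", "the", "model")}
    print(weights)
    print(select_by_weight(weights, 0.5))
```

Terms that occur equally often in every class get a weight of zero and are dropped by the filter.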

I wish you success

regards,

Steffen

PS: If it won't work, try this: http://www.youtube.com/watch?v=egfCXLHfw-M (I cannot get rid of this song)

mdc · November 2009

I guess I'll need to spend a lot more time with RapidMiner to become familiar with all the operators. In the meantime, I'll try the basic classification first before I move on to the hierarchical one.

Last question: For the feature selection, do you apply FeatureSelection to one class only or to more than one class? What I mean is how many classes to input in the TextInput operator. I tried both. The feature selection with one class runs fast, but the one with many classes failed with the error message "OutOfMemoryError: Java heap space". Is it OK to run FeatureSelection separately for each class and then combine the attribute weight results later on?

thanks,
Matthew

steffen · November 2009

Hello Matthew

I suppose that by "one class" you mean "one class vs. all other classes"; otherwise it makes no sense. As explained above, the FeatureSelection tries to find a feature set which contains enough / exactly the information (limited to the information available through the data) you need to separate the classes for the current classification problem, i.e. the current label.

That means the feature set will most probably change when you change the label. So it makes no sense to say which strategy is the correct one. The question is what you want to achieve and (as we have seen above) what can be learned.

If you have memory problems, try the operator GeneticAlgorithm instead, which delivers comparable results.

regards,

Steffen

PS: I have got the slight feeling that you are missing some data mining basics. I suggest this book. RapidMiner is a tool for applying a science, so it is better to learn the science first and the tool afterwards. No offense.

land · November 2009

Hi,
for FeatureSelection, you will need to have all classes of your classification task, because the selection optimizes the feature set for exactly this classification task. That's why there is a learner and a cross-validation inside: to estimate the performance in this classification task on the current attribute set.
If your data set contains only one class, you don't need any feature at all, hence the forward selection is very fast. The performance is simply always 100%, with or without features.
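
The principle can be sketched in a few lines of Python. This is not the implementation of the FeatureSelection operator, just a minimal forward selection under assumptions: a 1-nearest-neighbour learner, leave-one-out validation instead of a full cross-validation, and a majority-class prediction when no feature is selected.

```python
def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour learner restricted
    to the given feature indices. With no features, predict the majority
    class of the remaining examples."""
    correct = 0
    for i, (row, label) in enumerate(zip(rows, labels)):
        others = [j for j in range(len(rows)) if j != i]
        if features:
            nearest = min(
                others,
                key=lambda j: sum((row[f] - rows[j][f]) ** 2 for f in features),
            )
            predicted = labels[nearest]
        else:
            rest = [labels[j] for j in others]
            predicted = max(set(rest), key=rest.count)
        correct += predicted == label
    return correct / len(rows)


def forward_selection(rows, labels):
    """Greedily add the feature that improves accuracy most; stop when
    no remaining feature gives an improvement."""
    selected = []
    best = loo_accuracy(rows, labels, selected)
    remaining = set(range(len(rows[0])))
    while remaining:
        scores = {f: loo_accuracy(rows, labels, selected + [f]) for f in remaining}
        candidate = max(sorted(scores), key=scores.get)
        if scores[candidate] <= best:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = scores[candidate]
    return selected, best


if __name__ == "__main__":
    rows = [[0, 5], [0, 1], [1, 5], [1, 1]]  # feature 0 is informative, feature 1 is noise
    print(forward_selection(rows, ["a", "a", "b", "b"]))
    print(forward_selection(rows, ["a", "a", "a", "a"]))  # only one class
```

With a single class the empty feature set already reaches 100% in this sketch, so the loop stops after the first round without selecting anything.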

If you need forward selection and the genetic selection doesn't fit your needs, we provide a plugin with an improved and very memory-efficient version of the FeatureSelection. You can ask for a quote if you like.

Greetings,
Sebastian

mdc · November 2009

Hi,

steffen wrote:

PS: I have got the slight feeling that you are missing some data mining basics. I suggest this book. RapidMiner is a tool for the application to a science, so it is better to learn the science first and the tool afterwards. No offense.

Can you suggest a good text mining book? My application is limited to text mining, and the text mining book I have is not enough to understand most of the operators in RM. I doubt, though, that there is a text mining book that explains most of the RM operators the way the book you suggested does. I'll buy it anyway.
I think Rapid-I should publish a book on data mining using RM. The content of this forum is more than enough to fill a book.

thanks,
Matthew

land · November 2009

Hi,
you won't believe it, but we are working on a book...

Greetings,
Sebastian

mdc · November 2009

Sebastian Land wrote:

you won't believe but we are working on a book...

That's good news. When can we expect this book?

Matthew

land · November 2009

Hi,
this depends on our workload for other projects and such stuff. A first introductory part should be published together with the final release. Let's hope we get it done by then...

Greetings,
Sebastian
