"How to improve Classification in Text Mining"

mdc Member Posts: 58 Maven
edited May 2019 in Help
I'm classifying technical papers into 15 classes using their abstracts.

My processes are simple.

Learning:
+ TextInput
  + String Tokenizer
  + English StopwordFilter
  + TokenLengthFilter
+ Binary2MultiClassLearner
  + LibSVMLearner
+ ModelWriter

Applying:
+ TextInput
  + String Tokenizer
  + English StopwordFilter
  + TokenLengthFilter
+ ModelLoader
+ ModelApplier
+ ExcelExampleSetWriter
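
For readers outside RapidMiner, the preprocessing chain shared by both processes can be sketched in plain Python. This is only an illustration: the stopword list, the length bounds, and the function name `preprocess` are assumptions, not RapidMiner's actual defaults.

```python
import re

# Tiny illustrative stopword list (assumption; a real filter uses a much longer one).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "for", "to", "we"}

def preprocess(text, min_len=3, max_len=25):
    """Tokenize, drop English stopwords, then drop tokens outside the length bounds."""
    tokens = re.findall(r"[a-z]+", text.lower())                # String Tokenizer
    tokens = [t for t in tokens if t not in STOPWORDS]          # English StopwordFilter
    return [t for t in tokens if min_len <= len(t) <= max_len]  # TokenLengthFilter

tokens = preprocess("We propose a method for the classification of XML documents.")
print(tokens)
```

The important point is that learning and applying must run exactly the same chain, otherwise the model sees different attributes at application time.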

I get results, but I'm not satisfied with them. How do I improve them?

I've been searching the forum and have seen that feature selection is one way. There are lots of examples that use the FeatureSelection operator, but I couldn't find one that writes to a model file. One example from the installer is shown below, but I couldn't figure out where to add the ModelWriter. Or am I thinking about this wrong?
....
+ FeatureSelection
  + XValidation
    + NearestNeighbors
    + OperatorChain
      + ModelApplier
      + Performance
  + ProcessLog

I'm also thinking of forcing larger weights onto some attributes. Is this a good idea, and how do I do it?

thanks,
Matthew

Answers

  • fischer Member Posts: 439 Maven
    Hi,

    Regarding the feature selection: What you want to do is probably not to use a ModelApplier, but rather save the attribute weights (AttributeWeightsWriter) and apply them (AttributeWeightsApplier).
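
The idea of saving weights in the learning process and applying them in the applying process can be sketched outside RapidMiner as follows. The JSON file, the function names, and the example values are all hypothetical; RapidMiner uses its own weights file format.

```python
import json
import os
import tempfile

def write_weights(weights, path):
    """Persist attribute weights (analogous in spirit to AttributeWeightsWriter)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(weights, fh)

def apply_weights(example, weights):
    """Scale each attribute by its weight and drop attributes whose weight is 0."""
    return {a: v * weights[a] for a, v in example.items() if weights.get(a, 0) != 0}

# Weights as a feature selection would produce them: 1 = keep, 0 = discard.
selected = {"svm": 1, "kernel": 1, "paper": 0}
path = os.path.join(tempfile.mkdtemp(), "weights.json")
write_weights(selected, path)

# Later, in the applying process: load the weights and filter a new example.
with open(path, encoding="utf-8") as fh:
    loaded = json.load(fh)
new_example = {"svm": 0.4, "kernel": 0.2, "paper": 0.9, "unseen": 0.5}
reduced = apply_weights(new_example, loaded)
print(reduced)
```

Attributes that were never seen during selection (here `unseen`) are dropped as well, so the classifier only receives the attributes it was trained on.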

    Regarding the optimization of the setup: There is no general answer. Try optimizing parameters of the SVM and of the text input, try adding term n-grams, etc., maybe add a dictionary for synonyms. It very much depends on your texts.
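
As a rough illustration of what "adding term n-grams" means, the sketch below builds unigrams and bigrams from a token list. The function name and the underscore separator are assumptions made for this example.

```python
def term_ngrams(tokens, max_n=2):
    """Return all term n-grams of length 1..max_n, joined with underscores."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("_".join(tokens[i:i + n]))
    return grams

features = term_ngrams(["support", "vector", "machine"], max_n=2)
print(features)
```

Bigrams such as `support_vector` can carry meaning that the single terms lack, at the cost of a much larger attribute set.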

    Cheers,
    Simon
  • haddock Member Posts: 849 Maven
    Hi,

    Sometimes it is tempting to tweak the answer, and to forget about whether the question makes any sense. Fifteen classes? Think how many examples would be necessary to represent the problem space.

  • fischer Member Posts: 439 Maven
    Umh. Yes. Actually, I missed that part in the original post. I agree with haddock. If you have 15 classes it is not particularly surprising that you are not satisfied with the results :-)
  • mdc Member Posts: 58 Maven
    Thanks for the answers, guys. I was actually thinking of adding more classes.

    Then what is the ideal number of classes for text classification? And how do you solve the problem of classifying technical documents into many categories? Is data mining not the solution?

    Matthew
  • mdc Member Posts: 58 Maven

    Also, where would you add the AttributeWeightsWriter operator in this example?

    + FeatureSelection
      + XValidation
        + NearestNeighbors
        + OperatorChain
          + ModelApplier
          + Performance
      + ProcessLog

    thanks,
    Matthew
  • steffen Member Posts: 347 Maven
    Jumping in ...

    Of course, Data Mining is the solution  ;D

    Regarding the number of classes: What haddock meant was that you need a lot of examples / documents per category a) to have enough information to distinguish the classes and b) to make statistically reliable performance estimates. So ... how many do you have?

    Low performance values are an indication that the classes cannot be easily distinguished. Here are some rough ideas:
    • If the classes are the leaves of a hierarchy, try to go up the hierarchy and merge classes (e.g. merge "network administration" and "software engineering" into "computer science") to see whether the results get better. Performing feature selection on different "levels" and comparing the results manually may give you a better feeling for where the problem is located.
    • Merge classes iteratively and perform a one-vs-all classification. During scoring, aggregate the confidence values from the different models (e.g. by taking the maximum; use the operator AttributeConstruction for that strategy).
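
Both ideas above can be sketched in a few lines of Python. The class hierarchy, the confidence values, and the function names are made up for illustration; in RapidMiner the same logic would be built from operators.

```python
# Hypothetical mapping from leaf classes to a merged parent class.
PARENT = {
    "network administration": "computer science",
    "software engineering": "computer science",
    "organic chemistry": "chemistry",
}

def merge_label(label):
    """Map a leaf label to its parent; labels without a parent stay unchanged."""
    return PARENT.get(label, label)

def aggregate_max(confidences):
    """Pick the class whose one-vs-all model reports the highest confidence."""
    return max(confidences, key=confidences.get)

# Confidence of the positive class from each one-vs-all model (made-up numbers).
scores = {"computer science": 0.62, "chemistry": 0.35, "physics": 0.48}
merged = merge_label("software engineering")
winner = aggregate_max(scores)
print(merged, "|", winner)
```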

    Regarding the posted process: add the AttributeWeightsWriter after the FeatureSelection operator.

    regards,

    Steffen
  • mdc Member Posts: 58 Maven
    Thanks for clearing that up. I almost lost hope there.

    For each category I have close to 100 examples. BTW, what is the ideal number of examples? I'm only working on the abstract section of the documents.

    You're right. One reason my classification did not give good results was overlap between the categories. There are categories that I should have combined. But is it possible to do hierarchical categorization in RapidMiner? Sort of a superclass for some group of classes, so that when the program cannot decide between two classes, it chooses their superclass.

    steffen wrote:

    Merge classes iteratively and perform a one-vs-all classification. During scoring, aggregate the confidence values from the different models (e.g. by taking the maximum; use the operator AttributeConstruction for that strategy).
    Do you have an example for this?

    Last question: What exactly does the "attribute weight" do? From what I understand, you apply the attribute weight to an example set to change the values of the attributes. What else is it used for?

    thanks a lot.

    Matthew

  • steffen Member Posts: 347 Maven
    Hello again

    mdc wrote:

    For each category I have close to 100 examples. BTW, what is the ideal number of examples? I'm only working on the abstract section of the documents.
    We are talking about a statistical problem here. I will give you another example: You are given a six-sided die and have to decide whether it is fair or not. How often do you have to throw the die to tell? (Wikipedia - Statistical Test). In your case of 15 classes, the interesting question is what performance you have to achieve to be better than random guessing (1/15). I cannot cover this topic here, but there is a lot of statistical literature out there on how to calculate these numbers (e.g. number of examples per category, minimum performance, etc.).
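
To make the "better than random" question concrete, here is a minimal sketch of a one-sided binomial test against the 1/15 chance level. It assumes a guesser that hits each document with probability 1/15; the test-set size and the number of correct predictions are invented for the example.

```python
from math import comb

def p_value_better_than_chance(correct, total, n_classes):
    """One-sided binomial tail: P(X >= correct) if every guess hits with prob 1/n_classes."""
    p = 1.0 / n_classes
    return sum(comb(total, k) * p**k * (1 - p)**(total - k)
               for k in range(correct, total + 1))

n_classes = 15
chance = 1.0 / n_classes
print(f"chance accuracy: {chance:.3f}")

# Hypothetical test set: 150 abstracts, 20 of them classified correctly.
p_value = p_value_better_than_chance(20, 150, n_classes)
print(f"p-value: {p_value:.4f}")
```

A small p-value only says that the classifier beats guessing; it says nothing about whether the accuracy is good enough for the application.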

    RapidMiner offers the standard t-test ... but before we start testing, let's see if we can achieve some improvements at all.

    mdc wrote:

    But is it possible to do hierarchical categorization in RapidMiner?
    As haddock once said (oh, I should add this one to my signature), "RapidMiner is like Lego". You can achieve nearly anything with the right combination of operators. I will give you some hints:
    • AttributeConstruction in combination with ChangeAttributeRole or ExchangeAttributeRoles to aggregate labels
    • ProcessBranch to realize an if-else-statement
    • ValueIterator allows you to iterate over the values of your label attribute
    • ProcessLog to log the performance
    It is quite hard to create an automatic process that finds the optimal merge of categories for your problem. Indeed, it would take an hour or more even for an experienced user, so I suggest that you try manual combinations (using the domain knowledge you have) to get a better feeling for which classes to merge. Please understand that I cannot provide a complete process here. Play around and I guarantee that you will appreciate RapidMiner more and more ;).
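
The "choose the superclass when two classes are too close" behaviour asked about above could look roughly like this outside RapidMiner. It is a sketch under assumptions: the hierarchy, the `margin` threshold, and the confidence values are all hypothetical.

```python
def predict_with_fallback(confidences, parent, margin=0.1):
    """Return the top class, or the shared parent when the two best classes are too close."""
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    close = confidences[top] - confidences[runner_up] < margin
    if close and parent.get(top) is not None and parent.get(top) == parent.get(runner_up):
        return parent[top]
    return top

parent = {
    "network administration": "computer science",
    "software engineering": "computer science",
}
undecided = {"network administration": 0.41, "software engineering": 0.39, "organic chemistry": 0.20}
decided = {"network administration": 0.70, "software engineering": 0.20, "organic chemistry": 0.10}
print(predict_with_fallback(undecided, parent))
print(predict_with_fallback(decided, parent))
```

The fallback only triggers when both candidates share a parent, so two close but unrelated classes still yield the top-scoring leaf class.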

    mdc wrote:

    Last question: What exactly does the "attribute weight" do? From what I understand, you apply the attribute weight to an example set to change the values of the attributes. What else is it used for?
    The attribute weight is an indication of how important the attribute is for distinguishing the classes. In the case of FeatureSelection it is always 1 or 0 (use it or don't); other operators (like InformationGainWeighting) provide a less crisp evaluation. Use the operator AttributeWeightSelection to filter the attributes and remove redundant or (worse) disturbing information.
    As I said above, the optimal featureset may / will depend on the current "merge situation" of your categories.
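
As an illustration of a "less crisp" weight, the sketch below computes the information gain of a term's presence with respect to the label and then keeps only the terms above a threshold. The toy data, the threshold of 0.5, and the function names are assumptions; this is not RapidMiner's implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(has_term, labels):
    """Entropy reduction obtained by splitting the documents on term presence."""
    with_term = [y for x, y in zip(has_term, labels) if x]
    without = [y for x, y in zip(has_term, labels) if not x]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_term, without) if part)
    return entropy(labels) - remainder

labels = ["cs", "cs", "chem", "chem"]
weights = {
    "kernel": information_gain([1, 1, 0, 0], labels),  # separates the classes perfectly
    "paper": information_gain([1, 0, 1, 0], labels),   # carries no class information
}
kept = [term for term, w in weights.items() if w >= 0.5]  # threshold is arbitrary
print(weights, kept)
```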

    I wish you success

    regards,

    Steffen

    PS: If it won't work, try this: http://www.youtube.com/watch?v=egfCXLHfw-M (cannot get rid of this song :()
  • mdc Member Posts: 58 Maven
    I guess I'll need to spend a lot more time with RapidMiner to become familiar with all the operators. In the meantime, I'll try the basic classification first before I move on to the hierarchical one.

    Last question: Do you apply FeatureSelection to one class only or to more than one class? What I mean is how many classes to feed into the TextInput operator. I tried both. FeatureSelection with one class runs fast, but the run with many classes failed with the error message "OutOfMemoryError: Java heap space". Is it OK to run FeatureSelection separately for each class and then combine the attribute weight results later on?

    thanks,
    Matthew
  • steffen Member Posts: 347 Maven
    Hello Matthew

    I suppose that by "one class" you mean "one class vs. all other classes"; otherwise it makes no sense. As mentioned above, FeatureSelection tries to find a feature set that contains enough of (ideally exactly) the information you need to separate the classes for the current classification problem, i.e. the current label. It is limited to the information available in the data.

    That means the feature set will most probably change when you change the label. So there is no single correct strategy; the question is what you want to achieve and (as we have seen above) what can be learned.

    If you have memory problems, try the operator GeneticAlgorithm instead, which delivers comparable results.

    regards,

    Steffen

    PS: I have got the slight feeling that you are missing some data mining basics. I suggest this book. RapidMiner is a tool for the application to a science, so it is better to learn the science first and the tool afterwards. No offense ;).
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    for FeatureSelection, you will need to have all classes of your classification task, because the selection optimizes the feature set for exactly this classification task. That's why there's a learner and a cross-validation inside: to estimate the performance on this classification task with the current attribute set.
    If your data set contains only one class, you don't need any feature at all, hence the forward selection is very fast. The performance is simply always 100%, with or without features.
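
The wrapper idea described above (a learner plus a validation inside the selection loop) can be sketched as a greedy forward selection around a leave-one-out 1-nearest-neighbour estimate. The toy data and function names are hypothetical, and leave-one-out stands in for the cross-validation to keep the example short.

```python
def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour learner on the given features."""
    hits = 0
    for i, row in enumerate(rows):
        best_j = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[f] - rows[j][f]) ** 2 for f in features),
        )
        hits += labels[best_j] == labels[i]
    return hits / len(rows)

def forward_selection(rows, labels, candidates):
    """Greedily add the feature that improves the estimate most; stop when none does."""
    selected, best = [], -1.0
    while True:
        trials = {f: loo_accuracy(rows, labels, selected + [f])
                  for f in candidates if f not in selected}
        if not trials:
            break
        f, score = max(trials.items(), key=lambda item: item[1])
        if score <= best:
            break
        selected.append(f)
        best = score
    return selected, best

rows = [
    {"kernel": 1.0, "noise": 0.3},
    {"kernel": 0.9, "noise": 0.8},
    {"kernel": 0.1, "noise": 0.2},
    {"kernel": 0.0, "noise": 0.9},
]
two_classes = forward_selection(rows, ["cs", "cs", "chem", "chem"], ["kernel", "noise"])
one_class_no_features = loo_accuracy(rows, ["cs"] * 4, [])
print(two_classes, one_class_no_features)
```

With a single class, the estimate is already perfect without any feature, which is why the selection has nothing to optimize in that case.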

    If you need forward selection and the genetic selection doesn't fit your need, we provide a plugin with an improved and very memory efficient version of the FeatureSelection. You might ask for a quote, if you want.

    Greetings,
      Sebastian
  • mdc Member Posts: 58 Maven
    Hi,
    steffen wrote:

    PS: I have got the slight feeling that you are missing some data mining basics. I suggest this book. RapidMiner is a tool for the application to a science, so it is better to learn the science first and the tool afterwards. No offense
    Can you suggest a good text mining book? My application is limited to text mining, and the text mining book I have is not enough to understand most of the operators in RM. I doubt, though, that there is a text mining book that explains most of the RM operators the way the book you suggested does. I'll buy it anyway.
    I think Rapid-I should publish a book on data mining using RM. The content of this forum is more than enough to fill a book.

    thanks,
    Matthew
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    you won't believe it, but we are working on a book...

    Greetings,
      Sebastian
  • mdc Member Posts: 58 Maven
    Sebastian Land wrote:

    you won't believe but we are working on a book...
    That's good news. When can we expect this book?

    Matthew
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    this depends on our workload for other projects and the like. A first introductory part should be published together with the final release. Let's hope we get it done by then...

    Greetings,
      Sebastian