Another query on document classification and assigning weights to keywords

S_S_ Member Posts: 7 Contributor II
edited December 2019 in Help
Hi,

Thanks for the response earlier.

I have a couple of more questions on document classification although unrelated to what I asked last time around.

+ I am developing a Naive Bayes model on historical data (the label has 4 categories) to classify documents. My sample is pretty skewed (2 of the categories dominate). Is it important to have the data balanced (i.e., 25% per category)? I ask because the accuracy of my model is only 70%, even though I feel it should be around 80%-85%, as the data I am analyzing is pretty descriptive and of good quality.

+ Based on your experience, how important is filtering stopwords when building a classification model? Currently, I have used only the English stopwords. Depending on your response, I may have to build a dictionary of my own to filter out additional stopwords.

+ How can I assign weights to certain keywords in Rapidminer? I think this will help me to improve accuracy of the model.

+ As an alternative, is it possible to classify documents purely based on keywords for each category in an input file without actually building a model for classification (KNN, Naive Bayes)?

Thanks.
Regards,
Sharath

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi again

    + I am developing a Naive Bayes model on historical data (the label has 4 categories) to classify documents. My sample is pretty skewed (2 of the categories dominate). Is it important to have the data balanced (i.e., 25% per category)? I ask because the accuracy of my model is only 70%, even though I feel it should be around 80%-85%, as the data I am analyzing is pretty descriptive and of good quality.
    Kind of. First, the model has a prior: if it does not know anything about a document, it may simply predict the most frequent class. For that reason I would balance the data.
    Furthermore, accuracy as a measure is highly dependent on class balance. If you have unbalanced data, accuracy becomes hard to interpret.
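
To make the point about accuracy concrete, here is a minimal Python sketch (not a RapidMiner process). The label counts are invented for illustration, and "HR" and "Legal" are hypothetical category names. It scores a "model" that always predicts the most frequent class and compares plain accuracy with the mean per-class recall (often called balanced accuracy).

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical skewed label distribution: two of four categories dominate.
y_true = ["Operations"] * 45 + ["Finance"] * 40 + ["HR"] * 10 + ["Legal"] * 5

# A "model" that knows nothing and always predicts the most frequent class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

acc = accuracy(y_true, y_pred)           # 45 of 100 correct
bal = balanced_accuracy(y_true, y_pred)  # recall is 1, 0, 0, 0 over the four classes
print(f"accuracy={acc:.2f}  balanced accuracy={bal:.2f}")
```

With these made-up counts, the majority-class baseline already reaches 45% accuracy without learning anything, which is why a raw accuracy figure is hard to judge on skewed data.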

    Based on your experience, how important is filtering stopwords when building a classification model? Currently, I have used only the English stopwords. Depending on your response, I may have to build a dictionary of my own to filter out additional stopwords.
    I personally think that it is not that important, because most stop words are thrown out by TF/IDF or feature selection anyway.
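
A small Python sketch of why TF/IDF tends to neutralize stop words, assuming the plain textbook definition idf = log(N / df). The exact formula in any given tool, including RapidMiner, may differ, and the three documents are invented.

```python
import math

def idf(term, docs):
    """Inverse document frequency log(N / df); 0.0 if the term never occurs."""
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df) if df else 0.0

# Hypothetical tokenized e-mails.
docs = [
    "the invoice for the march payment is attached".split(),
    "the server outage in the data center is resolved".split(),
    "please approve the budget for the new hire".split(),
]

idf_the = idf("the", docs)          # occurs in every document
idf_is = idf("is", docs)            # occurs in two of three documents
idf_invoice = idf("invoice", docs)  # occurs in one document

for term, value in [("the", idf_the), ("is", idf_is), ("invoice", idf_invoice)]:
    print(f"{term:8s} idf={value:.3f}")
```

A word that appears in every document gets an IDF of zero, so its TF/IDF value vanishes regardless of how often it occurs. Words that are merely frequent are down-weighted rather than removed; a subsequent feature selection step would be what actually drops them.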

    As an alternative, is it possible to classify documents purely based on keywords for each category in an input file without actually building a model for classification (KNN, Naive Bayes)?
    So you would simply count keyword occurrences? Yes, it is. I built a process like this somewhere here in the forum.
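
The counting idea can be sketched in a few lines of Python. This is an illustration only, not the forum process mentioned above; the keyword lists, the `classify` helper, and the "Unknown" fallback are all assumptions.

```python
import re
from collections import Counter

# Hypothetical keyword lists; in practice these would be read from an input file.
KEYWORDS = {
    "Operations": {"outage", "server", "deployment", "shipment"},
    "Finance": {"invoice", "payment", "budget", "refund"},
}

def classify(text, keywords=KEYWORDS, default="Unknown"):
    """Return the category whose keywords occur most often in the text."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        label: sum(tokens[word] for word in words)
        for label, words in keywords.items()
    }
    best = max(scores, key=scores.get)
    # If no keyword matched at all, fall back to the default label.
    return best if scores[best] > 0 else default

print(classify("Invoice 42: payment overdue"))
print(classify("Server outage last night"))
print(classify("Hello there"))
```

Ties are resolved in favor of whichever category comes first in the dictionary, which is an arbitrary choice; a real process would need an explicit tie-breaking rule.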


    Btw: Have you tried a linear SVM?
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • S_S_ Member Posts: 7 Contributor II
    Thanks a lot for the prompt response, Martin! This really helps.

    I think you missed my query on assigning weights. I would appreciate it if you could respond to that one as well.


    Thanks.
    Regards,
    Sharath
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    The answer is basically that you cannot add weights for attributes, only for examples. The reason is that most models choose their weights on their own. Think of a linear regression: there you do not want to set the coefficients (~weights) yourself.
    The only thing you can do is duplicate attributes.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • S_S_ Member Posts: 7 Contributor II
    In my case I have only two columns: 1. Subject + content of an email, 2. Email label (the category to which it belongs: Operations, Finance, etc.)

    Could you please elaborate a bit on what you mean by adding weights to examples and not attributes with reference to my case above?

    Also when you say duplicating attributes do you mean duplicating certain mails (in my case) that are very descriptive and have a lot of keywords before building a model?

    Thanks again.

    Sharath
  • S_S_ Member Posts: 7 Contributor II
    One more thing, when I say adding weights I do not refer to the coefficients of a model but something similar to oversampling and undersampling (i.e. giving more weight to certain records that are more descriptive than some of the others).
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Oh, in that case:

    Add another column with Generate Attributes and set its role to weight. Then all learners that can handle weights will use them.
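
To illustrate what a weight-aware learner does with such a column, here is a minimal Python sketch of a multinomial Naive Bayes with Laplace smoothing in which every example carries a weight. It is not RapidMiner's implementation; the function names and the tiny data set are hypothetical.

```python
import math
from collections import defaultdict

def train_weighted_nb(examples):
    """examples: list of (tokens, label, weight). Counts are summed as weights."""
    class_weight = defaultdict(float)
    word_weight = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for tokens, label, w in examples:
        class_weight[label] += w            # weighted class prior
        for t in tokens:
            word_weight[label][t] += w      # weighted word count
            vocab.add(t)
    return class_weight, word_weight, vocab

def predict(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_weight, word_weight, vocab = model
    total = sum(class_weight.values())
    best_label, best_score = None, -math.inf
    for label, cw in class_weight.items():
        denom = sum(word_weight[label].values()) + len(vocab)
        score = math.log(cw / total)
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                score += math.log((word_weight[label].get(t, 0.0) + 1.0) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def make_examples(finance_weight):
    # Hypothetical e-mails; only the weight of the Finance example varies.
    return [
        (["payment", "delay"], "Finance", finance_weight),
        (["shipment", "delay"], "Operations", 1.0),
        (["server", "delay"], "Operations", 1.0),
    ]

pred_equal = predict(train_weighted_nb(make_examples(1.0)), ["delay"])
pred_boosted = predict(train_weighted_nb(make_examples(5.0)), ["delay"])
print(pred_equal, pred_boosted)
```

With equal weights the ambiguous word "delay" goes to the majority class; raising the weight of the single Finance example to 5 flips the prediction. In this count-based sketch an integer weight k has the same effect as duplicating the example k times, which is the link to over- and undersampling raised earlier in the thread.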
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany