A question about Naive Bayes-based text classification

gfyang Member Posts: 29 Maven
edited July 2019 in Help
Hi,

I am testing Naive Bayes (NB) for text classification. To my understanding, the result should not be affected by the TF-IDF vectors of the text, because NB considers the frequency of each term (t) in each category (c), i.e., p(t | c), and this information is stored in the WordList, not in the term vectors (i.e., the ExampleSet). Is that right?
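
To make that concrete, the rule I have in mind is the standard multinomial NB decision rule (a sketch; the notation is mine):

$$\hat{c} = \arg\max_{c} \; p(c) \prod_{t \in d} p(t \mid c)^{\mathrm{tf}(t,d)}$$

where $\mathrm{tf}(t,d)$ is the count of term t in document d. Both p(c) and p(t | c) are estimated from term counts per category, which is why I expected only the WordList to matter.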

However, after I changed the TF-IDF values in the ExampleSet, for example by multiplying them by a weight x with 0 < x < 1, the accuracy changed, and it changed differently for different weights x. Why?

Sincerely yours,
gfyang

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    NaiveBayes is a general learning algorithm that works on tables. You can use it for text classification, but it is applicable to all other kinds of problems, too.
    Although the original TF-IDF values of the documents were calculated using the word list, Naive Bayes doesn't know about that. It only takes the example set into consideration.
    On the other hand, if you apply a weight transformation to all examples of the example set in the same way, the Naive Bayes result shouldn't differ, because it treats all attributes as independent of each other. But there might be some numerical problems at the limits of the computer's precision, causing slightly different results.
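
    Here is a minimal sketch of why (plain Java, not RapidMiner code; it assumes one Gaussian density per class and attribute, the usual model for numerical attributes). Multiplying every value by the same weight x rescales the estimated mean and standard deviation by x as well, so the likelihood ratio between the classes, and with it the prediction, stays the same:

    public class ScalingInvariance {

        // Gaussian density N(v; mu, sigma)
        static double density(double v, double mu, double sigma) {
            double z = (v - mu) / sigma;
            return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
        }

        static double mean(double[] vals) {
            double s = 0;
            for (double v : vals) s += v;
            return s / vals.length;
        }

        static double std(double[] vals) {
            double m = mean(vals), s = 0;
            for (double v : vals) s += (v - m) * (v - m);
            return Math.sqrt(s / (vals.length - 1));
        }

        // likelihood ratio p(x*v | A) / p(x*v | B) after scaling everything by x
        static double ratio(double[] a, double[] b, double v, double x) {
            double pa = density(x * v, x * mean(a), x * std(a));
            double pb = density(x * v, x * mean(b), x * std(b));
            return pa / pb; // the 1/x factors cancel, so this is the same for every x > 0
        }

        public static void main(String[] args) {
            double[] classA = { 1.0, 2.0, 3.0 }; // toy TF-IDF values for class A
            double[] classB = { 4.0, 5.0, 6.0 }; // toy TF-IDF values for class B
            double v = 2.5;                      // value to classify
            System.out.println("x = 1.00: " + ratio(classA, classB, v, 1.00));
            System.out.println("x = 0.35: " + ratio(classA, classB, v, 0.35));
        }
    }

    Up to floating-point rounding, both lines print the same ratio; that rounding is exactly where the "limits of precision" caveat comes from.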

    Greetings,
      Sebastian
  • gfyang Member Posts: 29 Maven
    Hi, Sebastian,

    Thank you for the reply.

    I ran several experiments. For example, I multiply all the TF-IDF values by the same weight, then change the weight and apply it to all the TF-IDF values again. The results show that such a weight adjustment really does change the accuracy, even though all the TF-IDF values are adjusted by exactly the same weight.

    // (uses java.util.Iterator and the com.rapidminer.example classes)
    double precision = 0.0;

    Iterator<Attribute> attributeIterator; // the iterator over all attributes
    Iterator<Example> exampleIterator;     // the iterator over all examples

    // save the text vectors into an array
    double[][] text_array = new double[num_exp][num_att - 2];
    exampleIterator = exampleSet.iterator(); // move the iterator to the beginning
    for (int i = 0; i < num_exp; i++) {
        Example example = exampleIterator.next(); // read one example
        attributeIterator = attributes.allAttributes(); // build the iterator for the attributes
        for (int j = 0; j < num_att - 2; j++) { // read all attributes except the last two
            Attribute att = attributeIterator.next();
            text_array[i][j] = example.getValue(att); // read the TF-IDF value into the array
        }
    }

    // adjust the TF-IDF values with weights
    double fWeight = 0;
    for (int i = 0; i <= 20; i++) { // 21 weights: 0.00, 0.05, ..., 1.00
        exampleIterator = exampleSet.iterator(); // move the iterator to the beginning
        for (int i2 = 0; i2 < num_exp; i2++) {
            Example example = exampleIterator.next();
            attributeIterator = attributes.allAttributes();
            for (int j = 0; j < num_att - 2; j++) {
                Attribute att = attributeIterator.next();
                double val = text_array[i2][j] * fWeight; // adjust the TF-IDF value by the weight
                example.setValue(att, val); // write the adjusted TF-IDF back into the ExampleSet
            }
        }

        precision = my_validate_classification(); // classify with Naive Bayes on the adjusted TF-IDF values
        System.out.println("(" + fWeight + "): " + precision + " ");

        fWeight += 0.05; // increase the weight
        fWeight = roundTwoDecimals(fWeight); // keep two places after the decimal point
    }
    The results are:

    (weight): precision
    (0.0): 0.0
    (0.05): 0.3875
    (0.1): 0.3125
    (0.15): 0.3125
    (0.2): 0.3125
    (0.25): 0.2875
    (0.3): 0.275
    (0.35): 0.2625
    (0.4): 0.2625
    (0.45): 0.2625
    (0.5): 0.25
    (0.55): 0.25
    (0.6): 0.25
    (0.65): 0.2375
    (0.7): 0.2375
    (0.75): 0.2375
    (0.8): 0.2375
    (0.85): 0.2375
    (0.9): 0.2375
    (0.95): 0.2375
    (1.0): 0.2375
    It seems that the differences in the results are too large to be ignored, so they probably cannot be explained by limited floating-point precision.

    So I guess that when RM runs NB classification, the algorithm really does read the ExampleSet and bases some important calculations on it, which affects the precision directly.

    Sincerely yours,
    gfyang
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    which version of RapidMiner are you using?

    By the way: there are many methods in the RapidMiner API that would make your life simpler...
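
    For example, a small helper can rescale the values in place; a sketch that reuses only the calls from your snippet (the package paths are assumptions based on the 4.x API):

    import java.util.Iterator;

    import com.rapidminer.example.Attribute;
    import com.rapidminer.example.Attributes;
    import com.rapidminer.example.Example;
    import com.rapidminer.example.ExampleSet;

    class WeightHelper {
        // Multiplies every regular attribute value by the given factor, in place.
        // Note: this scales the *current* values, so when sweeping weights either
        // restore from a backup first or pass the ratio newWeight / oldWeight.
        static void applyWeight(ExampleSet exampleSet, Attributes attributes,
                                int numRegular, double factor) {
            Iterator<Example> exampleIterator = exampleSet.iterator();
            while (exampleIterator.hasNext()) {
                Example example = exampleIterator.next();
                Iterator<Attribute> attributeIterator = attributes.allAttributes();
                for (int j = 0; j < numRegular; j++) { // skip the trailing special attributes
                    Attribute att = attributeIterator.next();
                    example.setValue(att, example.getValue(att) * factor);
                }
            }
        }
    }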

    Greetings,
      Sebastian
  • gfyang Member Posts: 29 Maven
    Hi,

    I am using RM version 4.5.

    I am developing a new idea for adjusting the text vectors, and I want to test it with several classic classification methods. I will try the other methods later. :) Thank you for the help.

    Sincerely yours,
    gfyang