"Naive Bayes for Text Classification"

cvidal · September 2015

Hello,

I'm trying to apply Naive Bayes to classify some texts, and I have two questions about how RapidMiner (v5.0.13) implements this classifier:

1.- As far as I know, one of the most frequently used classifiers for text classification is multinomial Naive Bayes. The model I get from the Naive Bayes operator consists of a set of means and standard deviations for the words of my corpus... So, which kind of Naive Bayes classifier is implemented in RapidMiner (Multinomial, Gaussian, Bernoulli)?

2.- I have seen several examples of text classification applying Naive Bayes in RapidMiner. Some of them use the TF-IDF matrix as input both when creating the model and when applying it. I understand that TF-IDF values are used to build the model. However, I suppose that TF-IDF values are not used when applying the model (it would not make sense)... In fact, the "Process Documents" operator receives a Word List as input, and this changes the "Apply Model" output. So,
a) Is it relevant how texts are vectorized (TF, TF-IDF, term occurrences) when applying a Naive Bayes model?
b) Why does the "Process Documents" operator receive a Word List, and how is it used when applying the model?

Thank you in advance.

MartinLiebig · September 2015

Hi!

For 1: Gaussian
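
To illustrate what "Gaussian" means here, this is a minimal sketch of a Gaussian Naive Bayes classifier in plain Python. It is not RapidMiner's code: the function names, the `min_std` floor, and the toy data are assumptions made for illustration. It only shows why such a model ends up as one mean and one standard deviation per word and class.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels, min_std=1e-3):
    """Estimate a class prior plus one (mean, std) pair per attribute and class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):  # iterate over attributes (words)
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            # Floor the std so a constant attribute does not cause division by zero
            # (min_std is an assumption of this sketch).
            stats.append((mean, max(math.sqrt(var), min_std)))
        model[label] = (len(members) / len(rows), stats)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log prior + summed Gaussian log densities."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for value, (mean, std) in zip(row, stats):
            score += -math.log(std * math.sqrt(2 * math.pi))
            score -= (value - mean) ** 2 / (2 * std ** 2)
        return score
    return max(model, key=log_posterior)

# Toy word weights for two hypothetical attributes: ("good", "bad")
train_rows = [[0.9, 0.0], [0.7, 0.1], [0.1, 0.8], [0.0, 0.9]]
train_labels = ["pos", "pos", "neg", "neg"]
model = fit_gaussian_nb(train_rows, train_labels)
prediction = predict_gaussian_nb(model, [0.8, 0.05])
```

Because each attribute is treated as a continuous value with a normal distribution, the numbers fed in (TF, TF-IDF, occurrences) directly shape the estimated means and standard deviations.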

For 2: There are two things to consider:
1. Which attributes to create? The Tokenize operator creates an attribute for every word in your documents (unless the word is pruned). In the apply phase you do not want to create attributes for words that were not in the training set, and vice versa. So this is similar to the preprocessing model in Nominal to Numerical.
2. TF-IDF contains some normalization. This needs to be applied in the apply phase as well.

cheers,
Martin
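
The two points above can be sketched in plain Python. This is a rough illustration, not how RapidMiner implements it: `fit_tfidf`, `transform`, whitespace tokenization, and the `log(N/df)` form of IDF are all assumptions of the sketch. The idea is that the word list and the IDF weights are learned once from the training documents and then reused unchanged on new documents.

```python
import math
from collections import Counter

def fit_tfidf(train_docs):
    """Learn the word list and one IDF weight per word from the training documents."""
    tokenized = [doc.lower().split() for doc in train_docs]
    # Document frequency: in how many training documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    word_list = sorted(doc_freq)
    n_docs = len(tokenized)
    idf = {word: math.log(n_docs / doc_freq[word]) for word in word_list}
    return word_list, idf

def transform(doc, word_list, idf):
    """Vectorize one document using only the stored word list and IDF weights."""
    tokens = doc.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for an empty document
    # Words outside the word list are ignored; missing words get weight 0.
    return [counts[word] / total * idf[word] for word in word_list]

train_docs = ["good movie", "bad movie", "good plot"]
word_list, idf = fit_tfidf(train_docs)

# "soundtrack" was never seen in training, so it produces no attribute.
vector = transform("good soundtrack", word_list, idf)
```

Here `vector` has exactly one entry per word in `word_list`, so it lines up with the attributes the model was trained on, and only the entry for "good" is non-zero.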

cvidal · September 2015

Thanks for your response.

I still do not see how TF-IDF can be used as input to the "Apply Model" operator. Let me try to explain:

If I understand TF-IDF correctly, it makes sense to calculate it when dealing with several (the more the better) documents. TF can be calculated for a single document, but IDF takes into account the rest of the documents of the corpus. So TF-IDF values will vary depending on the entire corpus.

If this is correct, there are several scenarios where applying tf-idf is not a good option, for example:
a) I want to classify only one comment (all the word-attribute values in the TF-IDF matrix will be 0).
b) If I change the corpus, the TF-IDF values for a given comment will change, so the classification could (and probably will) change.
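
Scenario (a) can be checked with a few lines of arithmetic. This sketch assumes the common `log(N/df)` form of IDF (the exact formula RapidMiner uses is not stated in this thread), and the corpus size, document frequency, and term counts are made-up numbers. It compares recomputing IDF on the single comment with reusing document frequencies stored from a training corpus.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency in its plain log(N/df) form (an assumption)."""
    return math.log(n_docs / doc_freq)

# Scenario (a): the "corpus" is only the one comment being classified,
# so every word it contains appears in 1 of 1 documents.
idf_single = idf(n_docs=1, doc_freq=1)

# Alternative: reuse counts stored from a hypothetical training corpus
# of 1000 documents, 10 of which contain the word.
idf_stored = idf(n_docs=1000, doc_freq=10)

tf = 2 / 8  # the word occurs 2 times in an 8-word comment

weight_recomputed = tf * idf_single  # exactly 0.0, as described in (a)
weight_stored = tf * idf_stored      # about 1.15, a usable non-zero value
```

Under these assumptions the all-zero problem only arises if the IDF part is recomputed from the document being classified; with stored document frequencies the comment still gets meaningful weights.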

I have run some tests and seen that using the word list as input to "Process Documents" makes the "Apply Model" operator change its output. I am not sure, but it seems that when the word list is used as input, the output (classification) is the same regardless of how the vectors were created (TF, TF-IDF, etc.).

Can you help me clarify this?

Thanks.
