# "Naive Bayes for Text Classification"

Hello,

I'm trying to apply Naive Bayes to classify some texts, and I have two questions about how RapidMiner (v5.0.13) implements this classifier:

1.- As far as I know, one of the most frequently used classifiers for text classification is multinomial Naive Bayes. The model obtained when using the Naive Bayes operator is composed of a set of means and standard deviations for the words of my corpus... So, **which kind of Naive Bayes classifier is implemented in RapidMiner (Multinomial, Gaussian, Bernoulli)?**

2.- I have seen several examples of text classification applying Naive Bayes in RapidMiner. Some of them use the TF-IDF matrix as input both when creating the model and when applying it. I understand that TF-IDF values are used to **make** the model. However, I suppose that TF-IDF values are not used when **applying** the model (it would not make sense)... In fact, the "Process Documents" operator receives a Word List as input that modifies the "Apply Model" output. So,

a) Is it relevant how texts are vectorized (TF, TF-IDF, term occurrences) when applying a Naive Bayes model?

b) Why does the "Process Documents" operator receive a Word List, and how is it used when applying the model?

Thank you in advance.

## Answers

For 1: Gaussian
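To illustrate what a Gaussian model with "means and standard deviations for the words" amounts to, here is a minimal Python sketch. It is not RapidMiner's implementation: the function names, the population variance, and the small floor added to each standard deviation are assumptions made only for this example.

```python
import math


def fit_gaussian_nb(rows, labels):
    """Return, per class, its prior and one (mean, std) pair per attribute."""
    model = {}
    for label in set(labels):
        class_rows = [r for r, l in zip(rows, labels) if l == label]
        stats = []
        for column in zip(*class_rows):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            # Assumption: a small floor keeps the std away from zero.
            stats.append((mean, math.sqrt(var) + 1e-3))
        model[label] = (len(class_rows) / len(rows), stats)
    return model


def predict(model, row):
    """Pick the class with the highest log prior plus Gaussian log likelihoods."""
    def log_score(prior, stats):
        total = math.log(prior)
        for x, (mean, std) in zip(row, stats):
            total += -math.log(std * math.sqrt(2 * math.pi))
            total -= (x - mean) ** 2 / (2 * std ** 2)
        return total

    return max(model, key=lambda label: log_score(*model[label]))


# Toy word weights for two attributes (for example, two words).
rows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["pos", "pos", "neg", "neg"]
model = fit_gaussian_nb(rows, labels)
prediction = predict(model, [0.8, 0.2])
```

Because each attribute is modelled as a continuous value, the numbers fed in at apply time need to be on the same scale as the ones used for training.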

For 2: There are two things to consider:

1. Which attributes to create? The Tokenize operator creates an attribute for every word in your documents that is not pruned. In the apply phase you do not want to create attributes for words that were not in the training set, and vice versa. So this is similar to the preprocessing model in Nominal to Numerical.

2. TF-IDF contains some normalization. This needs to be applied in the apply phase as well.
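The two points above can be sketched roughly as follows in Python. This is only an illustration of the idea, not RapidMiner's Word List format or its exact weighting: `build_word_list`, `vectorize`, and the textbook `tf * log(N / df)` formula are assumptions for the example.

```python
import math
from collections import Counter


def build_word_list(train_docs):
    """Store, for each training word, how many training documents contain it."""
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.lower().split()))
    return {"n_docs": len(train_docs), "doc_freq": dict(doc_freq)}


def vectorize(doc, word_list):
    """TF-IDF over the stored words only, using the stored training statistics."""
    counts = Counter(doc.lower().split())
    total = sum(counts.values()) or 1
    vector = {}
    for word, df in sorted(word_list["doc_freq"].items()):
        idf = math.log(word_list["n_docs"] / df)  # comes from training, not from `doc`
        vector[word] = (counts[word] / total) * idf
    return vector


train_docs = ["good movie", "bad movie", "good plot"]
word_list = build_word_list(train_docs)

# "film" was never seen in training, so it gets no attribute.
new_vector = vectorize("good film", word_list)
```

In this sketch a single new document gets the same attributes as the training set, and its weights depend only on the stored training statistics, not on whatever other documents are scored with it.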

cheers,

Martin

Dortmund, Germany

I still do not see how TF-IDF can be applied as input to the "Apply Model" operator. I will try to explain myself:

If I understand TF-IDF correctly, it makes sense to calculate it when dealing with several (the more the better) documents. TF can be calculated for a single document, but IDF takes into account the rest of the documents of the corpus. So TF-IDF values will vary depending on the entire corpus.

If this is correct, there are several scenarios where applying tf-idf is not a good option, for example:

a) I want to classify only one comment (all the attribute (word) values in the TF-IDF matrix will be 0).

b) If I change the corpus, the TF-IDF values for one comment will change, so the classification could (and probably will) change.
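Both scenarios can be reproduced with a small Python sketch, assuming the textbook weighting `tf * log(N / df)` with the IDF estimated from whatever documents are passed in. The helper `tfidf` is hypothetical, and RapidMiner's own formula and normalization may differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Textbook TF-IDF where the IDF is estimated from `docs` itself."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(len(docs) / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


# Scenario a): one comment on its own, so N = 1 and df = 1 for every word.
alone = tfidf(["great phone"])[0]

# Scenario b): the same comment inside a larger corpus gets different values.
in_corpus = tfidf(["great phone", "bad phone", "great battery"])[0]
```

Under this assumption `alone` contains only zeros, while `in_corpus` does not, which is exactly the corpus dependence described above.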

I have run some tests and have seen that using the Word List as input for "Process Documents" makes the "Apply Model" operator change its output. I am not sure, but it seems that when the Word List is used as input, the output (classification) is the same regardless of how the vectors were created (TF, TF-IDF, etc.).

Can you help me clarify this?

Thanks.