# Language modeling with RapidMiner

Hello,

I am trying to process a collection of text documents in the following manner:

1. For each document d compute a term frequency vector fv;

2. Transform the term frequencies in fv into probabilities to obtain a term probability vector pv;

3. Use a precomputed language model lm to compute the Kullback-Leibler divergence between pv and lm.

I completed step 1 successfully. Step 2 is less clear: how do I get the count of all terms in fv? Step 3 is totally unclear. My language model has a vertical layout, i.e. entries are roughly of the form [term][probability of occurrence]. The probability vector, in turn, has a horizontal layout: (p_of_term_1, ..., p_of_term_n).

Any suggestions?

## Answers

This might help.

http://rapidminernotes.blogspot.co.uk/2011/11/normalizing-rows.html

Andrew
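For readers who want to check step 2 outside RapidMiner, the arithmetic is just dividing each term count by the document's total count. Below is a minimal plain-Python sketch of that idea. The function name `to_probabilities` and the sample counts are hypothetical, and this is not how any RapidMiner operator is implemented.

```python
def to_probabilities(term_counts):
    """Turn raw term counts for one document into probabilities summing to 1."""
    total = sum(term_counts.values())  # count of all terms in the document
    if total == 0:
        raise ValueError("document contains no terms")
    return {term: count / total for term, count in term_counts.items()}


# Hypothetical term frequency vector fv for a single document.
fv = {"data": 3, "mining": 1, "text": 4}
pv = to_probabilities(fv)
```

Only numeric term columns should go into the total; anything else stored alongside the counts (a label, a document id) has to be kept out of the sum.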

Thanks! Your suggestion helped me complete step 2. There was one difficulty, though: my label attribute ended up mixed in with the probabilities after 'Transpose', so the attribute statistics couldn't be computed properly and all sorts of other strange things happened. Hope this helps someone else.

Regarding step 3, I guess the 'Transpose' operator can be used to join the document term probability vector with the language model, using the n-gram text as the key. Neat! Hopefully there will be no transposition problems caused by the large number of n-grams (my largest language model has around 92,000 n-grams).
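To make the target of step 3 concrete, here is a rough plain-Python sketch of the computation once both sides are keyed by the n-gram text. Several things here are assumptions and not part of the thread: the direction D(pv || lm), the natural logarithm, the `floor` value used for n-grams missing from the language model (a crude stand-in for proper smoothing), and all names and sample numbers.

```python
import math


def kl_divergence(pv, lm, floor=1e-10):
    """Return D_KL(pv || lm) in nats for two term -> probability mappings.

    Terms with zero probability in pv contribute nothing. Terms missing
    from lm are given the probability `floor` (an assumption, see above).
    """
    total = 0.0
    for term, p in pv.items():
        if p <= 0.0:
            continue
        q = lm.get(term, floor)  # the lookup by n-gram text plays the role of the join
        total += p * math.log(p / q)
    return total


# Hypothetical language model in vertical layout: one (term, probability) row each.
lm_rows = [("data", 0.25), ("mining", 0.25), ("text", 0.5)]
lm = dict(lm_rows)

# Hypothetical document probability vector, already keyed by term.
pv = {"data": 0.5, "text": 0.5}

divergence = kl_divergence(pv, lm)
```

Keeping both sides in the vertical (term, probability) layout means each document only touches the n-grams it actually contains, so a language model with tens of thousands of n-grams is not a problem for this step.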

Thanks again and have a nice evening!

> I had my label attribute mixed with the probabilities after 'Transpose' and therefore attribute statistics couldn't be computed properly and all sorts of other strange things happened. Hope this helps someone else.

Is the label attribute set to the label role? If not, use the Set Role operator to assign it the correct role.

I hope this helps.