# TermFrequency: how are WV values normalized?

miwahattori
Hello all,


I'm trying to understand the word vectors generated by the TextInput operator using Term Frequency. I believe the value is based on the normalized number of occurrences of terms in the document. How is this done?

To illustrate my question, I have a minimal process below. The directory specified in the text parameter contains just __one__ .txt document, which reads: *The word this occurs twice in this document.*

(I'm aware that using TFIDF will return 0's in the word vector due to log(1/1)=0 in this setup, but TermFrequency should be ok.)

    <operator name="Root" class="Process" expanded="yes">
      <operator name="TextInput" class="TextInput" expanded="yes">
        <list key="texts">
          <parameter key="ClassLabelTest1" value="C:\Program Files\Downloads\RapidMiner\rm_workspace\20051016 TEST Seg1 Depth1 wRule\Test with one text even shorter"/>
        </list>
        <parameter key="default_content_language" value="english"/>
        <parameter key="vector_creation" value="TermFrequency"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
      </operator>
    </operator>

The resulting ExampleSet (rounded) is

| Row Number | ID | Label | The | word | this | occurs | twice | in | document |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | ClassLabelTest1 | 0.316 | 0.316 | 0.632 | 0.316 | 0.316 | 0.316 | 0.316 |

Noting that 0.316 (before rounding) = 1/(square root of 10) and 0.632... = 2/(square root of 10), WV[term i] appears to be the raw number of occurrences of term i divided by the square root of 10.

Perhaps I was expecting 8 (the number of occurrences of all terms in the document) in the denominator. If someone could provide a formula for the denominator or explain where this square root of 10 comes from, I would greatly appreciate it.

Thank you so much in advance,

Miwa

## Answers

Each term vector generated using Term Frequency is normalized so that its length (L2 norm) equals 1.
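For the document in the question, the raw counts are 1 for six of the terms and 2 for "this", so the L2 norm is sqrt(6·1² + 2²) = sqrt(10), which is where the denominator comes from. The following is a minimal Python sketch of that arithmetic, not RapidMiner's actual implementation; the function name `term_frequency_vector` and the simple whitespace tokenizer are assumptions made only for illustration.

```python
import math
from collections import Counter


def term_frequency_vector(text):
    """Return {term: raw count / L2 norm of the count vector}."""
    # Assumed tokenizer: split on whitespace and strip surrounding
    # punctuation. Case is kept, so "The" and "this" stay distinct,
    # matching the columns of the posted ExampleSet.
    tokens = [t.strip(".,;:!?") for t in text.split()]
    counts = Counter(t for t in tokens if t)

    # L2 norm of the raw counts: sqrt(sum of squared counts).
    norm = math.sqrt(sum(c * c for c in counts.values()))

    # Dividing every count by the norm gives a vector of length 1.
    return {term: c / norm for term, c in counts.items()}


doc = "The word this occurs twice in this document."
vector = term_frequency_vector(doc)
for term, value in vector.items():
    print(f"{term}: {value:.3f}")
```

Under these assumptions, "this" comes out as 2/sqrt(10) and every other term as 1/sqrt(10), and the squared values sum to 1. Dividing by 8 instead would correspond to L1 normalization (the values would sum to 1), which is not what the reported numbers show.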

Greetings,

Sebastian