# TermFrequency : how are WV values normalized?

edited November 2018 in Help
Hello all,
I'm trying to understand the word vectors generated by the TextInput operator when vector_creation is set to TermFrequency. I believe each value is the number of occurrences of a term in the document, normalized in some way. How exactly is the normalization computed?

To illustrate my question, I have a minimal process below. The directory specified in the texts parameter contains a single .txt document that reads: "The word this occurs twice in this document."

(I'm aware that using TFIDF will return 0s in the word vector due to log(1/1)=0 in this setup, but TermFrequency should be ok.)

    <operator name="Root" class="Process" expanded="yes">
        <operator name="TextInput" class="TextInput" expanded="yes">
            <list key="texts">
                <parameter key="ClassLabelTest1" value="C:\Program Files\Downloads\RapidMiner\rm_workspace\20051016 TEST Seg1 Depth1 wRule\Test with one text even shorter"/>
            </list>
            <parameter key="default_content_language" value="english"/>
            <parameter key="vector_creation" value="TermFrequency"/>
            <list key="namespaces">
            </list>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
        </operator>
    </operator>

The resulting ExampleSet (rounded) is:

| Row Number | ID | Label | The | word | this | occurs | twice | in | document |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | ClassLabelTest1 | 0.316 | 0.316 | 0.632 | 0.316 | 0.316 | 0.316 | 0.316 |

Before rounding, 0.316... = 1/sqrt(10) and 0.632... = 2/sqrt(10), so WV[term i] appears to be the raw number of occurrences of term i divided by sqrt(10).

Perhaps I was expecting 8 (the number of occurrences of all terms in the document) in the denominator.  If someone could provide a formula for the denominator or explain where this square root of 10 comes from, I would greatly appreciate it.
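
To make the arithmetic concrete, here is a minimal Python sketch that counts the terms in the test sentence and compares two candidate denominators: the total number of occurrences (8) and the square root of the sum of squared counts. The tokenization (whitespace split, case-sensitive, trailing period stripped) is my own assumption and not necessarily what StringTokenizer does. It only shows which denominator reproduces the numbers above; it does not confirm what TextInput computes internally.

```python
from collections import Counter
import math

text = "The word this occurs twice in this document."

# Assumption: whitespace tokenization, case-sensitive, trailing period removed.
tokens = text.rstrip(".").split()
counts = Counter(tokens)

# Candidate denominator 1: total number of term occurrences in the document.
total_occurrences = sum(counts.values())

# Candidate denominator 2: Euclidean (L2) length of the raw count vector.
sum_of_squares = sum(c * c for c in counts.values())
l2_norm = math.sqrt(sum_of_squares)

by_total = {term: c / total_occurrences for term, c in counts.items()}
by_l2 = {term: c / l2_norm for term, c in counts.items()}

for term, c in counts.items():
    print(f"{term:10s} count={c}  count/total={by_total[term]:.3f}  count/L2={by_l2[term]:.3f}")
```

Under these assumptions the squared counts sum to 10 (six terms occurring once plus "this" occurring twice: 6 * 1 + 4), so dividing by the Euclidean length of the count vector gives the 0.316 and 0.632 values in the ExampleSet, while dividing by 8 does not.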

Thank you so much in advance,
Miwa
