TermFrequency: how are word vector (WV) values normalized?
I'm trying to understand the word vectors generated by the TextInput operator using Term Frequency. I believe the value is based on the normalized number of occurrences of terms in the document. How is this done?
To illustrate my question, I have a minimal process below. The directory specified in the text parameter contains a single .txt document that reads: "The word this occurs twice in this document."
(I'm aware that using TFIDF will return 0's in the word vector due to log(1/1)=0 in this setup, but TermFrequency should be ok.)
<operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
        <parameter key="ClassLabelTest1" value="C:\Program Files\Downloads\RapidMiner\rm_workspace\20051016 TEST Seg1 Depth1 wRule\Test with one text even shorter"/>
        <parameter key="default_content_language" value="english"/>
        <parameter key="vector_creation" value="TermFrequency"/>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
    </operator>
</operator>
The resulting ExampleSet (rounded) is
I was expecting the denominator to be 8 (the total number of term occurrences in the document). If someone could provide the formula for the denominator, or explain where the square root of 10 comes from, I would greatly appreciate it.
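To make the arithmetic concrete, here is a small Python sketch (not RapidMiner code) comparing two candidate denominators for my test sentence. It assumes case-sensitive tokenization on non-letters with no stop-word filtering or stemming, so "The" and "this" stay separate terms. The Euclidean-length variant is only my guess at what could produce a square root of 10; I have not confirmed that this is what TextInput actually does.

```python
import math
import re
from collections import Counter

# The single test document from the post.
text = "The word this occurs twice in this document."

# Assumption: split on non-letters, keep case, no stop words or stemming.
tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]
counts = Counter(tokens)

# Candidate 1: total number of term occurrences (what I expected).
total = sum(counts.values())

# Candidate 2: Euclidean length of the raw count vector (my guess).
l2 = math.sqrt(sum(c * c for c in counts.values()))

tf_by_total = {term: c / total for term, c in counts.items()}
tf_by_l2 = {term: c / l2 for term, c in counts.items()}

print("total occurrences:", total)
print("sum of squared counts:", sum(c * c for c in counts.values()))
for term in counts:
    print(f"{term:10s} count={counts[term]} "
          f"count/total={tf_by_total[term]:.3f} "
          f"count/L2={tf_by_l2[term]:.3f}")
```

With these assumptions there are 8 occurrences, and the squared counts sum to 10 because "this" contributes 2 squared and each of the other six terms contributes 1. Dividing by the square root of that sum would scale the vector to unit length.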
Thank you so much in advance,