How does RapidMiner calculate Term Frequency (TF)?

el_chief Member Posts: 63 Contributor II
edited November 2018 in Help
It doesn't seem to be #term occurrences / #words in document

For example, in this pretend document "safe safe horizontal counterexample tape tape tape occassion"

I get the following term frequencies:

counterexample = .250
horizontal = .250
occassion = .250
safe = .500
tape = .750

Thanks

Neil

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello Neil

    If you add another "tape", does the score change to 1.0?

    If you then add another new word, "fred" for example, does the score change to 0.8?

    If so, then I reckon the denominator is the number of unique words excluding the word in the numerator.

    Andrew
  • el_chief Member Posts: 63 Contributor II
    Hi awchisholm,

    Thanks for the message.

    I changed the document to "safe safe horizontal counterexample tape tape tape occassion tape"

    TF is now

    counterexample .209
    horizontal .209
    occassion .209
    safe .417
    tape .834

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello Neil,

    How strange. I repeated your experiment and I reckon the denominator is the square root of the sum of the squared counts of each unique word. The numerator is the count of the word being considered.

    So in the second example, the sum of the squared counts is 23, i.e. 4+1+1+16+1, and dividing by the square root of 23 gives the TF values 0.209, 0.417 and 0.834.

    Adding another "tape" makes the values 0.177, 0.354 and 0.884, which corresponds to a denominator of the square root of 32.


    regards

    Andrew
  • el_chief Member Posts: 63 Contributor II
    Nice work, Andrew,

    Can someone at R-I confirm this?

    Thanks!

    Neil
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    here is the source code of the term frequency calculation:

            int numTerms = wordList.size();

            // Total number of term occurrences in the document
            double totalTermNumber = 0;
            for (float value : frequencies) {
                totalTermNumber += value;
            }

            // Create the result structure
            double[] wv = new double[numTerms];

            // If document contains at least one term
            if (totalTermNumber > 0) {
                // Create the vector of relative term frequencies
                double length = 0.0;
                for (int i = 0; i < wv.length; i++) {
                    wv[i] = frequencies[i] / totalTermNumber;
                    length += wv[i] * wv[i];
                }

                // Euclidean (L2) length of the vector
                length = Math.sqrt(length);

                // Normalize the vector
                if (length > 0.0) {
                    for (int i = 0; i < wv.length; i++) {
                        wv[i] = wv[i] / length;
                    }
                }
            }
            return wv;

    As you can see, the "expected" term frequency, that is, the number of occurrences of the term in the document divided by the total number of terms in the document, is calculated as

                    wv[i] = frequencies[i] / totalTermNumber;

    After that, we normalize the calculated frequencies by dividing them by the square root of the sum of the squared frequencies of this document (the part at // Normalize the vector). Please note that this normalization is only done if "term frequency" is selected. If you select TFIDF, the "usual" term frequency is first calculated and multiplied with the IDF part before a similar normalization (dividing by the square root of the sum of the squared TFIDF values) is performed.

    Why do we normalize? Well, this normalization ensures that the L2-norm of every vector is 1. This makes the vectors better suited for comparisons and similarity calculations, and I would recommend this normalization in general over, for example, simply dividing by the maximum. By the way: with this L2-normalization, the cosine similarity simply equals the scalar product of the vectors. Therefore, this normalization is also known as cosine normalization.

    Cheers,
    Ingo
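
A minimal sketch for checking the numbers reported earlier in this thread against the calculation described above. The class and method names (`TfSketch`, `l2NormalizedTf`) are hypothetical and not part of RapidMiner; only the arithmetic follows the posted snippet. Since every relative frequency is divided by the same total, that total cancels out in the L2 normalization, which is why the result equals the raw count divided by the square root of the sum of the squared counts.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not RapidMiner's API.
final class TfSketch {

    // Returns the L2-normalized term frequencies for the given raw term counts.
    static double[] l2NormalizedTf(int[] counts) {
        double total = 0;
        for (int c : counts) {
            total += c;
        }

        double[] tf = new double[counts.length];
        if (total == 0) {
            return tf; // empty document: all zeros
        }

        // Relative frequency of each term, plus the sum of squares
        double sumOfSquares = 0.0;
        for (int i = 0; i < tf.length; i++) {
            tf[i] = counts[i] / total;
            sumOfSquares += tf[i] * tf[i];
        }

        // Divide by the Euclidean length so the vector has L2-norm 1
        double length = Math.sqrt(sumOfSquares);
        for (int i = 0; i < tf.length; i++) {
            tf[i] = tf[i] / length;
        }
        return tf;
    }

    public static void main(String[] args) {
        // Counts for: counterexample, horizontal, occassion, safe, tape
        int[] firstDocument = {1, 1, 1, 2, 3};
        System.out.println(Arrays.toString(l2NormalizedTf(firstDocument)));
        // prints [0.25, 0.25, 0.25, 0.5, 0.75]
    }
}
```

Passing the counts `{1, 1, 1, 2, 4}` (one extra "tape") gives values that round to 0.209, 0.417 and 0.834, matching the second experiment.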
  • TomDoc Member Posts: 1 Contributor I
    "If you select TFIDF, the "usual" term frequency is first calculated and multiplied with the IDF part ..."
    Could you please tell me how the "IDF part" is implemented? Is it something like IDF(t) = log(total number of documents / number of documents containing term t)?


    Thank you.
    BR
    Thomas
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Thomas,

    yes, that's the formula for the inverse document frequency (IDF).

    Best regards,
    Marius
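
Combining the IDF formula confirmed here with Ingo's description of the TFIDF option gives roughly the following. This is a sketch under assumptions, not RapidMiner's code: the names `TfIdfSketch`, `idf` and `l2NormalizedTfIdf` are hypothetical, and the natural logarithm is assumed because the thread does not state the base. The base only scales every value by the same constant, so it cancels out in the final L2 normalization.

```java
// Hypothetical helper for illustration; not RapidMiner's API.
final class TfIdfSketch {

    // IDF(t) = log(total number of documents / number of documents containing t).
    // Natural log is an assumption; the thread does not name the base.
    static double idf(int totalDocs, int docsWithTerm) {
        if (docsWithTerm <= 0 || docsWithTerm > totalDocs) {
            throw new IllegalArgumentException("need 0 < docsWithTerm <= totalDocs");
        }
        return Math.log((double) totalDocs / docsWithTerm);
    }

    // counts[i]: occurrences of term i in this document
    // docFreq[i]: number of documents that contain term i
    static double[] l2NormalizedTfIdf(int[] counts, int[] docFreq, int totalDocs) {
        if (counts.length != docFreq.length) {
            throw new IllegalArgumentException("counts and docFreq must have equal length");
        }
        double total = 0;
        for (int c : counts) {
            total += c;
        }

        double[] v = new double[counts.length];
        if (total == 0) {
            return v; // empty document: all zeros
        }

        // "Usual" term frequency multiplied with the IDF part
        double sumOfSquares = 0.0;
        for (int i = 0; i < v.length; i++) {
            v[i] = (counts[i] / total) * idf(totalDocs, docFreq[i]);
            sumOfSquares += v[i] * v[i];
        }

        // Same L2 normalization as in the plain term frequency case
        double length = Math.sqrt(sumOfSquares);
        if (length > 0.0) {
            for (int i = 0; i < v.length; i++) {
                v[i] = v[i] / length;
            }
        }
        return v;
    }

    public static void main(String[] args) {
        // Two terms in a 4-document collection (made-up numbers)
        double[] v = l2NormalizedTfIdf(new int[]{1, 3}, new int[]{1, 2}, 4);
        System.out.println(v[0] + " " + v[1]);
    }
}
```

A term that occurs in every document gets an IDF of 0 under this formula, so it contributes nothing to the vector; the `length > 0.0` guard covers the case where all terms do.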