
Understanding the Document Similarity Score

kdafoe Member Posts: 20 Maven
Hi. I have one news article pulled off the web. I manually summarized its contents by highlighting key phrases, and I also ran the article through Resoomer.com to see what it would capture as a summary. I loaded both summaries into RapidMiner, processed them using Binary Term Occurrences, and then looked at the document similarity score (using Numerical Measures and CosineSimilarity), which came out to 0.590. I interpreted this as meaning the documents have a similarity of 59%. To double-check this I counted all the words (after tokenizing, changing case, and filtering English stop words) that appeared twice or more, i.e. in both documents. This totalled 64. The total number of attributes after processing the documents is 157. Dividing 64 by 157 gives 0.407, or 41%. The words that appeared in only one of the two documents totalled 93, and 93/157 is 0.592, or 59%.
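
For reference, here is a quick check of the arithmetic above (a minimal Python sketch using only the counts reported in this post, not recomputed from the documents):

# Counts taken from the post above
shared_terms = 64          # words appearing in both summaries
unshared_terms = 93        # words appearing in only one summary
total_attributes = shared_terms + unshared_terms   # 157 attributes in the word vector

print(shared_terms / total_attributes)     # 0.4076..., the 41% figure
print(unshared_terms / total_attributes)   # 0.5923..., the 59% figure, close to the 0.590 score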

My question is: isn't the score really reporting document dissimilarity, since it appears to be driven by the words that don't appear in both documents?

Thanks for any help.

Here is the XML.

<?xml version="1.0" encoding="UTF-8"?><process version="9.8.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.8.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="9.3.001" expanded="true" height="68" name="My Summary" width="90" x="246" y="187">
        <parameter key="text" value="With each day of lockdown, daily life is boiled down a little more.&#10;&#10;Lockdown is a good time to gather information, as the Star has just done on defining workers and industries considered crucial to keeping our nation functioning.&#10;&#10;The numbers show that about 65 per cent of Toronto residents are considered essential workers, people in sectors that can remain open with some in-person staffing (some sectors including government were excluded for data collection reasons). &#10;&#10;They are more likely to have lower pay, no paid sick pay, are less unionized, more easily laid off, have fewer benefits and so on.&#10;&#10;Overwhelmingly, packed into workspaces, public transit, and small homes, and forced into contact with the public, they contract the coronavirus in greater numbers.&#10;&#10;How do we alter work so that fewer workers become used and abused, and then ill? &#10;&#10;There are bad ideas out there. Peel’s medical officer of health has recommended “the amount of non-essential items being purchased online” be restricted.&#10;&#10;Who will decide what items delivered by Amazon are non-essential?&#10;&#10;Amazon made Toronto’s lockdown survivable, in that it’s possible to stay home, to never go out. &#10;&#10;Now is the time for fresh thinking. &#10;&#10;Why is Ontario holding back federal money offered to mitigate lockdown pain? Should health care be federal? Should anti-maskers be under house arrest? Should unionization be mandatory? &#10;"/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="nominal"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="9.3.001" expanded="true" height="68" name="Resoomer Summary" width="90" x="246" y="289">
        <parameter key="text" value="With each day of lockdown, daily life is boiled down a little more. Lockdown is a good time to gather information, as the Star has just done on defining workers and industries considered crucial to keeping our nation functioning. We all have a mental map of our nation and our city but Star reporters Sara Mojtehedzadeh and Andrew Bailey have gathered data to give us a map of Toronto lockdown employment, a visual most people wouldn’t have bothered with much in days when we breathed more easily. The numbers show that about 65 per cent of Toronto residents are considered essential workers, people in sectors that can remain open with some in-person staffing .  &#10;  &#10;We think of COVID-19 as a health survival story but as time progresses it becomes more about financial survival, often the same thing. Essential workers live differently. They are more likely to have lower pay, no paid sick pay, are less unionized, more easily laid off, have fewer benefits and so on. Overwhelmingly, packed into workspaces, public transit, and small homes, and forced into contact with the public, they contract the coronavirus in greater numbers.  &#10;  &#10;This is an immense public failure, a historic sorrow that builds day by day. Peel’s medical officer of health has recommended «the amount of non-essential items being purchased online» be restricted. Loh may be a fine health officer but a poor economist. Last month, Amazon delivered to my door light bulbs, weatherstripping, a dozen books, and an OXO Good Grips Soap Squirting Dish Brush.  &#10;  &#10;Amazon is essential, which is ironic because Amazon warehouse are notoriously hellish places to work. At the moment, Amazon is fighting its biggest labour battle ever on U. Amazon is evil"/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="nominal"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="9.3.001" expanded="true" height="124" name="Process Documents" width="90" x="447" y="238">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="9.3.001" expanded="true" height="68" name="Transform Cases" width="90" x="313" y="34">
            <parameter key="transform_to" value="lower case"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="9.3.001" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="581" y="34"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="data_to_similarity" compatibility="9.8.001" expanded="true" height="82" name="Data to Similarity" width="90" x="581" y="85">
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
        <parameter key="nominal_measure" value="NominalDistance"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
        <parameter key="divergence" value="GeneralizedIDivergence"/>
        <parameter key="kernel_type" value="radial"/>
        <parameter key="kernel_gamma" value="1.0"/>
        <parameter key="kernel_sigma1" value="1.0"/>
        <parameter key="kernel_sigma2" value="0.0"/>
        <parameter key="kernel_sigma3" value="2.0"/>
        <parameter key="kernel_degree" value="3.0"/>
        <parameter key="kernel_shift" value="1.0"/>
        <parameter key="kernel_a" value="1.0"/>
        <parameter key="kernel_b" value="0.0"/>
      </operator>
      <connect from_op="My Summary" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Resoomer Summary" from_port="output" to_op="Process Documents" to_port="documents 2"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Data to Similarity" to_port="example set"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
      <connect from_op="Data to Similarity" from_port="similarity" to_port="result 2"/>
      <connect from_op="Data to Similarity" from_port="example set" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
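
For comparison, here is a rough Python equivalent of the process above (an illustrative sketch using scikit-learn; the short texts below are placeholder excerpts from the two Create Document operators, and scikit-learn's default tokenizer and stop word list differ slightly from RapidMiner's, so the numbers will not match exactly):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder excerpts; paste the full texts from the two Create Document operators here.
my_summary = "With each day of lockdown, daily life is boiled down a little more."
resoomer_summary = ("With each day of lockdown, daily life is boiled down a little more. "
                    "Lockdown is a good time to gather information.")

# binary=True mirrors Binary Term Occurrences; lowercasing and English stop word
# filtering mirror the Transform Cases and Filter Stopwords (English) operators.
vectorizer = CountVectorizer(binary=True, lowercase=True, stop_words="english")
vectors = vectorizer.fit_transform([my_summary, resoomer_summary])

print(cosine_similarity(vectors[0], vectors[1]))  # comparable to the 0.590 from Data to Similarity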

Best Answer

    Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted
    Cosine similarity is not the similarity measure that I think you are trying to calculate.  Instead, you should use a metric that directly compares the two values for each attribute (which correspond to the tokens from your text process).  If you use the Jaccard similarity measure, you will find it is exactly 0.414, which matches what you calculated manually (in your process RapidMiner says there are 65 common terms out of 157, not 64).  I would have to dig into the underlying math further, but with binary attribute sets like these the cosine value may simply happen to land near your dissimilarity figure.  Cosine similarity is based on the angle between the two term vectors, not on Cartesian/Euclidean distance or direct value comparisons (see the sketch below).
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
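
A minimal sketch of the two measures on binary term vectors, using the counts mentioned in this thread (65 shared terms, 157 attributes in total; the per-document term counts are not given in the thread, so the ones below are hypothetical stand-ins):

import math

# Counts from the thread: 65 terms common to both summaries, 157 distinct terms overall.
shared = 65
union = 157

# Jaccard similarity on binary vectors: shared terms / all distinct terms.
print(shared / union)   # 0.4140..., the value reported with JaccardSimilarity

# Cosine similarity on binary vectors: shared terms / sqrt(terms in doc 1 * terms in doc 2).
# The per-document counts are hypothetical, chosen only so that doc1 + doc2 - shared == 157.
terms_doc1, terms_doc2 = 98, 124
print(shared / math.sqrt(terms_doc1 * terms_doc2))   # roughly 0.59 with these stand-in counts

On these formulas, the fact that the cosine value sits near the 59% non-shared fraction looks like a numeric coincidence rather than cosine reporting dissimilarity; it is still a similarity, just one that weights the overlap by the size of each document's vocabulary.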
