Text Mining - Documents Similarity (words position)

silviabastos · February 2018

Hello,

I'm looking for a way to measure the similarity between documents where the position of the words is relevant.
I've already implemented the sample with the "Data Similarity" operator (CosineSimilarity), as in:
https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/How-to-compare-similarity-of-large-number-of-documents/td-p/16002
But I need to take into account the order/position of words, not only their frequency or occurrence.
For example:
Example 1: A B C D E F G
Example 2: A X B D Y F G
Example 3: G F E A B C D

Examples 1 and 2 should be more similar than Examples 1 and 3: although Example 3 has exactly the same words as Example 1 (CosineSimilarity=1), they are in different positions. Example 2 has only two different words (X, Y) and one other word that has moved but is still near its original position...

I think this is a difficult problem to explain, and I'm not sure whether RapidMiner can give me a solution.
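
To make the point concrete, here is a minimal Python sketch (not a RapidMiner process) of a bag-of-words cosine similarity on the three examples. The function name and the whitespace tokenization are assumptions chosen for illustration only.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two texts using raw word counts (hypothetical helper)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

ex1 = "A B C D E F G"
ex2 = "A X B D Y F G"
ex3 = "G F E A B C D"

sim_12 = cosine_similarity(ex1, ex2)  # 5 shared words out of 7, about 0.71
sim_13 = cosine_similarity(ex1, ex3)  # same words in another order, about 1.0

print(f"Example 1 vs 2: {sim_12:.3f}")
print(f"Example 1 vs 3: {sim_13:.3f}")
```

Because word counts ignore order, Example 3 comes out as identical to Example 1, which is exactly the behaviour that needs to be avoided here.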

Best regards,
Silvia

Telcontar120 · February 2018

Instead of tokenizing your documents, you may want to take a look at "Data to Similarity", which allows the computation of various types of nominal distances between entities. I am not familiar with all the details of several of those distance metrics (Dice, Jaccard, Tanimoto, etc.), but it is possible that one or more of them is suitable for your purposes.

yyhuang · February 2018

Hi @silviabastos

This is a great question. To 'remember' the location of the key words, you can use "Generate n-Grams" for phrase search with a max term length of 7 or more. Of course, it will need more time for text processing.

Suppose you do not have many words in each document, ideally just like the examples shown in your message. We have three documents as simple as
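
As a rough illustration of the n-gram idea outside RapidMiner, the following Python sketch builds word n-grams and compares the resulting sets with a Jaccard overlap. The helper names, the choice of n=2 and the Jaccard measure are assumptions for this example, not what the n-gram operator does internally.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) in a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard overlap of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

ex1 = "A B C D E F G"
ex2 = "A X B D Y F G"
ex3 = "G F E A B C D"

n = 2  # illustrative; longer n-grams are stricter about word order
bigram_12 = jaccard(word_ngrams(ex1, n), word_ngrams(ex2, n))
bigram_13 = jaccard(word_ngrams(ex1, n), word_ngrams(ex3, n))

print(f"Example 1 vs 2 (bigrams): {bigram_12:.3f}")
print(f"Example 1 vs 3 (bigrams): {bigram_13:.3f}")
```

On these toy examples the bigrams score Example 3 closer to Example 1 than Example 2, because the run A B C D survives intact in Example 3. N-grams capture local word order rather than absolute position, so it is worth checking the scores on real documents before relying on them.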

A B C D E F G

A X B D Y F G

G F E A B C D

You can use the Levenshtein distance offered in Dr. Martin Schmitz's Operator Toolbox extension: https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_operator_toolbox

The Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) needed to convert one string into the other. A common use case for this distance is spell checking.

Here is the XML of my process. HTH!
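
For readers who want to see the computation itself, here is a minimal Python sketch of the standard dynamic-programming Levenshtein distance. As an assumption for this example it is applied to word tokens rather than characters, so each word counts as one unit; the function name is illustrative and this is not the code of the Operator Toolbox operator.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i items of a
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

ex1 = "A B C D E F G".split()
ex2 = "A X B D Y F G".split()
ex3 = "G F E A B C D".split()

d12 = levenshtein(ex1, ex2)
d13 = levenshtein(ex1, ex3)

print("Example 1 vs 2:", d12)  # 3
print("Example 1 vs 3:", d13)  # 6
```

Under this word-level assumption, Example 2 is closer to Example 1 (3 edits) than Example 3 is (6 edits), which matches the ordering asked for in the question. The table has one cell per pair of tokens, so two documents of about 900 words each need roughly 900 x 900 cells.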

YY

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
        <parameter key="text" value="A B C D E F G "/>
        <description align="center" color="transparent" colored="false" width="126">A B C D E F G</description>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document (2)" width="90" x="112" y="187">
        <parameter key="text" value="A X B D Y F G"/>
        <description align="center" color="transparent" colored="false" width="126">A X B D Y F G</description>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document (3)" width="90" x="112" y="340">
        <parameter key="text" value="G F E A B C D"/>
        <description align="center" color="transparent" colored="false" width="126">G F E A B C D</description>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="124" name="Documents to Data" width="90" x="313" y="34">
        <parameter key="text_attribute" value="string"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="8.1.000" expanded="true" height="82" name="Generate ID" width="90" x="447" y="34"/>
      <operator activated="true" class="multiply" compatibility="8.1.000" expanded="true" height="103" name="Multiply" width="90" x="581" y="34"/>
      <operator activated="true" class="cross_distances" compatibility="8.1.000" expanded="true" height="103" name="Cross Distances" width="90" x="715" y="238"/>
      <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="Filter Examples" width="90" x="916" y="34">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="distance.ne.0"/>
        </list>
      </operator>
      <operator activated="true" class="concurrency:join" compatibility="8.1.000" expanded="true" height="82" name="Join" width="90" x="1050" y="136">
        <parameter key="join_type" value="left"/>
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="document" value="id"/>
        </list>
      </operator>
      <operator activated="true" class="rename" compatibility="8.1.000" expanded="true" height="82" name="Rename" width="90" x="1184" y="136">
        <parameter key="old_name" value="string"/>
        <parameter key="new_name" value="document string"/>
        <list key="rename_additional_attributes"/>
      </operator>
      <operator activated="true" class="concurrency:join" compatibility="8.1.000" expanded="true" height="82" name="Join (2)" width="90" x="1251" y="289">
        <parameter key="join_type" value="left"/>
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="request" value="id"/>
        </list>
      </operator>
      <operator activated="true" class="rename" compatibility="8.1.000" expanded="true" height="82" name="Rename (2)" width="90" x="1385" y="289">
        <parameter key="old_name" value="string"/>
        <parameter key="new_name" value="request string"/>
        <list key="rename_additional_attributes"/>
      </operator>
      <operator activated="true" class="operator_toolbox:levenshtein_distance" compatibility="0.9.000" expanded="true" height="82" name="Generate Levenshtein Distance" width="90" x="1519" y="289">
        <parameter key="First Attribute for Distance Calculation" value="document string"/>
        <parameter key="Second Attribute for Distance Calculation" value="request string"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Create Document (2)" from_port="output" to_op="Documents to Data" to_port="documents 2"/>
      <connect from_op="Create Document (3)" from_port="output" to_op="Documents to Data" to_port="documents 3"/>
      <connect from_op="Documents to Data" from_port="example set" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Cross Distances" to_port="request set"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Cross Distances" to_port="reference set"/>
      <connect from_op="Cross Distances" from_port="result set" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Cross Distances" from_port="request set" to_op="Join" to_port="right"/>
      <connect from_op="Cross Distances" from_port="reference set" to_op="Join (2)" to_port="right"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Join" to_port="left"/>
      <connect from_op="Join" from_port="join" to_op="Rename" to_port="example set input"/>
      <connect from_op="Rename" from_port="example set output" to_op="Join (2)" to_port="left"/>
      <connect from_op="Join (2)" from_port="join" to_op="Rename (2)" to_port="example set input"/>
      <connect from_op="Rename (2)" from_port="example set output" to_op="Generate Levenshtein Distance" to_port="exa"/>
      <connect from_op="Generate Levenshtein Distance" from_port="out" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="252"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

silviabastos · February 2018

Hi!

I will try both options.

Regarding @yyhuang's solution: I only wrote a small example in the first post. The texts I'm working with are natural language, about 900 words each, so I'm not sure whether I can use it.

Regarding @Telcontar120's solution: I made a first attempt, but I didn't get consistent results.

I will work on this a little more and post any problems I find.

Any other solutions are welcome.

Thank you.

yyhuang · February 2018

Hi @silviabastos

Thanks for the follow-up! Maybe you can try word2vec for documents with 900+ words?

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Trained on a single corpus, the word2vec algorithm will generate one multidimensional vector for each word. These vectors are known to carry semantic meaning that helps you understand the position and context of each word.

You can install the word2vec extension from the Marketplace.

HTH!

YY
