"Finding the most similar document(s) in a collection to a test document"

crcowan · September 2009

I have built an operator chain that compares a test document to a collection of documents in order to find the documents most similar to the test document. My original approach ran a similarity test across all documents (the collection plus the test document) and then filtered out just the results for the test document. This meant comparing the whole collection against itself, which was inefficient. I have since tried the approach recommended at the end of this thread: http://rapid-i.com/rapidforum/index.php/topic,680.msg2587.html#msg2587. Unfortunately, I am afraid I have still produced a fairly inefficient solution. Could you look at the chain below and give me some advice on how to improve it?

A couple of considerations:

I do the text input processing against the collection and the test document so that I have a consistent vocabulary for the similarity processing.
The ProcessLog currently gives me a plain text file, but I would prefer Excel or CSV output. Perhaps I can do this in the ProcessLog with some constants for quotation marks and commas.
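
To make the first consideration concrete outside RapidMiner, here is a minimal Python sketch of the idea: the vocabulary and the TF-IDF weights are built over the collection and the test document together, but cosine similarity is then computed only between the test document and each collection document, never within the collection. The function names, the whitespace tokenizer, and the exact TF-IDF formula are assumptions made for illustration; this does not reproduce the stemming, stop word filtering, or n-gram steps of the chain, and it is not how RapidMiner computes its vectors internally.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors for all docs over one shared vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_against_query(collection, query):
    """Return (collection_index, similarity) pairs, most similar first."""
    # Vectorize everything together so the query shares the vocabulary.
    vectors = tfidf_vectors(collection + [query])
    query_vec = vectors[-1]
    # Only len(collection) comparisons: the query against each document.
    scores = [(i, cosine(query_vec, vec)) for i, vec in enumerate(vectors[:-1])]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

With a collection of n documents this performs n similarity computations, whereas the all-against-all approach computes on the order of n squared and then discards most of them.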

Here is my chain:

<?xml version="1.0" encoding="windows-1252"?>
<process version="4.5">

  <operator name="Root" class="Process" expanded="yes">
      <description text=""/>
      <parameter key="logverbosity"	value="init"/>
      <parameter key="random_seed"	value="2001"/>
      <parameter key="send_mail"	value="never"/>
      <parameter key="process_duration_for_mail"	value="30"/>
      <parameter key="encoding"	value="SYSTEM"/>
      <operator name="TextInput" class="TextInput" expanded="yes">
          <list key="texts">
            <parameter key="Past Performance"	value="C:\Documents and Settings\xxxx\Desktop\Mining\PP"/>
            <parameter key="RFP"	value="C:\Documents and Settings\xxxx\Desktop\Mining\RFP"/>
          </list>
          <parameter key="default_content_type"	value=""/>
          <parameter key="default_content_encoding"	value=""/>
          <parameter key="default_content_language"	value=""/>
          <parameter key="prune_below"	value="3"/>
          <parameter key="prune_above"	value="90"/>
          <parameter key="vector_creation"	value="TFIDF"/>
          <parameter key="use_content_attributes"	value="false"/>
          <parameter key="use_given_word_list"	value="false"/>
          <parameter key="return_word_list"	value="false"/>
          <parameter key="id_attribute_type"	value="short"/>
          <list key="namespaces">
          </list>
          <parameter key="create_text_visualizer"	value="true"/>
          <parameter key="on_the_fly_pruning"	value="-1"/>
          <parameter key="extend_exampleset"	value="false"/>
          <operator name="StringTokenizer" class="StringTokenizer">
          </operator>
          <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
          </operator>
          <operator name="TokenLengthFilter" class="TokenLengthFilter">
              <parameter key="min_chars"	value="3"/>
              <parameter key="max_chars"	value="2147483647"/>
          </operator>
          <operator name="PorterStemmer" class="PorterStemmer">
          </operator>
          <operator name="TermNGramGenerator" class="TermNGramGenerator">
              <parameter key="max_length"	value="2"/>
          </operator>
      </operator>
      <operator name="Push PP and RFP" class="IOMultiplier">
          <parameter key="number_of_copies"	value="1"/>
          <parameter key="io_object"	value="ExampleSet"/>
          <parameter key="multiply_type"	value="multiply_one"/>
          <parameter key="multiply_which"	value="1"/>
      </operator>
      <operator name="Get the RFP" class="ExampleFilter">
          <parameter key="condition_class"	value="attribute_value_filter"/>
          <parameter key="parameter_string"	value="label = RFP"/>
          <parameter key="invert_filter"	value="false"/>
      </operator>
      <operator name="Store the RFP" class="IOStorer">
          <parameter key="name"	value="RFP"/>
          <parameter key="io_object"	value="ExampleSet"/>
          <parameter key="store_which"	value="1"/>
          <parameter key="remove_from_process"	value="true"/>
      </operator>
      <operator name="Get the PP Docs" class="ExampleFilter">
          <parameter key="condition_class"	value="attribute_value_filter"/>
          <parameter key="parameter_string"	value="label = RFP"/>
          <parameter key="invert_filter"	value="true"/>
      </operator>
      <operator name="DataMacroDefinition" class="DataMacroDefinition">
          <parameter key="macro"	value="NumXmp"/>
          <parameter key="macro_type"	value="number_of_examples"/>
          <parameter key="statistics"	value="average"/>
      </operator>
      <operator name="Store the PP Docs" class="IOStorer">
          <parameter key="name"	value="PP"/>
          <parameter key="io_object"	value="ExampleSet"/>
          <parameter key="store_which"	value="1"/>
          <parameter key="remove_from_process"	value="true"/>
      </operator>
      <operator name="IteratingOperatorChain" class="IteratingOperatorChain" expanded="yes">
          <parameter key="iterations"	value="%{NumXmp}"/>
          <parameter key="timeout"	value="-1"/>
          <operator name="Get All PP" class="IORetriever">
              <parameter key="name"	value="PP"/>
              <parameter key="io_object"	value="ExampleSet"/>
              <parameter key="remove_from_store"	value="false"/>
          </operator>
          <operator name="Filter to Current PP" class="ExampleRangeFilter">
              <parameter key="first_example"	value="%{a}"/>
              <parameter key="last_example"	value="%{a}"/>
              <parameter key="invert_filter"	value="false"/>
          </operator>
          <operator name="Get RFP" class="IORetriever">
              <parameter key="name"	value="RFP"/>
              <parameter key="io_object"	value="ExampleSet"/>
              <parameter key="remove_from_store"	value="false"/>
          </operator>
          <operator name="Combine PP and RFP" class="ExampleSetMerge">
              <parameter key="merge_type"	value="first_two"/>
              <parameter key="datamanagement"	value="double_array"/>
          </operator>
          <operator name="ExampleSet2Similarity" class="ExampleSet2Similarity">
              <parameter key="keep_example_set"	value="true"/>
              <parameter key="measure_types"	value="NumericalMeasures"/>
              <parameter key="mixed_measure"	value="MixedEuclideanDistance"/>
              <parameter key="nominal_measure"	value="NominalDistance"/>
              <parameter key="numerical_measure"	value="CosineSimilarity"/>
              <parameter key="divergence"	value="GeneralizedIDivergence"/>
              <parameter key="kernel_type"	value="radial"/>
              <parameter key="kernel_gamma"	value="1.0"/>
              <parameter key="kernel_sigma1"	value="1.0"/>
              <parameter key="kernel_sigma2"	value="0.0"/>
              <parameter key="kernel_sigma3"	value="2.0"/>
              <parameter key="kernel_degree"	value="3.0"/>
              <parameter key="kernel_shift"	value="1.0"/>
              <parameter key="kernel_a"	value="1.0"/>
              <parameter key="kernel_b"	value="0.0"/>
          </operator>
          <operator name="Similarity2ExampleSet" class="Similarity2ExampleSet">
              <parameter key="table_type"	value="long_table"/>
          </operator>
          <operator name="Get ID" class="Data2Log">
              <parameter key="attribute_name"	value="SECOND_ID"/>
              <parameter key="example_index"	value="1"/>
          </operator>
          <operator name="Get Similarity" class="Data2Log">
              <parameter key="attribute_name"	value="SIMILARITY"/>
              <parameter key="example_index"	value="2"/>
          </operator>
          <operator name="ProcessLog" class="ProcessLog">
              <parameter key="filename"	value="C:\Documents and Settings\xxxx\Desktop\test.log"/>
              <list key="log">
                <parameter key="ID"	value="operator.Get ID.value.data_value"/>
                <parameter key="Similarity"	value="operator.Get Similarity.value.data_value"/>
              </list>
              <parameter key="sorting_type"	value="none"/>
              <parameter key="sorting_k"	value="100"/>
              <parameter key="persistent"	value="true"/>
          </operator>
          <operator name="Get rid of current example set" class="IOConsumer">
              <parameter key="io_object"	value="ExampleSet"/>
              <parameter key="deletion_type"	value="delete_one"/>
              <parameter key="delete_which"	value="1"/>
              <parameter key="except"	value="1"/>
          </operator>
      </operator>
  </operator>

</process>

I thank you very much in advance for your thoughts and recommendations.

Charles

land · September 2009

Hi Charles,
Although you can do nearly everything with a good choice of operators, such solutions are usually far from efficient. If you want an efficient solution, our InformationRetrieval plugin might be worth a try. Unfortunately it is not finished yet, but it already contains an operator that calculates the distances between each example of a first example set and each example of a second. It then returns the k nearest examples and their distances.
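
The kind of operation described here, finding for each example of one set the k nearest examples of another set together with their distances, can be sketched in a few lines of Python. This is only an illustration of the general technique under stated assumptions: the function name `k_nearest`, the use of Euclidean distance, and the plain lists of numbers are all hypothetical and say nothing about how the plugin is actually implemented.

```python
import heapq
import math


def k_nearest(queries, references, k):
    """For each query vector, return the k closest reference vectors
    as (reference_index, distance) pairs, nearest first."""
    result = []
    for q in queries:
        # Distances from this query to every reference vector only;
        # references are never compared with each other.
        distances = ((i, math.dist(q, r)) for i, r in enumerate(references))
        result.append(heapq.nsmallest(k, distances, key=lambda pair: pair[1]))
    return result
```

For a single test document, `queries` would hold one vector and `references` the vectors of the collection, so the work grows linearly with the size of the collection.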

If you are interested, we might think about a reduced-price pre-release version.
If you want to write the similarities into an Excel file, just apply Similarity2ExampleSet and write the resulting example set into an Excel file using the ExcelExampleSetWriter.
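
If CSV is acceptable instead of Excel, the long-table layout can also be written by hand without assembling quotation marks and commas manually. The following Python sketch assumes the similarities are available as (first id, second id, similarity) rows; the function name and the `FIRST_ID` header are assumptions for illustration, while `SECOND_ID` and `SIMILARITY` follow the attribute names used in the chain above.

```python
import csv
import io


def write_similarities(rows, stream):
    """Write (first_id, second_id, similarity) rows as CSV to a text stream."""
    # QUOTE_NONNUMERIC quotes the string ids and leaves the numbers bare.
    writer = csv.writer(stream, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(["FIRST_ID", "SECOND_ID", "SIMILARITY"])
    writer.writerows(rows)


# Usage example with an in-memory buffer and made-up ids.
buffer = io.StringIO()
write_similarities([("RFP", "PP_1", 0.42)], buffer)
```

To write to disk, pass a file opened with `open(path, "w", newline="")` instead of the in-memory buffer; the resulting file opens directly in Excel.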

Greetings,
Sebastian
