Filtering of documents using document similarity

kdafoe · June 2022

I have 100 comments that I process and run Document Similarity against. That works great, but I'm only interested in documents containing specific words, say "workload". So I filter the example set, rerun the process, and Document Similarity gives me results for only the documents containing "workload". Perfect. The problem is that filtering also renumbers the documents within the filtered set, so I no longer know their original document IDs. This makes it very difficult to find the originals, because a 0.73 similarity between doc IDs 1 and 17 in the filtered set does not map to documents 1 and 17 in the original, unfiltered data set.

Is there a way to keep the original IDs in the filtered dataset?

BalazsBarany · June 2022

Hi!

After filtering, you could use Generate ID. It adds exactly the same numbers (the document numbers) to the example set as a new attribute. You can then use Remember or Store to keep a copy of the data with the IDs, and Join to get the matching document text back.

Here's an example process.

<?xml version="1.0" encoding="UTF-8"?><process version="9.10.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.008" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="-1"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="9.10.008" expanded="true" height="82" name="Get example data" width="90" x="45" y="34">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.10.008" expanded="true" height="68" name="Retrieve EmployeeData" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//Community Samples/Community Real World Use Cases/Employee Attrition/EmployeeData"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.10.008" expanded="true" height="82" name="Length of CanDoBetter" width="90" x="179" y="34">
            <list key="function_descriptions">
              <parameter key="lenCanDB" value="length(CanDoBetter)"/>
            </list>
            <parameter key="keep_all" value="true"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="9.10.008" expanded="true" height="103" name="Filter too short" width="90" x="313" y="34">
            <parameter key="parameter_expression" value=""/>
            <parameter key="condition_class" value="custom_filters"/>
            <parameter key="invert_filter" value="false"/>
            <list key="filters_list">
              <parameter key="filters_entry_key" value="lenCanDB.ge.25"/>
            </list>
            <parameter key="filters_logic_and" value="true"/>
            <parameter key="filters_check_metadata" value="true"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="9.10.008" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="CanDoBetter"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="9.10.008" expanded="true" height="82" name="Nominal to Text" width="90" x="581" y="34">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <connect from_op="Retrieve EmployeeData" from_port="output" to_op="Length of CanDoBetter" to_port="example set input"/>
          <connect from_op="Length of CanDoBetter" from_port="example set output" to_op="Filter too short" to_port="example set input"/>
          <connect from_op="Filter too short" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="9.4.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="179" y="34">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="select_attributes_and_weights" value="false"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="9.4.000" expanded="true" height="68" name="Transform Cases" width="90" x="112" y="34">
            <parameter key="transform_to" value="lower case"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="9.4.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="generate_id" compatibility="9.10.008" expanded="true" height="82" name="Generate ID" width="90" x="313" y="34">
        <parameter key="create_nominal_ids" value="false"/>
        <parameter key="offset" value="0"/>
      </operator>
      <operator activated="true" class="remember" compatibility="9.10.008" expanded="true" height="68" name="Remember data with IDs" width="90" x="447" y="34">
        <parameter key="name" value="withId"/>
        <parameter key="io_object" value="ExampleSet"/>
        <parameter key="store_which" value="1"/>
        <parameter key="remove_from_process" value="true"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="9.10.008" expanded="true" height="103" name="Split Data" width="90" x="112" y="187">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.5"/>
          <parameter key="ratio" value="0.5"/>
        </enumeration>
        <parameter key="sampling_type" value="shuffled sampling"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="true" class="cross_distances" compatibility="9.10.008" expanded="true" height="103" name="Cross Distances" width="90" x="246" y="187">
        <parameter key="measure_types" value="MixedMeasures"/>
        <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
        <parameter key="nominal_measure" value="NominalDistance"/>
        <parameter key="numerical_measure" value="EuclideanDistance"/>
        <parameter key="divergence" value="GeneralizedIDivergence"/>
        <parameter key="kernel_type" value="radial"/>
        <parameter key="kernel_gamma" value="1.0"/>
        <parameter key="kernel_sigma1" value="1.0"/>
        <parameter key="kernel_sigma2" value="0.0"/>
        <parameter key="kernel_sigma3" value="2.0"/>
        <parameter key="kernel_degree" value="3.0"/>
        <parameter key="kernel_shift" value="1.0"/>
        <parameter key="kernel_a" value="1.0"/>
        <parameter key="kernel_b" value="0.0"/>
        <parameter key="only_top_k" value="true"/>
        <parameter key="k" value="1000"/>
        <parameter key="search_for" value="nearest"/>
        <parameter key="compute_similarities" value="false"/>
      </operator>
      <operator activated="true" class="recall" compatibility="9.10.008" expanded="true" height="68" name="Recall" width="90" x="112" y="340">
        <parameter key="name" value="withId"/>
        <parameter key="io_object" value="ExampleSet"/>
        <parameter key="remove_from_store" value="false"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.10.008" expanded="true" height="82" name="Select Attributes (2)" width="90" x="246" y="340">
        <parameter key="attribute_filter_type" value="regular_expression"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="regular_expression" value="id|text"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="concurrency:join" compatibility="9.10.008" expanded="true" height="82" name="Join" width="90" x="447" y="187">
        <parameter key="remove_double_attributes" value="true"/>
        <parameter key="join_type" value="inner"/>
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="request" value="id"/>
        </list>
        <parameter key="keep_both_join_attributes" value="false"/>
      </operator>
      <operator activated="true" class="recall" compatibility="9.10.008" expanded="true" height="68" name="Recall (2)" width="90" x="380" y="391">
        <parameter key="name" value="withId"/>
        <parameter key="io_object" value="ExampleSet"/>
        <parameter key="remove_from_store" value="false"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.10.008" expanded="true" height="82" name="Select Attributes (3)" width="90" x="514" y="340">
        <parameter key="attribute_filter_type" value="regular_expression"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="regular_expression" value="id|text"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="blending:rename" compatibility="9.10.008" expanded="true" height="82" name="Rename" width="90" x="648" y="340">
        <list key="rename attributes">
          <parameter key="text" value="othertext"/>
        </list>
        <parameter key="from_attribute" value=""/>
        <parameter key="to_attribute" value=""/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.10.008" expanded="true" height="82" name="Set Role" width="90" x="715" y="238">
        <parameter key="attribute_name" value="othertext"/>
        <parameter key="target_role" value="regular"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="concurrency:join" compatibility="9.10.008" expanded="true" height="82" name="Join (2)" width="90" x="782" y="136">
        <parameter key="remove_double_attributes" value="false"/>
        <parameter key="join_type" value="inner"/>
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="document" value="id"/>
        </list>
        <parameter key="keep_both_join_attributes" value="false"/>
      </operator>
      <connect from_op="Get example data" from_port="out 1" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Remember data with IDs" to_port="store"/>
      <connect from_op="Remember data with IDs" from_port="stored" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Cross Distances" to_port="request set"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Cross Distances" to_port="reference set"/>
      <connect from_op="Cross Distances" from_port="result set" to_op="Join" to_port="left"/>
      <connect from_op="Recall" from_port="result" to_op="Select Attributes (2)" to_port="example set input"/>
      <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Join" to_port="right"/>
      <connect from_op="Join" from_port="join" to_op="Join (2)" to_port="left"/>
      <connect from_op="Recall (2)" from_port="result" to_op="Select Attributes (3)" to_port="example set input"/>
      <connect from_op="Select Attributes (3)" from_port="example set output" to_op="Rename" to_port="example set input"/>
      <connect from_op="Rename" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Join (2)" to_port="right"/>
      <connect from_op="Join (2)" from_port="join" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

It is admittedly a bit involved, but you can get the desired result with a few additional operators.
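
For readers who want to see the join-back idea outside RapidMiner, here is a minimal Python sketch of what the two Join operators do: each row of the distance table carries two IDs (`request` and `document`), and each ID is looked up in the remembered data to attach the text. The sample texts, the distance values, and the `attach_texts` helper are made up for illustration; only the column names `request`, `document`, `id`, `text`, and `othertext` come from the process above.

```python
# Hypothetical stand-in for the remembered example set: id -> text.
docs_with_id = {
    1: "too much workload in the team",
    2: "workload is unevenly distributed",
    3: "more training on new tools",
}

# Hypothetical stand-in for the distance table (request, document, distance).
pairs = [
    {"request": 1, "document": 2, "distance": 0.41},
    {"request": 1, "document": 3, "distance": 1.27},
]


def attach_texts(pairs, docs_with_id):
    """Attach the text for both IDs of each pair, like two inner joins."""
    joined = []
    for pair in pairs:
        # An inner join drops rows whose key has no match on the other side.
        if pair["request"] not in docs_with_id:
            continue
        if pair["document"] not in docs_with_id:
            continue
        joined.append({
            **pair,
            "text": docs_with_id[pair["request"]],        # first join: request = id
            "othertext": docs_with_id[pair["document"]],  # second join: document = id
        })
    return joined


result = attach_texts(pairs, docs_with_id)
for row in result:
    print(row)
```

Each output row then holds both IDs, the distance, and the two texts side by side, so a pair can be read without looking anything up by hand.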

Regards,
Balázs

kdafoe · June 2022

Thanks for your great example, Balázs. After I posted the question and thought about it a bit more, I wondered whether I could generate attributes to match the original IDs. But your solution is better and gives me more ideas to work through. Thanks again.
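
The idea of carrying the original IDs along can be sketched in a few lines of Python, assuming the documents are numbered before the keyword filter is applied rather than after it. The sample comments and the `orig_id` name are hypothetical; in RapidMiner this would correspond to assigning the ID ahead of the filter step.

```python
# Hypothetical sample comments standing in for the original data set.
comments = [
    "The workload is too high",
    "More flexible hours would help",
    "Workload peaks before every release",
]

# Number the documents first, so each row keeps its original position as an ID.
numbered = [
    {"orig_id": i, "text": text}
    for i, text in enumerate(comments, start=1)
]

# Filter afterwards; the orig_id values travel with the surviving rows.
keyword = "workload"
filtered = [row for row in numbered if keyword in row["text"].lower()]

print([row["orig_id"] for row in filtered])  # prints [1, 3]
```

Because the ID is assigned before filtering, the surviving rows keep the numbers 1 and 3 instead of being renumbered 1 and 2.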


Altair RapidMiner Community
