Join on partial match

online360 Member Posts: 34 Contributor II
edited December 2019 in Help
Hi!

I'm working on increasing the relevance of product-search results on our website by importing synonyms into our system.
To do that, I downloaded the synonym list from OpenThesaurus.

I don't want to import all the synonyms (that would increase the indexing time), only those groups in which at least one of the words also occurs in our database.

I therefore processed our product documents to get a word list and converted it to data. First question: which type of data is it now, text or polynominal?

Second question:
How can I now filter out those synonym groups in which none of the synonyms appears in the word list?

An example:
My list
Bike
Boat
Car

The synonym list from OpenThesaurus
bike, bicycle
boat, motorboat, sailboat
airplane, plane

In the example above, the resulting data set should be:
bike, bicycle
boat, motorboat, sailboat

(as neither "airplane" nor "plane" is in the word list)
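The filtering rule in this example can be sketched outside RapidMiner as a plain set lookup. This is only an illustration in Python, not a RapidMiner process; the function name `filter_synonym_groups` is hypothetical, and it assumes that matching should ignore case (so "Bike" matches "bike") and that each synonym group is one comma-separated line.

```python
def filter_synonym_groups(word_list, synonym_lines):
    """Keep only the synonym lines that share at least one term with word_list."""
    # Assumption: comparison is case-insensitive and ignores surrounding spaces.
    known = {w.strip().lower() for w in word_list}
    kept = []
    for line in synonym_lines:
        terms = [t.strip().lower() for t in line.split(",")]
        # A group is kept as soon as any one of its terms is a known word.
        if any(t in known for t in terms):
            kept.append(line)
    return kept


words = ["Bike", "Boat", "Car"]
synonyms = ["bike, bicycle", "boat, motorboat, sailboat", "airplane, plane"]
result = filter_synonym_groups(words, synonyms)
print(result)
```

With the three example lines above, only the "bike" and "boat" groups are kept.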

I tried Loop Attributes, Loop Values and several other combinations.

Is there a simple way to do that?

Thanks!

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    Sounds like you could use Generate Attributes to create a new attribute such as "Contains Bike" and then join on it?

    ~Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • online360 Member Posts: 34 Contributor II
    Hi!

    Thanks!
    You mean like the following process?

    At the moment, "cable" for example would also be found if the synonym is "energy-cable" or similar (please see the function in "Generate Attributes").
    Is there a way to match only those values that have no other letter directly before or after the loop_value (only a space, comma or punctuation mark would be allowed; I guess using regex)?
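The whole-word requirement described here can be illustrated with a regular expression that uses lookarounds. The sketch below is in Python, not RapidMiner's expression language, and whether the same pattern is accepted by a RapidMiner function is not confirmed here. The helper name `whole_word_pattern` is hypothetical, and treating a hyphen as part of the word (so that "energy-cable" is rejected) is an assumption taken from the example in the post.

```python
import re


def whole_word_pattern(word):
    """Build a pattern that matches `word` only as a standalone term."""
    # Negative lookbehind/lookahead: no letter, digit, underscore or hyphen
    # may sit directly before or after the word. Spaces, commas and other
    # punctuation (or the start/end of the string) are allowed.
    return re.compile(
        r"(?<![\w-])" + re.escape(word) + r"(?![\w-])",
        re.IGNORECASE,
    )


pattern = whole_word_pattern("cable")
rows = [
    "cable, wire",
    "energy-cable, power line",
    "cables, cords",
    "cord, Cable.",
]
matches = [row for row in rows if pattern.search(row)]
print(matches)
```

Only the first and last rows match: "energy-cable" fails the lookbehind and "cables" fails the lookahead.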

    Thanks!
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.1.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.1.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.1.000" expanded="true" height="68" name="Retrieve t123_product_words" width="90" x="112" y="85">
            <parameter key="repository_entry" value="//Local Repository/data/t123_product_words"/>
          </operator>
          <operator activated="true" class="sample_stratified" compatibility="7.1.000" expanded="true" height="82" name="Sample (Stratified)" width="90" x="246" y="85"/>
          <operator activated="true" breakpoints="after" class="loop_values" compatibility="7.1.000" expanded="true" height="82" name="Loop Values" width="90" x="380" y="85">
            <parameter key="attribute" value="word"/>
            <process expanded="true">
              <operator activated="true" class="retrieve" compatibility="7.1.000" expanded="true" height="68" name="Retrieve synonyms_all" width="90" x="179" y="85">
                <parameter key="repository_entry" value="//Local Repository/data/synonyms_all"/>
              </operator>
              <operator activated="true" class="generate_attributes" compatibility="7.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="136">
                <list key="function_descriptions">
                  <parameter key="contains_attribute" value="if(contains(att1,%{loop_value}),att1,&quot;NOMATCH&quot;)"/>
                </list>
              </operator>
              <connect from_op="Retrieve synonyms_all" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="7.1.000" expanded="true" height="82" name="Append" width="90" x="514" y="85"/>
          <operator activated="true" class="remove_duplicates" compatibility="7.1.000" expanded="true" height="82" name="Remove Duplicates" width="90" x="648" y="85">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="contains_attribute"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="7.1.000" expanded="true" height="103" name="Filter Examples" width="90" x="782" y="85">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="contains_attribute.does_not_equal.NOMATCH"/>
            </list>
          </operator>
          <connect from_op="Retrieve t123_product_words" from_port="output" to_op="Sample (Stratified)" to_port="example set input"/>
          <connect from_op="Sample (Stratified)" from_port="example set output" to_op="Loop Values" to_port="example set"/>
          <connect from_op="Loop Values" from_port="out 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Remove Duplicates" to_port="example set input"/>
          <connect from_op="Remove Duplicates" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    Sure. I think contains actually takes regexes, even though it is not explicitly documented.

    ~Martin
  • online360 Member Posts: 34 Contributor II
    Hi Martin!

    I added a "Split" operator inside the loop so that each resulting attribute can be tested with an exact-match comparison.

    How can I express "equals attribute1 or attribute2 or attribute3, ..."?
    The process tells me that "||" is only allowed for boolean or numerical attributes.
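The error suggests that "||" is being applied to the text values themselves rather than to the results of comparisons. The logic being asked for is: compare each split column with the searched word separately, then OR the boolean results. A minimal Python sketch of that idea follows; `row_matches` is a hypothetical helper, and representing missing split columns as `None` is an assumption made for illustration.

```python
def row_matches(row, searched_word):
    """True if any cell in the row equals searched_word exactly."""
    target = searched_word.strip().lower()
    # Each cell is compared on its own; any() ORs the boolean results,
    # which is different from OR-ing the strings and comparing once.
    return any(
        cell is not None and cell.strip().lower() == target
        for cell in row
    )


# One row per synonym group, one cell per split column (None = missing).
rows = [
    ["bike", "bicycle", None],
    ["boat", "motorboat", "sailboat"],
    ["airplane", "plane", None],
]
flags = ["YESMATCH" if row_matches(r, "boat") else "NOMATCH" for r in rows]
print(flags)
```

Because the comparison is exact per cell, "boat" flags the second row only and does not match "motorboat" as a substring.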

    Thanks,
    Steven

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.1.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.1.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.1.000" expanded="true" height="68" name="Retrieve t123_product_words" width="90" x="112" y="85">
            <parameter key="repository_entry" value="//Local Repository/data/t123_product_words"/>
          </operator>
          <operator activated="true" class="sample_stratified" compatibility="7.1.000" expanded="true" height="82" name="Sample (Stratified)" width="90" x="246" y="85"/>
          <operator activated="true" breakpoints="after" class="loop_values" compatibility="7.1.000" expanded="true" height="82" name="Loop Values" width="90" x="380" y="85">
            <parameter key="attribute" value="word"/>
            <process expanded="true">
              <operator activated="true" class="retrieve" compatibility="7.1.000" expanded="true" height="68" name="Retrieve synonyms_all" width="90" x="179" y="85">
                <parameter key="repository_entry" value="//Local Repository/data/synonyms_all"/>
              </operator>
              <operator activated="true" class="split" compatibility="7.1.000" expanded="true" height="82" name="Split" width="90" x="313" y="85">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="att1"/>
              </operator>
              <operator activated="true" class="generate_attributes" compatibility="7.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="136">
                <list key="function_descriptions">
                  <parameter key="contains_attribute" value="if(equals([att1_1]||[att1_2]||[att1_3]||[att1_4]||[att1_5]||[att1_6]||[att1_7]||[att1_8]||[att1_9]||[att1_10]||[att1_11]||[att1_12]||[att1_13]||[att1_14]||[att1_15]||[att1_16]||[att1_17]||[att1_18]||[att1_19],%{loop_value}),&quot;YESMATCH&quot;,&quot;NOMATCH&quot;)"/>
                </list>
              </operator>
              <connect from_op="Retrieve synonyms_all" from_port="output" to_op="Split" to_port="example set input"/>
              <connect from_op="Split" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="7.1.000" expanded="true" height="82" name="Append" width="90" x="514" y="85"/>
          <operator activated="true" class="remove_duplicates" compatibility="7.1.000" expanded="true" height="82" name="Remove Duplicates" width="90" x="648" y="85">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="contains_attribute"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="7.1.000" expanded="true" height="103" name="Filter Examples" width="90" x="782" y="85">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="contains_attribute.does_not_equal.NOMATCH"/>
            </list>
          </operator>
          <connect from_op="Retrieve t123_product_words" from_port="output" to_op="Sample (Stratified)" to_port="example set input"/>
          <connect from_op="Sample (Stratified)" from_port="example set output" to_op="Loop Values" to_port="example set"/>
          <connect from_op="Loop Values" from_port="out 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Remove Duplicates" to_port="example set input"/>
          <connect from_op="Remove Duplicates" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • online360 Member Posts: 34 Contributor II
    Hi Martin!

    I managed to make this process work, but unfortunately it always gets stuck between loop iterations 150 and 300.

    Do you have an idea how to simplify it or make it consume less memory?
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.1.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.1.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.1.001" expanded="true" height="68" name="Retrieve t123_product_words" width="90" x="112" y="85">
            <parameter key="repository_entry" value="//Local Repository/data/t123_product_words"/>
          </operator>
          <operator activated="true" breakpoints="after" class="loop_values" compatibility="7.1.001" expanded="true" height="82" name="Loop Values" width="90" x="380" y="85">
            <parameter key="attribute" value="word"/>
            <process expanded="true">
              <operator activated="true" class="retrieve" compatibility="7.1.001" expanded="true" height="68" name="Retrieve synonyms_all_lowercase_splitted_trimmed" width="90" x="313" y="85">
                <parameter key="repository_entry" value="../data/synonyms_all_lowercase_splitted_trimmed"/>
              </operator>
              <operator activated="true" class="generate_attributes" compatibility="7.1.001" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="648" y="136">
                <list key="function_descriptions">
                  <parameter key="searched_word" value="trim(%{loop_value})"/>
                </list>
              </operator>
              <operator activated="true" class="generate_attributes" compatibility="7.1.001" expanded="true" height="82" name="Generate Attributes" width="90" x="782" y="136">
                <list key="function_descriptions">
                  <parameter key="contains_attribute" value="if([word_1]==[searched_word]||[word_2]==[searched_word]||[word_3]==[searched_word]||[word_4]==[searched_word]||[word_5]==[searched_word]||[word_6]==[searched_word]||[word_7]==[searched_word]||[word_8]==[searched_word]||[word_9]==[searched_word]||[word_10]==[searched_word]||[word_11]==[searched_word]||[word_12]==[searched_word]||[word_13]==[searched_word]||[word_14]==[searched_word]||[word_15]==[searched_word]||[word_16]==[searched_word]||[word_17]==[searched_word]||[word_18]==[searched_word]||[word_19]==[searched_word]||[word_20]==[searched_word]||[word_21]==[searched_word]||[word_22]==[searched_word]||[word_23]==[searched_word]||[word_24]==[searched_word]||[word_25]==[searched_word]||[word_26]==[searched_word]||[word_27]==[searched_word]||[word_28]==[searched_word]||[word_29]==[searched_word]||[word_30]==[searched_word]||[word_31]==[searched_word]||[word_32]==[searched_word]||[word_33]==[searched_word]||[word_34]==[searched_word]||[word_35]==[searched_word]||[word_36]==[searched_word]||[word_37]==[searched_word]||[word_38]==[searched_word]||[word_39]==[searched_word]||[word_40]==[searched_word]||[word_41]==[searched_word]||[word_42]==[searched_word]||[word_43]==[searched_word]||[word_44]==[searched_word]||[word_45]==[searched_word]||[word_46]==[searched_word]||[word_47]==[searched_word]||[word_48]==[searched_word]||[word_49]==[searched_word]||[word_50]==[searched_word]||[word_51]==[searched_word]||[word_52]==[searched_word]||[word_53]==[searched_word]||[word_54]==[searched_word]||[word_55]==[searched_word]||[word_56]==[searched_word]||[word_57]==[searched_word]||[word_58]==[searched_word]||[word_59]==[searched_word]||[word_60]==[searched_word]||[word_61]==[searched_word]||[word_62]==[searched_word]||[word_63]==[searched_word]||[word_64]==[searched_word]||[word_65]==[searched_word]||[word_66]==[searched_word]||[word_67]==[searched_word]||[word_68]==[searched_word]||[word_69]==[searched_word]||[word_70]==[searched_word]||[word_71]==[searched_word]||[word_72]==[searched_word]||[word_73]==[searched_word]||[word_74]==[searched_word]||[word_75]==[searched_word]||[word_76]==[searched_word]||[word_77]==[searched_word]||[word_78]==[searched_word]||[word_79]==[searched_word]||[word_80]==[searched_word]||[word_81]==[searched_word]||[word_82]==[searched_word]||[word_83]==[searched_word]||[word_84]==[searched_word]||[word_85]==[searched_word]||[word_86]==[searched_word]||[word_87]==[searched_word]||[word_88]==[searched_word]||[word_89]==[searched_word]||[word_90]==[searched_word],&quot;YES&quot;,&quot;NO&quot;)"/>
                </list>
              </operator>
              <connect from_op="Retrieve synonyms_all_lowercase_splitted_trimmed" from_port="output" to_op="Generate Attributes (2)" to_port="example set input"/>
              <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="7.1.001" expanded="true" height="82" name="Append" width="90" x="514" y="85"/>
          <operator activated="true" class="filter_examples" compatibility="7.1.001" expanded="true" height="103" name="Filter Examples" width="90" x="782" y="85">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="contains_attribute.does_not_equal.YES"/>
            </list>
          </operator>
          <operator activated="true" class="store" compatibility="7.1.001" expanded="true" height="68" name="Store" width="90" x="983" y="85">
            <parameter key="repository_entry" value="//Local Repository/data/t123_synonyms_processed"/>
          </operator>
          <connect from_op="Retrieve t123_product_words" from_port="output" to_op="Loop Values" to_port="example set"/>
          <connect from_op="Loop Values" from_port="out 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Store" to_port="input"/>
          <connect from_op="Store" from_port="through" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Thanks,
    Steven
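As posted, the process retrieves the full synonym table once per product word and then appends every copy, so the intermediate data appears to grow with (number of words) × (number of synonym rows). One common alternative is to build a lookup from each synonym term to the groups it belongs to, then resolve every product word with a single lookup. The Python sketch below only illustrates that idea and is not a RapidMiner process; `build_index` and `matching_groups` are hypothetical names, and it assumes the groups are already lowercased, split and trimmed, as the repository entry name suggests.

```python
from collections import defaultdict


def build_index(synonym_groups):
    """Map each term to the set of group positions that contain it."""
    index = defaultdict(set)
    for group_id, group in enumerate(synonym_groups):
        for term in group:
            index[term.strip().lower()].add(group_id)
    return index


def matching_groups(product_words, synonym_groups):
    """Return the groups containing at least one product word, in input order."""
    index = build_index(synonym_groups)  # one pass over the synonym data
    hit_ids = set()
    for word in product_words:  # one dictionary lookup per product word
        hit_ids |= index.get(word.strip().lower(), set())
    return [synonym_groups[i] for i in sorted(hit_ids)]


groups = [
    ["bike", "bicycle"],
    ["boat", "motorboat", "sailboat"],
    ["airplane", "plane"],
]
kept = matching_groups(["Bike", "Boat", "Car"], groups)
print(kept)
```

The synonym data is read once and nothing is duplicated per word, so memory stays proportional to the size of the synonym list plus the word list. In a tabular tool, the equivalent would be to reshape the synonyms into one row per (group, term) pair and join that against the word list.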