Data to Similarity - how to define the control group

christina_dehme Member Posts: 3 Contributor I
edited November 2018 in Help

Hi everyone,

I have a large number of documents in two folders, "auditor report" and "audit committee report" (AC), and want to compare them. The "Data to Similarity" operator compares every file with every other file. I only want to compare the files whose names match.

The documents in folder 1 "auditor report" are named: year_company name

and the documents in folder 2 "audit committee report" are named: AC_year_company name

So instead of comparing each document with every document from the other folder, I just want to compare the matching documents (= same year and company name in the document name).

 

Many thanks in advance!!!

Christina

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Assuming the time stamps match (i.e. yyyy in one file and yyyy in the other file), just use a Join operator first to join the two files together and match on your timestamp. Then use the similarity measures.

  • christina_dehme Member Posts: 3 Contributor I

    Hi Thomas,

    thanks for the quick reply. I tried the "Join" operator before testing for similarity. I chose join type "inner" and used "metadata_file" as both the left and the right key attribute. But somehow it didn't work out as I expected.

    For example:

    AC_2015_A.G.Barr PLC,GB00B6XZKY75

    should be matched, before I use the similarity operator, with

    2015_A.G.Barr PLC,GB00B6XZKY75

    so that the similarity test runs only between those two files (almost the same name, once with and once without AC in the document name) instead of comparing every document with every other.

     

    This is what I've got:

    similarity_test.jpg, similarity_test_2.jpg

    Thanks a lot in advance

    Christina
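
A likely reason the inner join on "metadata_file" did not behave as expected is that the two names are never identical: one of them carries the "AC_" prefix. The following is only a minimal sketch in Python, outside RapidMiner, of deriving a shared key before joining. The helper `match_key` and the sample names other than the A.G.Barr ones are made up for illustration.

```python
def match_key(file_name: str, prefix: str = "AC_") -> str:
    """Return the file name without the audit committee prefix (hypothetical helper)."""
    return file_name[len(prefix):] if file_name.startswith(prefix) else file_name


# Only the A.G.Barr names come from the thread; the others are invented samples.
auditor_reports = ["2015_A.G.Barr PLC,GB00B6XZKY75", "2016_Example Ltd"]
committee_reports = ["AC_2015_A.G.Barr PLC,GB00B6XZKY75", "AC_2014_Other PLC"]

# Index the committee reports by their normalised key.
committee_by_key = {match_key(name): name for name in committee_reports}

# Inner join: keep only auditor reports that have a committee counterpart.
pairs = [
    (name, committee_by_key[name])
    for name in auditor_reports
    if name in committee_by_key
]

# Auditor reports without a counterpart, worth checking before running similarity.
unmatched = [name for name in auditor_reports if name not in committee_by_key]

print(pairs)
print(unmatched)
```

In RapidMiner terms this would correspond to generating a key attribute without the prefix on one side and then joining on that attribute instead of on the raw file name.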

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Dear Christina,

    I do not think that there is any way to do this without a loop. Probably something like Loop Values, Filter Examples for the value, a left join with the other table, and then Data to Similarity.

    In RM 7 we added a Group into Collection operator in the Operator Toolbox extension. That would make it a bit nicer.

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
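
The loop idea above can be illustrated outside RapidMiner: for each key that exists on both sides, pick the two documents and compute a single similarity value. This is only a sketch under assumptions. The keys and texts are invented, and cosine similarity on raw word counts stands in for whatever measure is configured in Data to Similarity.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[token] * counts_b[token] for token in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented sample documents, keyed by "year_company" (prefix already removed).
auditor = {
    "2015_A": "goodwill impairment risk",
    "2016_B": "revenue recognition",
}
committee = {
    "2015_A": "goodwill impairment risk",
    "2016_B": "pension liabilities",
    "2017_C": "going concern",
}

# One comparison per key present in both folders, nothing else.
similarities = {
    key: cosine_similarity(auditor[key], committee[key])
    for key in sorted(auditor.keys() & committee.keys())
}

for key, value in similarities.items():
    print(key, round(value, 3))
```

Each key produces exactly one comparison, so the number of similarity computations grows with the number of matched pairs rather than with the square of the document count.
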
  • christina_dehme Member Posts: 3 Contributor I

    Dear Martin,

    I installed the new version of RM. I still get the same results and don't see how to solve my problem of matching samples. I have two files, and the program should be able to read the name of each document and check only the matching ones for similarity. It's still comparing all documents with each other. As I have over 400 documents in total, the program does not run with so many.

    Thanks in advance

    Christina

    Here you can see which match I want to have. So my question is: which operator do I have to use? In Excel it would work with =A2="AC_"&B2

    similarity_test.jpg
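
To give a feel for why comparing everything with everything becomes heavy, here is a small back-of-the-envelope calculation in Python. It assumes 200 reports per folder, which is a guess: the thread only says "over 400 documents in total".

```python
def all_pairs(n_documents: int) -> int:
    """Number of unordered document pairs when every document meets every other."""
    return n_documents * (n_documents - 1) // 2


reports_per_folder = 200  # assumption, the thread only gives the total
total_documents = 2 * reports_per_folder

all_pair_count = all_pairs(total_documents)  # every document against every other
matched_pair_count = reports_per_folder      # one comparison per matching name

print(all_pair_count, matched_pair_count)
```

Under that assumption the all-pairs comparison covers 79,800 pairs, while only 200 of them are the matched pairs that are actually wanted.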

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Dear Christina,

    I thought about something along the lines of the attached process. Not too handsome, but working. 7.5 has a slightly different loop interface, but its loops are parallelized and therefore way faster.

    Best,

    Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data" compatibility="7.3.001" expanded="true" height="68" name="Generate Data" width="90" x="112" y="34"/>
    <operator activated="true" class="generate_id" compatibility="7.3.001" expanded="true" height="82" name="Generate ID" width="90" x="246" y="34"/>
    <operator activated="true" class="extract_macro" compatibility="7.3.001" expanded="true" height="68" name="Extract Macro" width="90" x="380" y="34">
    <parameter key="macro" value="exa"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="generate_data" compatibility="7.3.001" expanded="true" height="68" name="Generate Data (2)" width="90" x="112" y="136"/>
    <operator activated="true" class="generate_id" compatibility="7.3.001" expanded="true" height="82" name="Generate ID (2)" width="90" x="246" y="136"/>
    <operator activated="true" class="loop" compatibility="7.3.001" expanded="true" height="103" name="Loop" width="90" x="514" y="85">
    <parameter key="set_iteration_macro" value="true"/>
    <parameter key="iterations" value="%{exa}"/>
    <process expanded="true">
    <operator activated="true" class="filter_example_range" compatibility="7.3.001" expanded="true" height="82" name="Filter Example Range" width="90" x="179" y="34">
    <parameter key="first_example" value="%{iteration}"/>
    <parameter key="last_example" value="%{iteration}"/>
    </operator>
    <operator activated="true" class="filter_example_range" compatibility="7.3.001" expanded="true" height="82" name="Filter Example Range (2)" width="90" x="179" y="187">
    <parameter key="first_example" value="%{iteration}"/>
    <parameter key="last_example" value="%{iteration}"/>
    </operator>
    <operator activated="true" class="append" compatibility="7.3.001" expanded="true" height="103" name="Append" width="90" x="313" y="85"/>
    <operator activated="true" class="data_to_similarity" compatibility="7.3.001" expanded="true" height="82" name="Data to Similarity" width="90" x="447" y="85"/>
    <operator activated="true" class="similarity_to_data" compatibility="7.3.001" expanded="true" height="82" name="Similarity to Data" width="90" x="581" y="85"/>
    <connect from_port="input 1" to_op="Filter Example Range" to_port="example set input"/>
    <connect from_port="input 2" to_op="Filter Example Range (2)" to_port="example set input"/>
    <connect from_op="Filter Example Range" from_port="example set output" to_op="Append" to_port="example set 1"/>
    <connect from_op="Filter Example Range (2)" from_port="example set output" to_op="Append" to_port="example set 2"/>
    <connect from_op="Append" from_port="merged set" to_op="Data to Similarity" to_port="example set"/>
    <connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data" to_port="similarity"/>
    <connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data" to_port="exampleSet"/>
    <connect from_op="Similarity to Data" from_port="exampleSet" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="126"/>
    <portSpacing port="source_input 3" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="append" compatibility="7.3.001" expanded="true" height="82" name="Append (2)" width="90" x="648" y="85"/>
    <connect from_op="Generate Data" from_port="output" to_op="Generate ID" to_port="example set input"/>
    <connect from_op="Generate ID" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Loop" to_port="input 1"/>
    <connect from_op="Generate Data (2)" from_port="output" to_op="Generate ID (2)" to_port="example set input"/>
    <connect from_op="Generate ID (2)" from_port="example set output" to_op="Loop" to_port="input 2"/>
    <connect from_op="Loop" from_port="output 1" to_op="Append (2)" to_port="example set 1"/>
    <connect from_op="Append (2)" from_port="merged set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
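
Reading the process above, the loop appears to pair the two tables by row position: in each iteration, both Filter Example Range operators select row %{iteration}, the two rows are appended, and Data to Similarity runs on just that pair. A rough Python analogue is sketched below. It assumes both tables contain the same number of rows in the same order, uses invented numeric rows, and takes Euclidean distance only as a stand-in for the configured measure.

```python
import math


def pairwise_by_position(left_rows, right_rows):
    """Compare the i-th left row only with the i-th right row."""
    if len(left_rows) != len(right_rows):
        raise ValueError("both tables need the same number of rows")
    results = []
    # Positions start at 1 to mirror the iteration macro used in the process.
    for position, (left, right) in enumerate(zip(left_rows, right_rows), start=1):
        results.append((position, math.dist(left, right)))
    return results


# Invented numeric rows standing in for the two generated example sets.
left_table = [(0.0, 0.0), (1.0, 2.0)]
right_table = [(3.0, 4.0), (1.0, 2.0)]

distances = pairwise_by_position(left_table, right_table)
print(distances)
```

Because the pairing is purely positional, real data would first have to be sorted by the same key (for example the file name without the "AC_" prefix) so that row i in one table really belongs to row i in the other.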

     

     
