RAPIDMINER 9.7 BETA ANNOUNCEMENT

The beta program for the RapidMiner 9.7 release is now available. Lots of amazing new improvements including true version control!

CLICK HERE TO DOWNLOAD

Export "Data To Similarity" results to a CSV

ClaraCabaClaraCaba Member Posts: 9 Contributor I
edited November 2018 in Help
Hi!

I am working with text mining in Rapidminer and the following problem has arised:

I use the Data to Similarity operator from the Text Extension and the "sim" output port gives a table with three columns: one object, another object, and the similarity between them. However, I can't sort or export that result, which I'd love to do, in order to be able to work with that data as a CSV file.

Is there any way to export that table?

Thank you very much!
Tagged:

Answers

  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,408  RM Data Scientist
    Hi,

    you can take Similarity to Data to get an Example Set out of it. Afterwards you can store it with any Write operator.

    ~Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • ClaraCabaClaraCaba Member Posts: 9 Contributor I
    Thank you very much, that worked perfectly.

    I am facing now another problem, though.

    After using the Similarity to Data operator, I have a dataset with three columns: the first id used for comparison, the second id used for comparison, and the similarity percentage. Now, I would like to combine that information with my original database (which has many attributes). I don't know how to, for example, obtain the rows from my original database where the similarity percentage is greater than 50%. Any idea?

    Thank you in advance.
  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,408  RM Data Scientist
    Hi,

    Use a Filter Examples to delete the examples< 0.5. Afterwards you can join the original data. If you do not have an ID in the dataset, you can use GenerateID before hand to add one.

    ~Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • ClaraCabaClaraCaba Member Posts: 9 Contributor I
    Hi,

    Thank you very much!

    However, I have a last question. I have applied Data to Similarity and then Similarity to Data right after, to be able to use the output dataset. But the dataset contains all results duplicated, since I have applied both operators. How could I prevent this from happening? Or how could I get rid of the duplicated results and just keep a row per similarity between two objects?

    Thank you.
  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,408  RM Data Scientist
    Hi,

    a general idea is to use Cross Distance, it is a bit more flexible.

    For your question:
    Do i understand it correctly, that you have the distance twice in like this

    ID1  ID2  SIM
    2      1      0.5
    1      2      0.5
    My first idea would be to create a new ID with the Two IDs you have. I would always take the smaller one first. So you always get a string like

    SmallNumber _ BigNumber

    This results in this:

    if([FIRST_ID]>[SECOND_ID],
    concat(str([FIRST_ID]),"_",str([SECOND_ID])),
    concat(str([SECOND_ID]),"_",str([FIRST_ID]))
    )
    Afterwards you can use Remove Duplicates on this. See attached Process

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.0.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="34">
            <parameter key="repository_entry" value="//Samples/data/Golf"/>
          </operator>
          <operator activated="true" class="data_to_similarity" compatibility="7.0.001" expanded="true" height="82" name="Data to Similarity" width="90" x="246" y="34"/>
          <operator activated="true" class="similarity_to_data" compatibility="7.0.001" expanded="true" height="82" name="Similarity to Data" width="90" x="380" y="34"/>
          <operator activated="true" class="generate_attributes" compatibility="7.0.001" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="34">
            <list key="function_descriptions">
              <parameter key="IdToRemoveDuplicates" value="if([FIRST_ID]&gt;[SECOND_ID],&#10;&#9;concat(str([FIRST_ID]),&quot;_&quot;,str([SECOND_ID])),&#10;&#9;concat(str([SECOND_ID]),&quot;_&quot;,str([FIRST_ID]))&#10;)"/>
            </list>
            <description align="center" color="transparent" colored="false" width="126">Create an ID to remove the stuff</description>
          </operator>
          <operator activated="true" class="remove_duplicates" compatibility="7.0.001" expanded="true" height="82" name="Remove Duplicates" width="90" x="648" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="IdToRemoveDuplicates"/>
          </operator>
          <connect from_op="Retrieve Golf" from_port="output" to_op="Data to Similarity" to_port="example set"/>
          <connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data" to_port="similarity"/>
          <connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data" to_port="exampleSet"/>
          <connect from_op="Similarity to Data" from_port="exampleSet" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
          <connect from_op="Remove Duplicates" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • ClaraCabaClaraCaba Member Posts: 9 Contributor I
    Thank you very very very much!!! :D

    That worked perfectly.
  • sangeet171188sangeet171188 Member Posts: 8 Contributor I

    And how to get count of similar looking sets( Text field). For the below set I want count like

    ABC is good text -----3

    XYZ is great -----------2

     

    FIRST SECOND SIMILARITY textfield

    1               2                    1           ABC is a good text

    3                8                    1           ABC is a good text

    4                9                      1          ABC is a good text

    12              32                    1            XYZ is great 

    31              77                    1            XYZ is great

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761   Unicorn

    Can't you use an Aggregate operator for this?

  • sangeet171188sangeet171188 Member Posts: 8 Contributor I

    Thanks Thomas. Results achieved.

Sign In or Register to comment.