Export "Data To Similarity" results to a CSV

ClaraCaba · March 2016

Hi!

I am working on text mining in RapidMiner and the following problem has arisen:

I use the Data to Similarity operator from the Text Extension and the "sim" output port gives a table with three columns: one object, another object, and the similarity between them. However, I can't sort or export that result, which I'd love to do, in order to be able to work with that data as a CSV file.

Is there any way to export that table?

Thank you very much!

MartinLiebig · April 2016

Hi,

You can use Similarity to Data to get an ExampleSet out of it. Afterwards you can store it with any Write operator.

~Martin
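
For readers who want to see the shape of the exported data outside RapidMiner, here is a minimal Python sketch of the same idea: a three-column table of pairs is sorted by similarity and written out as CSV. The column names `FIRST_ID`, `SECOND_ID`, `SIMILARITY`, the helper `pairs_to_csv` and the sample values are assumptions for illustration, not output copied from the operator.

```python
import csv
import io

def pairs_to_csv(rows):
    """Sort (first_id, second_id, similarity) rows by similarity,
    highest first, and return them as CSV text."""
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["FIRST_ID", "SECOND_ID", "SIMILARITY"])  # assumed header names
    writer.writerows(ordered)
    return buffer.getvalue()

# Made-up sample pairs, only to show the layout.
rows = [(1, 2, 0.5), (1, 3, 0.9), (2, 3, 0.1)]
csv_text = pairs_to_csv(rows)
print(csv_text)
```

To write a file instead of a string, the same writer can be pointed at `open("similarities.csv", "w", newline="")`.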

ClaraCaba · April 2016

Thank you very much, that worked perfectly.

I am now facing another problem, though.

After using the Similarity to Data operator, I have a dataset with three columns: the first id used for comparison, the second id used for comparison, and the similarity percentage. Now, I would like to combine that information with my original database (which has many attributes). I don't know how to, for example, obtain the rows from my original database where the similarity percentage is greater than 50%. Any idea?

Thank you in advance.

MartinLiebig · April 2016

Hi,

Use Filter Examples to remove the examples with a similarity below 0.5. Afterwards you can join the result with the original data. If you do not have an ID in the dataset, you can use Generate ID beforehand to add one.

~Martin
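
The filter-then-join logic described above can be sketched in plain Python, which may help when checking what the RapidMiner process should return. This is only an illustration under assumptions: the attribute name `id`, the helper `rows_above_threshold` and the sample data are hypothetical, and pairs at exactly 0.5 are kept because only examples below 0.5 are removed.

```python
def rows_above_threshold(pairs, original, threshold=0.5):
    """Keep pairs with similarity >= threshold and attach the
    original rows for both ids (a simple join on the id)."""
    by_id = {row["id"]: row for row in original}  # lookup table keyed by id
    joined = []
    for first_id, second_id, sim in pairs:
        if sim < threshold:
            continue  # drop examples below the threshold
        joined.append({
            "similarity": sim,
            "first": by_id[first_id],
            "second": by_id[second_id],
        })
    return joined

# Hypothetical original data with an id attribute.
original = [
    {"id": 1, "text": "first document"},
    {"id": 2, "text": "second document"},
    {"id": 3, "text": "third document"},
]
pairs = [(1, 2, 0.8), (1, 3, 0.1), (2, 3, 0.5)]
result = rows_above_threshold(pairs, original)
for entry in result:
    print(entry["similarity"], entry["first"]["text"], "|", entry["second"]["text"])
```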

ClaraCaba · April 2016

Hi,

Thank you very much!

However, I have one last question. I have applied Data to Similarity and then Similarity to Data right after, to be able to use the output dataset. But the dataset contains every result twice, since I have applied both operators. How could I prevent this from happening? Or how could I get rid of the duplicated results and keep just one row per similarity between two objects?

Thank you.

MartinLiebig · April 2016

Hi,

A general idea is to use Cross Distances; it is a bit more flexible.

For your question:
Do I understand correctly that you have each distance twice, like this?


ID1  ID2  SIM
2    1    0.5
1    2    0.5

My first idea would be to create a new ID from the two IDs you have. I would always take the smaller one first, so you always get a string like

SmallNumber _ BigNumber

This results in the following expression:


if([FIRST_ID]<[SECOND_ID],
	concat(str([FIRST_ID]),"_",str([SECOND_ID])),
	concat(str([SECOND_ID]),"_",str([FIRST_ID]))
)

Afterwards you can use Remove Duplicates on this new attribute. See the attached process:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="34">
        <parameter key="repository_entry" value="//Samples/data/Golf"/>
      </operator>
      <operator activated="true" class="data_to_similarity" compatibility="7.0.001" expanded="true" height="82" name="Data to Similarity" width="90" x="246" y="34"/>
      <operator activated="true" class="similarity_to_data" compatibility="7.0.001" expanded="true" height="82" name="Similarity to Data" width="90" x="380" y="34"/>
      <operator activated="true" class="generate_attributes" compatibility="7.0.001" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="34">
        <list key="function_descriptions">
          <parameter key="IdToRemoveDuplicates" value="if([FIRST_ID]&lt;[SECOND_ID],&#10;&#9;concat(str([FIRST_ID]),&quot;_&quot;,str([SECOND_ID])),&#10;&#9;concat(str([SECOND_ID]),&quot;_&quot;,str([FIRST_ID]))&#10;)"/>
        </list>
        <description align="center" color="transparent" colored="false" width="126">Create an ID to remove the stuff</description>
      </operator>
      <operator activated="true" class="remove_duplicates" compatibility="7.0.001" expanded="true" height="82" name="Remove Duplicates" width="90" x="648" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="IdToRemoveDuplicates"/>
      </operator>
      <connect from_op="Retrieve Golf" from_port="output" to_op="Data to Similarity" to_port="example set"/>
      <connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data" to_port="similarity"/>
      <connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data" to_port="exampleSet"/>
      <connect from_op="Similarity to Data" from_port="exampleSet" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
      <connect from_op="Remove Duplicates" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
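
The same deduplication idea can be checked outside RapidMiner with a short Python sketch: build an order-independent key from the two IDs and keep only the first row seen for each key. The helper name `remove_mirrored_pairs` and the sample rows are hypothetical, and the sketch assumes the IDs are comparable values such as integers.

```python
def remove_mirrored_pairs(pairs):
    """Keep one row per unordered pair of ids, dropping mirrored repeats."""
    seen = set()
    unique = []
    for first_id, second_id, sim in pairs:
        # Smaller id first, so (2, 1) and (1, 2) share the same key.
        key = (min(first_id, second_id), max(first_id, second_id))
        if key in seen:
            continue
        seen.add(key)
        unique.append((first_id, second_id, sim))
    return unique

# Sample rows in the mirrored layout discussed above.
pairs = [(2, 1, 0.5), (1, 2, 0.5), (1, 3, 0.2), (3, 1, 0.2)]
unique = remove_mirrored_pairs(pairs)
print(unique)
```

Which of the two mirrored rows survives depends only on the input order; the similarity value is the same for both.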

ClaraCaba · April 2016

Thank you very very very much!!!

That worked perfectly.

sangeet171188 · June 2017

And how do I get a count of similar-looking sets (text field)? For the set below, I want counts like:

ABC is a good text ----- 3
XYZ is great ----- 2

FIRST  SECOND  SIMILARITY  textfield
1      2       1           ABC is a good text
3      8       1           ABC is a good text
4      9       1           ABC is a good text
12     32      1           XYZ is great
31     77      1           XYZ is great

Thomas_Ott · June 2017

Can't you use an Aggregate operator for this?
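
As a rough illustration of what such an aggregation computes (a count of rows grouped by the text field), here is a small Python sketch using the rows from the question. It is not the Aggregate operator itself, only the equivalent grouping logic, and the variable names are made up for the example.

```python
from collections import Counter

# Rows from the question: FIRST, SECOND, SIMILARITY, textfield.
rows = [
    (1, 2, 1, "ABC is a good text"),
    (3, 8, 1, "ABC is a good text"),
    (4, 9, 1, "ABC is a good text"),
    (12, 32, 1, "XYZ is great"),
    (31, 77, 1, "XYZ is great"),
]

# Group by the text field and count the rows in each group.
counts = Counter(text for _, _, _, text in rows)
for text, count in counts.items():
    print(text, count)
```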

sangeet171188 · June 2017

Thanks Thomas. Results achieved.
