Arrange list of names by similarity?

DAVID_EALES · May 2018

Hi All,

I am a complete novice with RapidMiner and, despite watching multiple videos and trawling the forum, I am unable to get my head around how to solve what I think is a very simple problem!

I have a list of names (approx. 5k). All I want to achieve is to sort this list of names by similarity.

All I have process-wise so far is:

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Local Repository/email test"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="8.1.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="136">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="name_recipients"/>
      </operator>
      <operator activated="true" class="data_to_similarity" compatibility="8.1.001" expanded="true" height="82" name="Data to Similarity" width="90" x="514" y="136"/>
      <connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Data to Similarity" to_port="example set"/>
      <connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I would be most grateful for anyone's assistance.

Kind Regards

lionelderkrikor · May 2018

Hi again @DAVID_EALES,

Interesting but difficult task.....

I found a resource in the community that seems interesting for your project.

To sum up, you can use the Deduplicate Names operator of the Rosette Text Analytics extension.

This extension must be installed from the Marketplace. You must also obtain an API key to use it.

I tested it like this with your (very partial) example set:

This process gives the following result:

I hope it will be useful.

Regards,

Lionel

lionelderkrikor · May 2018

Hi @DAVID_EALES,

Here is a process that computes and sorts the distances between the names in a list, using the Data to Similarity operator:

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.0.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="85">
        <parameter key="generator_type" value="comma_separated_text"/>
        <list key="function_descriptions"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="input_csv_text" value="Att1&#10;Michael,&#10;Lionel,&#10;John,&#10;Jordan,&#10;Bruce,&#10;Dan,&#10;Jordan,&#10;Michel"/>
      </operator>
      <operator activated="true" class="data_to_similarity" compatibility="8.1.003" expanded="true" height="82" name="Data to Similarity" width="90" x="313" y="85">
        <parameter key="numerical_measure" value="CosineSimilarity"/>
      </operator>
      <operator activated="true" class="similarity_to_data" compatibility="8.1.003" expanded="true" height="82" name="Similarity to Data" width="90" x="447" y="85"/>
      <operator activated="true" class="sort" compatibility="8.1.003" expanded="true" height="82" name="Sort" width="90" x="581" y="85">
        <parameter key="attribute_name" value="DISTANCE"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Data to Similarity" to_port="example set"/>
      <connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data" to_port="similarity"/>
      <connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data" to_port="exampleSet"/>
      <connect from_op="Similarity to Data" from_port="exampleSet" to_op="Sort" to_port="example set input"/>
      <connect from_op="Sort" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I don't know your dataset or exactly what you want to do, but for nominal attributes (the names, in your case) the distance will always be either 0 (a perfect match, in other words the two names are identical) or 1 (in all other cases). So your table will be filled only with "1" and "0".
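To illustrate the point outside RapidMiner, here is a minimal Python sketch of a 0/1 nominal distance applied to the same sample names as the process above. The function name `nominal_distance` is made up for this illustration and is not a RapidMiner API; it only mimics the behaviour described here.

```python
from itertools import combinations

def nominal_distance(a: str, b: str) -> int:
    """Hypothetical helper: 0 when the two values are identical, 1 otherwise."""
    return 0 if a == b else 1

# Sample names taken from the Create ExampleSet operator in the process above.
names = ["Michael", "Lionel", "John", "Jordan", "Bruce", "Dan", "Jordan", "Michel"]

# Distance for every unordered pair of names.
pairs = [(a, b, nominal_distance(a, b)) for a, b in combinations(names, 2)]

# Only exact duplicates get distance 0; "Michael" vs "Michel" still gets 1.
zero_pairs = [p for p in pairs if p[2] == 0]
print(zero_pairs)
```

With such a measure, near-matches like "Michael" and "Michel" are indistinguishable from completely different names, which is why a string-aware measure is needed for sorting by similarity.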

Regards,

Lionel

Telcontar120 · May 2018

In the free Operator Toolbox extension, there is an operator to Generate Levenshtein Distance, which I think is more in line with what you want to do. But I am not sure exactly what you mean by sorting the list, because to do that you would first have to select one name as the reference name against which the similarity of all other names is calculated.
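As a rough illustration of that idea in plain Python (not the Operator Toolbox operator itself), the sketch below computes the Levenshtein edit distance and sorts a list by distance to a chosen reference name. The function names `levenshtein` and `sort_by_reference` are assumptions made for this example.

```python
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def sort_by_reference(names: List[str], reference: str) -> List[str]:
    """Hypothetical helper: order names by edit distance to one reference name."""
    return sorted(names, key=lambda name: levenshtein(name, reference))

names = ["John", "Michel", "Lionel", "Michael"]
print(sort_by_reference(names, "Michael"))
```

The result depends entirely on which reference name is picked, which is the ambiguity raised above.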

DAVID_EALES · May 2018

Thanks to all for your replies thus far.

To explain further, I want to group (cluster?) email addresses based on similarity rather than alphabetically. So, for example....

Alphabetical sort....

1joe.bloggs@domain.com
a.user@domain.com
another.person@domain.com
joe.bloggs@domain.com
k@domain.com
soe.blogs@domain.com

What I am trying to achieve....

a.user@domain.com
another.person@domain.com
1joe.bloggs@domain.com
joe.bloggs@domain.com
soe.blogs@domain.com
k@domain.com

I understand about the distance measurement, but how do I take that distance measurement and use it to rearrange the output?
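One possible way to turn pairwise similarity into an ordering, sketched in Python rather than as a RapidMiner process, is a greedy chain: start from one item and repeatedly append whichever remaining item is most similar to the last one placed. This is only an illustration under assumptions: it uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure, the function names are invented for the example, and the resulting order will not necessarily match the hand-made list above.

```python
from difflib import SequenceMatcher
from typing import List

def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 (1.0 means identical strings)."""
    return SequenceMatcher(None, a, b).ratio()

def greedy_similarity_order(items: List[str]) -> List[str]:
    """Hypothetical helper: chain each item to its most similar unplaced neighbour."""
    remaining = list(items)
    if not remaining:
        return []
    ordered = [remaining.pop(0)]  # the first item is an arbitrary starting point
    while remaining:
        last = ordered[-1]
        best = max(remaining, key=lambda candidate: similarity(last, candidate))
        remaining.remove(best)
        ordered.append(best)
    return ordered

addresses = [
    "1joe.bloggs@domain.com",
    "a.user@domain.com",
    "another.person@domain.com",
    "joe.bloggs@domain.com",
    "k@domain.com",
    "soe.blogs@domain.com",
]
for address in greedy_similarity_order(addresses):
    print(address)
```

The chain compares every placed item against all remaining ones, so the work grows quadratically with the list size; for around 5k entries that is millions of comparisons and may be slow with this measure.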

Hope the above makes sense.

Kind Regards

DAVID_EALES · May 2018

Many Thanks Lionel, your idea worked.

Kind Regards

DAVID_EALES · May 2018

Ok, so the solution proposed by Lionel worked during testing, but I am unable to get it to run through the entire list as I am getting Error 504.

I have split the data into batches of 1000 rows and it all processes fine but I need it to be able to process the entire list of 5k entries at once.

Is this some sort of timeout error? I have looked at the Rosette documentation and I can't find any mention of it.

Kind Regards

lionelderkrikor · May 2018

Hi @DAVID_EALES,

According to your last message, it works for datasets of up to 1k rows --> OK

But normally it should work free of charge with datasets of up to 10k rows (see the description in the RapidMiner documentation).

I have contacted Rosette support to find out what is going on with this error (Error 504). Maybe the limit has been changed...

Regards,

Lionel

lionelderkrikor · May 2018

Hi @DAVID_EALES,

It seems that your hypothesis is the right one.

Rosette is working on a fix for the next release. Here is Rosette's answer:

"Lionel,

We were able to trace this to an internal issue where our Name Deduplicate endpoint is timing out on large calls. Our suggestion would be to break the calls up to smaller chunks. We have an open an internal L3 Issue to correct this in a future release. Also for future reference here is a link to our error codes.

https://developer.rosette.com/features-and-functions#errors

I will hold this ticket open and will provide you a follow on update once we release a complete fix for this issue.

Best Regards,"
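The "smaller chunks" suggestion can be sketched generically in Python as follows. This is not Rosette or RapidMiner code: `process_in_batches` and the `handler` callback are hypothetical stand-ins for whatever call processes one batch, and the batch size of 1000 is taken from the batches reported as working earlier in the thread.

```python
from typing import Callable, Iterator, List, Sequence

def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def process_in_batches(
    names: Sequence[str],
    handler: Callable[[Sequence[str]], List[str]],  # hypothetical per-batch call
    batch_size: int = 1000,
) -> List[str]:
    """Run `handler` on each batch and concatenate the per-batch results."""
    results: List[str] = []
    for batch in chunked(names, batch_size):
        results.extend(handler(batch))
    return results

# Usage with a dummy handler that simply echoes its input.
sample = [f"name{i}" for i in range(2500)]
combined = process_in_batches(sample, handler=lambda batch: list(batch))
print(len(combined))
```

One caveat of this approach: names that end up in different batches are never compared with each other, so similar names split across batches would not be grouped together.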

Regards,

Lionel

DAVID_EALES · May 2018

Thank You Lionel, much appreciated.
