Arrange list of names by similarity?

DAVID_EALES Member Posts: 5 Contributor I
edited November 2018 in Help

Hi All,

 

I am a complete novice with RapidMiner and, despite watching multiple videos and trawling the forum, I am unable to get my head around how to solve what I think is a very simple problem!

 

I have a list of names (approx. 5k). All I want to achieve is to sort this list of names by similarity.

 

All I have process-wise so far is:

 

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Local Repository/email test"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.1.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="name_recipients"/>
</operator>
<operator activated="true" class="data_to_similarity" compatibility="8.1.001" expanded="true" height="82" name="Data to Similarity" width="90" x="514" y="136"/>
<connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

 

I would be most grateful for anyone's assistance.

 

Kind Regards

 

 

Best Answer

  • lionelderkrikor Posts: 731   Unicorn
    Solution Accepted

    Hi again @DAVID_EALES,

     

    Interesting but difficult task.....

    I found a resource in the community that looks useful for your project.

     

    To sum up, you can use the Deduplicate Names operator of the Rosette Text Analytics extension.

    This extension must be installed from the Marketplace. You must also obtain an API key to use it.

     

    I tested it like this with your (very partial) example set:

    Cluster_names.png

     

    This process gives the following result:

    Cluster_names_2.png

    I hope it will be useful.

     

    Regards,

     

    Lionel

     

     

     

     

     

     

     

Answers

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 731   Unicorn

    Hi @DAVID_EALES,

     

    Here is a process that computes and sorts the distance between the names in a list, using the Data to Similarity operator:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.0.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="85">
    <parameter key="generator_type" value="comma_separated_text"/>
    <list key="function_descriptions"/>
    <list key="numeric_series_configuration"/>
    <list key="date_series_configuration"/>
    <list key="date_series_configuration (interval)"/>
    <parameter key="input_csv_text" value="Att1&#10;Michael,&#10;Lionel,&#10;John,&#10;Jordan,&#10;Bruce,&#10;Dan,&#10;Jordan,&#10;Michel"/>
    </operator>
    <operator activated="true" class="data_to_similarity" compatibility="8.1.003" expanded="true" height="82" name="Data to Similarity" width="90" x="313" y="85">
    <parameter key="numerical_measure" value="CosineSimilarity"/>
    </operator>
    <operator activated="true" class="similarity_to_data" compatibility="8.1.003" expanded="true" height="82" name="Similarity to Data" width="90" x="447" y="85"/>
    <operator activated="true" class="sort" compatibility="8.1.003" expanded="true" height="82" name="Sort" width="90" x="581" y="85">
    <parameter key="attribute_name" value="DISTANCE"/>
    </operator>
    <connect from_op="Create ExampleSet" from_port="output" to_op="Data to Similarity" to_port="example set"/>
    <connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data" to_port="similarity"/>
    <connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data" to_port="exampleSet"/>
    <connect from_op="Similarity to Data" from_port="exampleSet" to_op="Sort" to_port="example set input"/>
    <connect from_op="Sort" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    I don't know your dataset or exactly what you want to do, but for nominal attributes (the names in your case), the distance will always be either 0 (when the two names match exactly, in other words they are the same) or 1 (in all other cases). So your table will be filled only with "1" and "0".

     

    Regards,

     

    Lionel
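
To illustrate the 0/1 behaviour described above outside RapidMiner, here is a minimal Python sketch. The `nominal_distance` function is a hypothetical stand-in for how a nominal measure treats two values, not RapidMiner's actual implementation; the names are the ones from the process above.

```python
from itertools import combinations


def nominal_distance(a: str, b: str) -> int:
    """Distance between two nominal values: 0 if identical, 1 otherwise."""
    return 0 if a == b else 1


names = ["Michael", "Lionel", "John", "Jordan", "Bruce", "Dan", "Jordan", "Michel"]

# Distance for every unordered pair of positions in the list.
distances = {
    (i, j): nominal_distance(names[i], names[j])
    for i, j in combinations(range(len(names)), 2)
}

# Only the duplicated "Jordan" entries end up at distance 0.
exact_matches = [(names[i], names[j]) for (i, j), d in distances.items() if d == 0]
print(exact_matches)  # [('Jordan', 'Jordan')]

# "Michel" is scored as being just as far from "Michael" as "Bruce" is.
print(nominal_distance("Michael", "Michel"), nominal_distance("Michael", "Bruce"))  # 1 1
```

This is why sorting on such a distance cannot put near-identical spellings next to each other: every non-identical pair gets the same value.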

     

     

     

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,172   Unicorn

    In the free Operator Toolbox extension, there is a Generate Levenshtein Distance operator, which I think is more in line with what you want to do. But I am not sure exactly what you mean by sorting the list, because to do that you would first have to select one name as the reference against which every other name's similarity is calculated.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
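
For readers who want to see what a Levenshtein-based ranking against a reference name looks like, here is a rough Python sketch. It is a plain textbook implementation of the edit distance, not the code behind the Operator Toolbox operator, and the choice of "Michael" as the reference name is purely an assumption for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


names = ["Michael", "Lionel", "John", "Jordan", "Bruce", "Dan", "Jordan", "Michel"]

# Hypothetical reference name: every other name is ranked by its distance to it.
reference = "Michael"
ranked = sorted(names, key=lambda name: levenshtein(reference, name))
print(ranked[:2])  # ['Michael', 'Michel']
```

The ranking changes completely with a different reference name, which is the point made above: a single sorted list only makes sense once a reference has been chosen.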
  • DAVID_EALES Member Posts: 5 Contributor I

    Thanks to all for your replies thus far :)

     

    To explain further, I want to group (cluster?) email addresses by similarity rather than alphabetically. For example:

     

    Alphabetical sort....

     

    [email protected]
    [email protected]
    [email protected]
    [email protected]
    [email protected]
    [email protected]

     

     

    What I am trying to achieve....

     

    [email protected]
    [email protected]
    [email protected]
    [email protected]
    [email protected]
    [email protected]

     

    I understand about the distance measurement, but how do I take that distance measurement and use it to rearrange the output?

     

    Hope the above makes sense.

     

    Kind Regards
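
One way to turn a similarity measure into a rearranged list, sketched in Python rather than RapidMiner: walk through the list once, put each address into the first group whose first member is similar enough, and then print the groups one after another. Everything here is an assumption made for illustration: the addresses are made up, the similarity is the standard library's `difflib.SequenceMatcher` ratio on the part before the `@`, and the 0.75 threshold is arbitrary and would need tuning on real data.

```python
from difflib import SequenceMatcher
from typing import List


def local_part(address: str) -> str:
    """Part of an email address before the '@', lower-cased."""
    return address.split("@", 1)[0].lower()


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the local parts of two addresses."""
    return SequenceMatcher(None, local_part(a), local_part(b)).ratio()


def group_by_similarity(items: List[str], threshold: float = 0.75) -> List[List[str]]:
    """Greedy grouping: each item joins the first group whose first member
    is at least `threshold` similar, otherwise it starts a new group."""
    groups: List[List[str]] = []
    for item in items:
        for group in groups:
            if similarity(item, group[0]) >= threshold:
                group.append(item)
                break
        else:
            groups.append([item])
    return groups


# Made-up addresses, in no particular order.
addresses = [
    "adam.smith@example.com",
    "john.doe@example.com",
    "a.smith@example.com",
    "jon.doe@example.com",
    "adam.smyth@example.com",
    "mary.jones@example.org",
]

groups = group_by_similarity(addresses)
# Flatten the groups to get the list rearranged by similarity.
rearranged = [address for group in groups for address in group]
for address in rearranged:
    print(address)
```

This greedy pass only compares each address with one representative per group, so it is cheap but depends on the input order; a proper clustering operator would be more robust on a 5k list.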

  • DAVID_EALES Member Posts: 5 Contributor I

    Many Thanks Lionel, your idea worked.

     

    Kind Regards

  • DAVID_EALES Member Posts: 5 Contributor I

    Ok, so the solution proposed by Lionel worked during testing, but I am unable to get it to run through the entire list as I am getting Error 504.

     

    I have split the data into batches of 1000 rows and it all processes fine but I need it to be able to process the entire list of 5k entries at once.

     

    Is this some sort of timeout error? I have looked at the Rosette documentation and I can't find any mention of it.

     

    Kind Regards
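
The batching workaround described above can be sketched generically in Python. The `call` argument is a hypothetical placeholder for whatever sends one batch to the remote service; it is not a Rosette or RapidMiner API, and the lambda in the example only stands in for it.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batches(rows: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `rows` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def process_in_batches(
    rows: Sequence[T],
    size: int,
    call: Callable[[Sequence[T]], List[R]],
) -> List[R]:
    """Apply `call` to each batch and concatenate the results in order."""
    results: List[R] = []
    for batch in batches(rows, size):
        results.extend(call(batch))
    return results


# 5,000 placeholder names split into batches of 1,000 rows.
names = [f"name_{i}" for i in range(5000)]
sizes = [len(batch) for batch in batches(names, 1000)]
print(sizes)  # [1000, 1000, 1000, 1000, 1000]

# Stand-in for the remote request: here it just upper-cases each name.
processed = process_in_batches(names, 1000, lambda batch: [n.upper() for n in batch])
print(len(processed))  # 5000
```

One caveat with this approach: names that land in different batches are never compared with each other, so group or cluster labels returned for one batch presumably cannot be matched directly against those from another.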

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 731   Unicorn

    Hi @DAVID_EALES,

     

    According to your last message, it works for datasets of up to 1k rows, which is fine.

    Normally, though, it should work free of charge with datasets of up to 10k rows (see the documentation (description) in RapidMiner).

    I have contacted Rosette support to find out what is going on with this error (Error 504). Maybe the limit has been changed.

     

    Regards,

     

    Lionel

     

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 731   Unicorn

    Hi @DAVID_EALES,

     

    It seems that your hypothesis is the right one.

    Rosette is working on a fix for a future release. Here is their answer:

     

    "Lionel,

    We were able to trace this to an internal issue where our Name Deduplicate endpoint is timing out on large calls.  Our suggestion would be to break the calls up to smaller chunks.   We have an open an internal L3 Issue to correct this in a future release.   Also for future reference here is a link to our error codes.

    https://developer.rosette.com/features-and-functions#errors

    I will hold this ticket open and will provide you a follow on update once we release a complete fix for this issue.

    Best Regards,"

     

    Regards,

     

    Lionel

  • DAVID_EALES Member Posts: 5 Contributor I

    Thank You Lionel, much appreciated.
