"Clustering text data"

aquavitae · October 2012

Hi

I have a table of text data, specifically contact information (e.g. names, addresses, phone numbers etc). All this data has been entered manually, some of it is missing, and there is a good chance that there are duplicate examples with slightly different values, e.g. different capitalization, spaces in the phone numbers, full name vs initials. I need to scan through this and generate a list of examples which appear to be similar.

My first approach was to use the "Data To Similarity" operator to list pairs with a similarity higher than 0.9, but this doesn't quite give the results I expect. This may be partly because I'm not sure which measure type to use, but I think it was also because it didn't take into account things like mismatched case.

My second attempt was to use the text processing tools, processing the data using "Process Documents from Data". However, this appears to concatenate all attributes within an example. I'm pretty sure that this is the approach I need to take, but I am stuck on a few points:

1. How do I deal with missing data? Ideally, examples should not be compared on attributes which are missing.
2. As far as I understand, "Process Documents from Data" concatenates attributes, but I want to compare individual attributes in the examples. E.g. two similar names can match, but a name which is similar to an address shouldn't.
3. What model is appropriate for clustering the output from "Process Documents from Data"? I don't know the number of clusters, since it depends on how similar the examples are, so I can't use k-means. In a previous attempt with only one attribute I used DBSCAN, which worked well but took a very long time to process.

I don't know how much use this is, but here is the XML as it is at the moment. I have sampled 100 examples for testing.


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="446" width="705">
      <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//DB/zacptpgis01/Example Sets/dbo.tbl_Contacts"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.2.008" expanded="true" height="76" name="Set Role" width="90" x="112" y="165">
        <parameter key="name" value="CID"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="sample" compatibility="5.2.008" expanded="true" height="76" name="Sample" width="90" x="179" y="30">
        <list key="sample_size_per_class"/>
        <list key="sample_ratio_per_class"/>
        <list key="sample_probability_per_class"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Text" width="90" x="313" y="30"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
        <list key="specify_weights"/>
        <process expanded="true" height="446" width="725">
          <operator activated="true" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases" width="90" x="112" y="30"/>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="dbscan" compatibility="5.2.008" expanded="true" height="76" name="Clustering" width="90" x="581" y="30">
        <parameter key="min_points" value="1"/>
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Sample" to_port="example set input"/>
      <connect from_op="Sample" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

lovefinearts198 · November 2012

Hello,

1. How do I deal with missing data? Ideally, examples should not be compared on attributes which are missing.

You can use an operator that replaces missing values with a value you set (numerical or nominal), for example "Replace Missing Values".

Regards,

MariusHelf · November 2012

Hi,

Maybe you should treat the attributes one by one, i.e. use a Process Documents operator for each attribute. Then you can define custom rules for each attribute, e.g. remove spaces, slashes and dashes from the phone number field, transform names to lowercase, etc.
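
To make the idea of per-attribute rules concrete, here is a minimal sketch in plain Python (not a RapidMiner process). The attribute names `name`, `address` and `phone` and the helper functions are assumptions for illustration; the real rules depend on your data.

```python
import re

def normalize_phone(value):
    """Remove spaces, slashes and dashes so formatting differences disappear."""
    if value is None or not value.strip():
        return None  # treat blank values as missing
    return re.sub(r"[\s/\-]", "", value) or None

def normalize_text(value):
    """Lowercase, drop punctuation and collapse repeated whitespace."""
    if value is None or not value.strip():
        return None  # treat blank values as missing
    cleaned = re.sub(r"[^\w\s]", "", value.lower())
    return " ".join(cleaned.split()) or None

# Hypothetical attribute names: one rule per attribute, so a name is never
# processed (or later compared) with the rule meant for a phone number.
RULES = {
    "name": normalize_text,
    "address": normalize_text,
    "phone": normalize_phone,
}

def normalize_record(record):
    """Apply each attribute's own rule; absent attributes become None."""
    return {attr: rule(record.get(attr)) for attr, rule in RULES.items()}
```

For example, `normalize_record({"name": "J. Smith", "phone": "012 345-67/89"})` gives a record whose name is `j smith`, whose phone is `0123456789`, and whose address is `None`.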

Furthermore, instead of clustering you could also try the Cross-Similarity operator, with the same example set connected to both inputs. That will calculate the similarity of each example to every other example in the set (beware: the new data set will contain n*n examples, where n is the number of examples in the original data set). The similarity operator should ignore missing values.
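
As a rough illustration of pairwise comparison that skips missing values, here is a small Python sketch. It is not what the RapidMiner operator computes: it swaps in `difflib.SequenceMatcher` from the Python standard library as a stand-in string similarity, averages the scores over the attributes present in both records, and assumes records are dictionaries keyed by an id. It also visits each unordered pair once, so it produces n*(n-1)/2 comparisons rather than n*n.

```python
from difflib import SequenceMatcher
from itertools import combinations

def record_similarity(a, b, attributes):
    """Average per-attribute similarity over attributes present in both records.

    Returns None when the two records share no non-missing attribute.
    """
    scores = []
    for attr in attributes:
        left, right = a.get(attr), b.get(attr)
        if not left or not right:
            continue  # skip attributes that are missing on either side
        scores.append(SequenceMatcher(None, left, right).ratio())
    return sum(scores) / len(scores) if scores else None

def similar_pairs(records, attributes, threshold=0.9):
    """Yield (id_a, id_b, score) for each unordered pair at or above threshold."""
    for (id_a, a), (id_b, b) in combinations(records.items(), 2):
        score = record_similarity(a, b, attributes)
        if score is not None and score >= threshold:
            yield id_a, id_b, score

# Made-up sample data, already normalized; ids and values are illustrative only.
contacts = {
    1: {"name": "john smith", "phone": "0123456789"},
    2: {"name": "john smith", "phone": None},
    3: {"name": "mary jones", "phone": "0999999999"},
}
pairs = list(similar_pairs(contacts, ["name", "phone"]))
```

Here records 1 and 2 are compared on the name only, because the phone number is missing in record 2. Averaging is just one possible way to combine attributes; weighting some attributes more heavily would be a reasonable alternative.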

Best,
Marius
