2 basic questions on agglomerative clustering and CSV processing

jeanluc · February 2010

Hello,

I have 2 basic questions.

Question 1: I have a CSV file whose examples I want to feed into the Agglomerative Clustering operator. How do I select which column is used for the metric? Also, if this column is a timestamp, do I need any extra processing (such as converting it into milliseconds)? I chose MeasureType=Numerical, Numerical Measure=Euclidean as these appear to meet my needs (I need to cluster examples by how close they are in time).
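For illustration only, here is a minimal Python sketch (outside RapidMiner) of what "converting into milliseconds" would mean for a single timestamp column. The day/month/year format string, the UTC assumption, and the sample values are assumptions, not something RapidMiner requires. Once timestamps are plain numbers, the Euclidean distance on that one column is simply the absolute time difference.

```python
from datetime import datetime, timezone

def to_epoch_millis(text, fmt="%d/%m/%Y %H:%M:%S"):
    """Parse a timestamp string and return milliseconds since the Unix epoch.

    The format string and the UTC time zone are assumptions for this sketch.
    """
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)

# Two hypothetical events, two minutes apart.
a = to_epoch_millis("05/02/2010 21:39:00")
b = to_epoch_millis("05/02/2010 21:37:00")

# In one dimension, Euclidean distance is just the absolute difference.
gap_minutes = abs(a - b) / 60000
print(gap_minutes)
```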

Question 2: with the same setup in mind, can I specify a stop condition for the algorithm so it doesn't continue to calculate clusters until the very end (i.e. the one cluster with everything)? I have hundreds of thousands of examples with events in time but the clusters are small (max 15 minutes apart), so it doesn't make sense to calculate clusters spanning hours, days or months (the total span of the records).

Thank you,
-jl

land · February 2010

Hi,
normally all non-special attributes are used for calculating the distance. So you have two choices: You could either set all other attributes to be special using the Set Role operator on each of them, or you could simply put the Agglomerative Clustering into a Work on Subset operator, which lets you select the attributes. After the subprocess is executed on the subset, the old attributes are attached to the ExampleSet again. Here's a process that will do it this way:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="116" width="279">
      <operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30"/>
      <operator activated="true" class="work_on_subset" expanded="true" height="94" name="Work on Subset" width="90" x="179" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="att1"/>
        <parameter key="attributes" value="att3|att2"/>
        <process expanded="true" height="586" width="683">
          <operator activated="true" class="agglomerative_clustering" expanded="true" height="76" name="Clustering" width="90" x="45" y="30"/>
          <connect from_port="exampleSet" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="through 1"/>
          <connect from_op="Clustering" from_port="example set" to_port="example set"/>
          <portSpacing port="source_exampleSet" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
          <portSpacing port="sink_through 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Work on Subset" to_port="example set"/>
      <connect from_op="Work on Subset" from_port="example set" to_port="result 1"/>
      <connect from_op="Work on Subset" from_port="through 1" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Greetings,
Sebastian

jeanluc · February 2010

Sebastian Land wrote:

Hi,
normally all non-special attributes are used for calculating the distance. So you have two choices: You could either set all other attributes to be special using the Set Role operator on each of them, or you could simply put the Agglomerative Clustering into a Work on Subset operator, which lets you select the attributes. After the subprocess is executed on the subset, the old attributes are attached to the ExampleSet again.

Hi Sebastian,

Thanks for the help, Work on Subset is very convenient.

Something that still confuses me is why a special attribute "id" appears after the Work on Subset operator even though no attribute had this role after reading the CSV. The resulting cluster model has a number of clusters that's practically double the number of examples. I have 5 columns in the original CSV and only one numerical column is selected in the properties of the Work on Subset operator, but the preview of the output also shows the "id" attribute being generated. The operator has "keep subset only" enabled. I tried toggling "include special attributes" on and off, but that makes no difference.

Any suggestions are appreciated, I'm still working through my first week with RM.

Work on Subset.example set (example set)
Meta data: Data Table
Number of examples =52
1 attribute: Generated by: Work on Subset.example set ← Work on Subset.exampleSet ← Read CSV.
output Data: NonSpecialAttributesExampleSet: 52 examples, 1 regular attributes, special attributes = { id = #5: id (integer/single_value) }

land · February 2010

Hi,
could you please post your process? Perhaps there's an error in the metadata transformation that only occurs under special circumstances.

Greetings,
Sebastian

jeanluc · February 2010

Sebastian Land wrote:

Hi,
could you please post your process? Perhaps there's an error in the metadata transformation that only occurs under special circumstances.

Greetings,
Sebastian

Everything is below.

First, the test data set.


"Date","Location","Download","Upload","Latency"
05/02/2010 21:39:00,"Date",4070,351,166
05/02/2010 21:38:00,"home",3793,352,164
05/02/2010 21:38:00,"home",4447,350,169
05/02/2010 21:38:00,"home",3595,350,159
05/02/2010 21:37:00,"home",3077,327,1770
05/02/2010 21:37:00,"home",2230,309,259
05/02/2010 11:52:00,"downtown",76,117,219
05/02/2010 11:52:00,"downtown",163,68,205
05/02/2010 11:51:00,"downtown",723,231,186
05/02/2010 11:51:00,"downtown",377,0,270
04/02/2010 21:50:00,"home",2632,327,165
04/02/2010 21:49:00,"home",2803,328,188
04/02/2010 21:49:00,"home",1586,329,276
04/02/2010 21:48:00,"home",2765,357,218
04/02/2010 21:48:00,"home",1634,198,335
04/02/2010 11:43:00,"downtown",692,255,235
04/02/2010 11:43:00,"downtown",602,113,2717
04/02/2010 11:42:00,"downtown",775,56,239
04/02/2010 11:42:00,"downtown",779,312,8148
04/02/2010 11:41:00,"downtown",225,43,221
04/02/2010 11:41:00,"downtown",471,286,3328
03/02/2010 21:50:00,"home",1239,276,4229
03/02/2010 21:49:00,"home",1339,272,2262
03/02/2010 21:48:00,"home",1600,313,197
03/02/2010 21:47:00,"home",2135,313,187
03/02/2010 21:47:00,"home",2026,269,271
03/02/2010 11:50:00,"downtown",711,266,210
03/02/2010 11:50:00,"downtown",152,315,2638
03/02/2010 11:49:00,"downtown",24,249,301
03/02/2010 11:47:00,"downtown",561,291,1740
03/02/2010 11:47:00,"downtown",863,115,213
02/02/2010 21:54:00,"home",1540,351,200
02/02/2010 21:54:00,"home",1493,285,205
02/02/2010 21:53:00,"home",1606,319,194
02/02/2010 21:53:00,"home",1823,319,174
02/02/2010 21:53:00,"home",2150,250,254
02/02/2010 12:07:00,"downtown",472,273,2266
02/02/2010 12:07:00,"downtown",387,267,2736
02/02/2010 12:06:00,"downtown",381,249,280
02/02/2010 12:04:00,"downtown",312,195,3775
02/02/2010 12:03:00,"downtown",863,260,281
02/02/2010 12:02:00,"downtown",405,111,217
01/02/2010 21:36:00,"home",3326,354,183
01/02/2010 21:36:00,"home",3119,326,172
01/02/2010 21:35:00,"home",3677,330,160
01/02/2010 21:35:00,"home",3151,355,182
01/02/2010 21:35:00,"home",3152,314,282
01/02/2010 11:58:00,"downtown",1244,316,1716
01/02/2010 11:58:00,"downtown",1284,312,192
01/02/2010 11:58:00,"downtown",1211,319,206
01/02/2010 11:57:00,"downtown",900,310,208
01/02/2010 11:57:00,"downtown",683,278,5488

The process:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="463" width="547">
      <operator activated="true" class="read_csv" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
        <parameter key="file_name" value="C:\work\m\test.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss.SSS"/>
      </operator>
      <operator activated="true" class="work_on_subset" expanded="true" height="76" name="Work on Subset" width="90" x="179" y="30">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Download"/>
        <parameter key="keep_subset_only" value="true"/>
        <process expanded="true">
          <connect from_port="exampleSet" to_port="example set"/>
          <portSpacing port="source_exampleSet" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="agglomerative_clustering" expanded="true" height="76" name="Clustering" width="90" x="380" y="30">
        <parameter key="measure_types" value="NumericalMeasures"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Work on Subset" to_port="example set"/>
      <connect from_op="Work on Subset" from_port="example set" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Clustering" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Example set metadata. You can see the extra id attribute.

Cluster Model Text View. Notice there are 2*N-1 clusters, where N is the number of examples.

Also, since we are here, how can I enter a stop condition so clustering doesn't go until the end (when everything has been put in a single cluster)? In the real data I will be working on, I'll be interested in clusters with a distance smaller than a certain preset, chosen by the user. The input data will span months and I'm only interested in clustering events that happened within 15 minutes or so.

Thanks again.

land · February 2010

Hi,
the id attribute is automatically added by the clustering algorithm. It is needed to assign each example to a cluster. Hierarchical cluster models always contain 2n - 1 entries, because they start with each example as its own cluster and then merge two clusters at each step until only one cluster remains: n initial clusters plus n - 1 merged ones.
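To make the 2n - 1 count concrete, here is a small Python sketch of a naive single-link agglomerative clustering on one numeric column. It is not RapidMiner's implementation; the function name and the choice of single linkage are assumptions made for illustration. It records every node of the hierarchy (the n leaves plus each merge) so the total can be counted.

```python
def agglomerate(values):
    """Naive single-link agglomerative clustering on 1-D values.

    Returns every node of the hierarchy as a tuple of member indices:
    first the n single-example leaves, then one node per merge.
    """
    clusters = [(i,) for i in range(len(values))]
    nodes = list(clusters)
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest single-link distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(values[p] - values[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        nodes.append(merged)
    return nodes

downloads = [4070, 3793, 4447, 3595, 3077]  # five sample values
nodes = agglomerate(downloads)
print(len(nodes))  # 2 * 5 - 1 = 9
```

With the 52 examples from the test file, the same counting gives 2 * 52 - 1 = 103 entries.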
This hierarchy can be flattened using the Flatten Clustering operator, which lets you choose how many clusters you want to have. If you need it, we could discuss how to add an option to flatten by the maximum allowed distance instead of the number of clusters.

Greetings,
Sebastian

jeanluc · February 2010

Sebastian Land wrote:

Hi,
the id attribute is automatically added by the clustering algorithm. It is needed to assign each example to a cluster. Hierarchical cluster models always contain 2n - 1 entries, because they start with each example as its own cluster and then merge two clusters at each step until only one cluster remains.

I see now. I thought this was the number of clusters at the last pass, whereas this is the sum of clusters of all passes.

This hierarchy can be flattened using the Flatten Clustering operator, which lets you choose how many clusters you want to have. If you need it, we could discuss how to add an option to flatten by the maximum allowed distance instead of the number of clusters.

I would find such an option very useful. I'm currently exploring what can be done with RM (and not coded explicitly in a custom application). In the real case, I'll have hundreds of thousands of events spread across months but am only concerned about those really clustered together. It's not efficient to continue clustering past a limit and I cannot present RM as a viable option in that case, even though the rest of the application is better.
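As a rough sketch of what such a distance-based stop could look like for a single time column (written in Python, not an existing RM option): assuming single-link merging, cutting the hierarchy at a maximum distance is equivalent to sorting the timestamps and starting a new group whenever the gap to the previous event exceeds the limit. The function name and sample events below are hypothetical.

```python
def group_by_gap(timestamps_ms, max_gap_ms):
    """Group timestamps so that consecutive events within max_gap_ms share a group.

    For one-dimensional data this matches cutting a single-link hierarchy
    at distance max_gap_ms, without ever building the full hierarchy.
    """
    groups = []
    for t in sorted(timestamps_ms):
        if groups and t - groups[-1][-1] <= max_gap_ms:
            groups[-1].append(t)   # close enough to the previous event
        else:
            groups.append([t])     # gap too large: start a new group
    return groups

MINUTE = 60_000
# Hypothetical events: a burst of three, a burst of two, and one isolated event.
events = [0, 1 * MINUTE, 2 * MINUTE, 600 * MINUTE, 601 * MINUTE, 2000 * MINUTE]
groups = group_by_gap(events, 15 * MINUTE)
print(len(groups))  # 3
```

The cost is dominated by the sort, so hundreds of thousands of events are not a problem. Note that single-link chaining means a group can span more than 15 minutes in total as long as each consecutive gap stays within the limit.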

Actually, one more question. Consider the examples will be graphed (say, as scatter plots by time or other attributes). Let's assume the stop condition has been implemented and thus a particular example either belongs to a cluster or to none (it was too far from any other event).

How can I use the output of the clustering operator to colour the dots in the scatter plot differently based on their belonging to a cluster or not?
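One possible approach, sketched in Python rather than with RM operators: assuming each example ends up with a cluster id, derive a two-valued label from the cluster sizes and use that label as the colour dimension of the plot. The function name, the label strings, and the minimum size of 2 are assumptions for illustration.

```python
from collections import Counter

def colour_labels(cluster_ids, min_size=2):
    """Label each example 'clustered' if its cluster has at least min_size
    members, otherwise 'isolated'."""
    sizes = Counter(cluster_ids)
    return ["clustered" if sizes[c] >= min_size else "isolated"
            for c in cluster_ids]

# Hypothetical cluster ids for six examples; cluster 2 has a single member.
cluster_ids = [0, 0, 0, 1, 1, 2]
labels = colour_labels(cluster_ids)
print(labels)
```

The resulting label column can then be mapped to colours by whatever plotting tool is used.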

Thank you.
