2 basic questions on agglomerative clustering and CSV processing

jeanluc Member Posts: 18 Maven
edited July 10 in Help
Hello,

I have 2 basic questions.

Question 1: I have a CSV file whose examples I want to feed into an Agglomerative Clustering operator. How do I select which column is used for the distance metric? Also, if this column is a timestamp, do I need any extra processing (such as converting it into milliseconds)? I chose MeasureType=Numerical, Numerical Measure=Euclidean, as these appear to meet my needs (I need to cluster examples by how close they are in time).
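
Editor's note on the timestamp part of Question 1: a minimal Python sketch, outside RapidMiner, of what "converting a timestamp to a number" amounts to. The format string is an assumption (day/month/year, matching the sample data posted later in this thread), and `to_epoch_seconds` is a hypothetical helper name. With a single numeric column, Euclidean distance reduces to the absolute difference between two values.

```python
from datetime import datetime, timezone

# Assumed layout (day/month/year); adjust to match the real file.
FORMAT = "%d/%m/%Y %H:%M:%S"


def to_epoch_seconds(text: str) -> float:
    """Parse a timestamp string and return seconds since 1970-01-01 UTC."""
    parsed = datetime.strptime(text, FORMAT).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


a = to_epoch_seconds("05/02/2010 21:39:00")
b = to_epoch_seconds("05/02/2010 21:37:00")

# In one dimension, Euclidean distance is just the absolute difference.
distance_seconds = abs(a - b)
print(distance_seconds)
```

Whether the unit is seconds or milliseconds does not change which examples end up closest together; it only rescales the distances.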

Question 2: With the same setup in mind, can I specify a stop condition for the algorithm so that it doesn't keep calculating clusters until the very end (i.e. the one cluster containing everything)? I have hundreds of thousands of examples with events in time, but the clusters are small (at most 15 minutes apart), so it doesn't make sense to calculate clusters spanning hours, days or months (the total span of the records).
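
Editor's note on Question 2: for one-dimensional data such as timestamps, cutting a single-linkage hierarchy at a distance threshold gives the same groups as sorting the values and splitting wherever the gap between neighbours exceeds the threshold. The sketch below illustrates that idea in plain Python; it is not a RapidMiner feature, single linkage is an assumption, and `cluster_by_gap` is a hypothetical name.

```python
def cluster_by_gap(timestamps, max_gap):
    """Split sorted values into groups whose neighbours are at most max_gap apart."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)   # close enough to the previous event
        else:
            clusters.append([t])     # gap too large: start a new group
    return clusters


# Event times in seconds (made-up illustration values).
events = [0, 60, 120, 5000, 5030, 90000]
groups = cluster_by_gap(events, max_gap=15 * 60)
print(groups)
```

Note that with this rule a group can span more than 15 minutes in total when events chain together; only the gap between consecutive events is bounded.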

Thank you,
-jl

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525 Unicorn
    Hi,
    normally all non-special attributes are used for calculating the distance. So you have two choices: you could either set all other attributes to be special by using the Set Role operator on each of them, or you could simply put the Agglomerative Clustering into a Work on Subset operator, which lets you select the attributes. After the subprocess has been executed on the subset, the old attributes are attached to the ExampleSet again. Here is a process that does it this way:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="116" width="279">
          <operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30"/>
          <operator activated="true" class="work_on_subset" expanded="true" height="94" name="Work on Subset" width="90" x="179" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="att1"/>
            <parameter key="attributes" value="att3|att2"/>
            <process expanded="true" height="586" width="683">
              <operator activated="true" class="agglomerative_clustering" expanded="true" height="76" name="Clustering" width="90" x="45" y="30"/>
              <connect from_port="exampleSet" to_op="Clustering" to_port="example set"/>
              <connect from_op="Clustering" from_port="cluster model" to_port="through 1"/>
              <connect from_op="Clustering" from_port="example set" to_port="example set"/>
              <portSpacing port="source_exampleSet" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
              <portSpacing port="sink_through 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Work on Subset" to_port="example set"/>
          <connect from_op="Work on Subset" from_port="example set" to_port="result 1"/>
          <connect from_op="Work on Subset" from_port="through 1" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    Greetings,
      Sebastian
  • jeanluc Member Posts: 18 Maven
    Sebastian Land wrote:

    Hi,
    normally all non-special attributes are used for calculating the distance. So you have two choices: you could either set all other attributes to be special by using the Set Role operator on each of them, or you could simply put the Agglomerative Clustering into a Work on Subset operator, which lets you select the attributes. After the subprocess has been executed on the subset, the old attributes are attached to the ExampleSet again.
    Hi Sebastian,

    Thanks for the help, Work on Subset is very convenient.

    Something that still confuses me is why a special attribute "id" appears after the Work On Subset even though no attribute had this role after reading the CSV. The resulting cluster model has a number of clusters that's practically double the number of examples. I have 5 columns in the original CSV, only one numerical one is selected in the properties of the Work On Subset operator, but the preview of the output also shows the "id" attribute being generated. The operator has "keep subset only" enabled. I tried changing the "include special attributes" on and off, but that makes no difference.

    Any suggestions are appreciated, I'm still working through my first week with RM.

    Work on Subset.example set (example set)
    Meta data: Data Table
    Number of examples =52
    1 attribute: Generated by: Work on Subset.example set ← Work on Subset.exampleSet ← Read CSV.
    output Data: NonSpecialAttributesExampleSet: 52 examples, 1 regular attributes, special attributes = { id = #5: id (integer/single_value) }

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525 Unicorn
    Hi,
    could you please post your process? Perhaps there is an error in the meta data transformation that only occurs under special circumstances.

    Greetings,
      Sebastian
  • jeanluc Member Posts: 18 Maven
    Sebastian Land wrote:

    Hi,
    could you please post your process? Perhaps there is an error in the meta data transformation that only occurs under special circumstances.

    Greetings,
      Sebastian
    Everything is below.

    First, the test data set.

    "Date","Location","Download","Upload","Latency"
    05/02/2010 21:39:00,"Date",4070,351,166
    05/02/2010 21:38:00,"home",3793,352,164
    05/02/2010 21:38:00,"home",4447,350,169
    05/02/2010 21:38:00,"home",3595,350,159
    05/02/2010 21:37:00,"home",3077,327,1770
    05/02/2010 21:37:00,"home",2230,309,259
    05/02/2010 11:52:00,"downtown",76,117,219
    05/02/2010 11:52:00,"downtown",163,68,205
    05/02/2010 11:51:00,"downtown",723,231,186
    05/02/2010 11:51:00,"downtown",377,0,270
    04/02/2010 21:50:00,"home",2632,327,165
    04/02/2010 21:49:00,"home",2803,328,188
    04/02/2010 21:49:00,"home",1586,329,276
    04/02/2010 21:48:00,"home",2765,357,218
    04/02/2010 21:48:00,"home",1634,198,335
    04/02/2010 11:43:00,"downtown",692,255,235
    04/02/2010 11:43:00,"downtown",602,113,2717
    04/02/2010 11:42:00,"downtown",775,56,239
    04/02/2010 11:42:00,"downtown",779,312,8148
    04/02/2010 11:41:00,"downtown",225,43,221
    04/02/2010 11:41:00,"downtown",471,286,3328
    03/02/2010 21:50:00,"home",1239,276,4229
    03/02/2010 21:49:00,"home",1339,272,2262
    03/02/2010 21:48:00,"home",1600,313,197
    03/02/2010 21:47:00,"home",2135,313,187
    03/02/2010 21:47:00,"home",2026,269,271
    03/02/2010 11:50:00,"downtown",711,266,210
    03/02/2010 11:50:00,"downtown",152,315,2638
    03/02/2010 11:49:00,"downtown",24,249,301
    03/02/2010 11:47:00,"downtown",561,291,1740
    03/02/2010 11:47:00,"downtown",863,115,213
    02/02/2010 21:54:00,"home",1540,351,200
    02/02/2010 21:54:00,"home",1493,285,205
    02/02/2010 21:53:00,"home",1606,319,194
    02/02/2010 21:53:00,"home",1823,319,174
    02/02/2010 21:53:00,"home",2150,250,254
    02/02/2010 12:07:00,"downtown",472,273,2266
    02/02/2010 12:07:00,"downtown",387,267,2736
    02/02/2010 12:06:00,"downtown",381,249,280
    02/02/2010 12:04:00,"downtown",312,195,3775
    02/02/2010 12:03:00,"downtown",863,260,281
    02/02/2010 12:02:00,"downtown",405,111,217
    01/02/2010 21:36:00,"home",3326,354,183
    01/02/2010 21:36:00,"home",3119,326,172
    01/02/2010 21:35:00,"home",3677,330,160
    01/02/2010 21:35:00,"home",3151,355,182
    01/02/2010 21:35:00,"home",3152,314,282
    01/02/2010 11:58:00,"downtown",1244,316,1716
    01/02/2010 11:58:00,"downtown",1284,312,192
    01/02/2010 11:58:00,"downtown",1211,319,206
    01/02/2010 11:57:00,"downtown",900,310,208
    01/02/2010 11:57:00,"downtown",683,278,5488
    The process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="463" width="547">
          <operator activated="true" class="read_csv" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
            <parameter key="file_name" value="C:\work\m\test.csv"/>
            <parameter key="column_separators" value=","/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss.SSS"/>
          </operator>
          <operator activated="true" class="work_on_subset" expanded="true" height="76" name="Work on Subset" width="90" x="179" y="30">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="Download"/>
            <parameter key="keep_subset_only" value="true"/>
            <process expanded="true">
              <connect from_port="exampleSet" to_port="example set"/>
              <portSpacing port="source_exampleSet" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="agglomerative_clustering" expanded="true" height="76" name="Clustering" width="90" x="380" y="30">
            <parameter key="measure_types" value="NumericalMeasures"/>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Work on Subset" to_port="example set"/>
          <connect from_op="Work on Subset" from_port="example set" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
          <connect from_op="Clustering" from_port="example set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

    Example set metadata. You can see the extra id attribute.

    image

    Cluster Model Text View. Notice there are 2*N-1 clusters, where N is the number of examples.

    image

    Also, while we are here: how can I enter a stop condition so that clustering doesn't run until the end (when everything has been put into a single cluster)? In the real data I will be working on, I'm interested in clusters with a distance smaller than a certain preset chosen by the user. The input data will span months, and I'm only interested in clustering events that happened within 15 minutes or so of each other.

    Thanks again.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525 Unicorn
    Hi,
    the id attribute is automatically added by the clustering algorithm. It is needed to assign an example to a cluster. Hierarchical cluster models always contain 2n - 1 entries, because they start with each of the n examples being its own cluster and then merge two clusters in each step. This is repeated until only one cluster remains, which takes n - 1 merges and therefore adds n - 1 further entries.
    This hierarchy can be flattened using the Flatten Clustering operator, which lets you choose how many clusters you want to have. If you need it, we could discuss how to add an option for flattening based on the maximal allowed distance instead of the number of clusters.
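
Editor's note: the 2n - 1 count and the idea of flattening by distance can be checked with a small plain-Python sketch. It is a naive single-linkage clustering of one-dimensional values, written for illustration only; it is not RapidMiner's implementation, and `agglomerate` is a hypothetical name.

```python
def agglomerate(values):
    """Naive single-linkage clustering of 1-D values.

    Returns one (distance, left_members, right_members) tuple per merge.
    """
    clusters = [[v] for v in sorted(values)]
    merges = []
    while len(clusters) > 1:
        # In sorted 1-D data the closest pair of clusters is always adjacent.
        gaps = [clusters[i + 1][0] - clusters[i][-1]
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        merges.append((gaps[i], clusters[i], clusters[i + 1]))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return merges


values = [1, 2, 10, 11, 50]
merges = agglomerate(values)

# n leaves plus n - 1 merged clusters gives 2n - 1 entries in the hierarchy.
node_count = len(values) + len(merges)

# Flattening by distance: every merge at or below the cut removes one cluster.
max_distance = 5
clusters_at_cut = len(values) - sum(1 for d, _, _ in merges if d <= max_distance)
print(node_count, clusters_at_cut)
```

For the 52 examples in the test file, the same arithmetic gives 52 + 51 = 103 entries, which is why the model looks roughly twice as large as the data.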

    Greetings,
      Sebastian
  • jeanluc Member Posts: 18 Maven
    Sebastian Land wrote:

    Hi,
    the id attribute is automatically added by the clustering algorithm. It is needed to assign an example to a cluster. Hierarchical cluster models always contain 2n - 1 entries, because they start with each example being its own cluster and then merge two clusters in each step. This is repeated until only one cluster remains.
    I see now. I thought this was the number of clusters at the last pass, whereas this is the sum of clusters of all passes.

    This hierarchy can be flattened using the Flatten Clustering operator, which lets you choose how many clusters you want to have. If you need it, we could discuss how to add an option for flattening based on the maximal allowed distance instead of the number of clusters.
    I would find such an option very useful. I'm currently exploring what can be done with RM (rather than coded explicitly in a custom application). In the real case, I'll have hundreds of thousands of events spread across months, but I am only concerned with those that are really clustered together. It's not efficient to continue clustering past a limit, and I cannot present RM as a viable option in that case, even though the rest of the application is better.

    Actually, one more question. Suppose the examples will be plotted (say, as scatter plots by time or other attributes). Let's assume the stop condition has been implemented, so a particular example either belongs to a cluster or to none (it was too far from any other event).

    How can I use the output of the clustering operator to colour the dots in the scatter plot differently based on their belonging to a cluster or not?

    Thank you.
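
Editor's note on the colouring question: one way to think about it, sketched in plain Python and independent of RapidMiner, is to derive a colour per example from the cluster it ended up in, with a neutral colour for examples that stayed on their own. The example ids, the palette, and the `colour_labels` name are all hypothetical.

```python
def colour_labels(clusters, palette=("red", "blue", "green", "orange")):
    """Map each example id to a colour: grey for singletons, one palette colour per cluster."""
    colours = {}
    next_colour = 0
    for members in clusters:
        if len(members) < 2:
            colour = "grey"                      # not clustered with anything
        else:
            colour = palette[next_colour % len(palette)]
            next_colour += 1
        for example_id in members:
            colours[example_id] = colour
    return colours


# Clusters given as lists of example ids (made-up illustration values).
clusters = [[1, 2, 3], [4], [5, 6]]
colours = colour_labels(clusters)
print(colours)
```

The resulting mapping can then be passed as the colour column of whatever plotting tool is used; the palette repeats if there are more clusters than colours.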