# 2 basic questions on agglomerative clustering and CSV processing

Hello,

I have 2 basic questions.

Question 1: I have a CSV file whose examples I want to feed into an Agglomerative Clustering operator. How do I select which column is used for the distance measure? Also, if this column is a timestamp, do I need any extra processing (such as converting it into milliseconds)? I chose MeasureType=Numerical, Numerical Measure=Euclidean, as these appear to meet my needs (I need to cluster examples by how close they are in time).

Question 2: With the same setup in mind, can I specify a stop condition for the algorithm so that it doesn't keep computing clusters until the very end (i.e. the single cluster containing everything)? I have hundreds of thousands of examples with events in time, but the clusters are small (max 15 minutes apart), so it doesn't make sense to calculate clusters spanning hours, days or months (the total span of the records).

Thank you,

-jl

## Answers

Normally, all non-special attributes are used for calculating the distance. So you have two choices: you could either set all other attributes to be special by using the Set Role operator on each of them, or you could simply put the Agglomerative Clustering into a Work on Subset operator, which lets you select the attributes. After the subprocess has been executed on the subset, the old attributes are attached to the ExampleSet again. Here's a process that does it this way:

Greetings,

Sebastian
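
On the timestamp part of Question 1: a Euclidean distance needs numeric values, so one common approach is to convert each timestamp to seconds (or milliseconds) since a fixed epoch before clustering. The following is a minimal Python sketch of that idea, not a RapidMiner process. The column name `timestamp`, the date format, the UTC assumption and the sample rows are all hypothetical and chosen only for illustration.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical sample data; column names and timestamp format are assumptions.
SAMPLE_CSV = """event_id,timestamp,source
1,2011-03-01 10:00:00,a
2,2011-03-01 10:05:30,b
3,2011-03-01 14:20:00,a
"""


def read_time_column(text, column="timestamp", fmt="%Y-%m-%d %H:%M:%S"):
    """Return one CSV column as seconds since the Unix epoch (UTC assumed)."""
    seconds = []
    for row in csv.DictReader(io.StringIO(text)):
        moment = datetime.strptime(row[column], fmt).replace(tzinfo=timezone.utc)
        seconds.append(moment.timestamp())
    return seconds


# Only the selected column is kept; the other columns are ignored,
# which mirrors restricting the clustering to a single attribute.
times = read_time_column(SAMPLE_CSV)

# On a single numeric column, the Euclidean distance between two
# examples is simply the absolute difference of their values.
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
print(gaps)  # [330.0, 15270.0]
```

With the values expressed in seconds, a 15-minute limit becomes a distance of 900; in milliseconds it would be 900000. The unit does not matter as long as the threshold uses the same one.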

Thanks for the help; Work on Subset is very convenient.

Something that still confuses me is why a special attribute "id" appears after the Work on Subset operator, even though no attribute had this role after reading the CSV. The resulting cluster model has a number of clusters that's practically double the number of examples. I have 5 columns in the original CSV, and only one numerical column is selected in the properties of the Work on Subset operator, but the preview of the output also shows the "id" attribute being generated. The operator has "keep subset only" enabled. I tried toggling "include special attributes" on and off, but that makes no difference.

Any suggestions are appreciated, I'm still working through my first week with RM.

Work on Subset.example set (example set)

Meta data: Data Table

Number of examples =52

1 attribute: Generated by: Work on Subset.example set ← Work on Subset.exampleSet ← Read CSV.

output Data: NonSpecialAttributesExampleSet: 52 examples, 1 regular attributes, special attributes = { id = #5: id (integer/single_value) }

Could you please post your process? Perhaps there's an error in the metadata transformation that only occurs under special circumstances.

Greetings,

Sebastian

First, the test data set. The process: Example set metadata. You can see the extra id attribute.

Cluster Model Text View. Notice there are 2*N-1 clusters, where N is the number of examples.

Also, since we are here, how can I enter a stop condition so that clustering doesn't continue until the end (when everything has been put into a single cluster)? In the real data I will be working on, I'll be interested in clusters with a distance smaller than a preset threshold chosen by the user. The input data will span months, and I'm only interested in clustering events that happened within 15 minutes or so of each other.

Thanks again.

The id attribute is automatically added by the clustering algorithm. It is needed to assign an example to a cluster. Hierarchical cluster models always contain 2n - 1 entries, because they start with each example being its own cluster and then merge two clusters in each step. This is repeated until only one cluster remains.

This hierarchy can be flattened using the Flatten Clustering operator, which lets you choose how many clusters you want to have. If you need it, we could discuss how to add an option for flattening based on the maximum allowed distance instead of the number of clusters.

Greetings,

Sebastian
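
To make the 2n - 1 count and the idea of flattening by maximum distance concrete, here is a minimal Python sketch. It is not how RapidMiner implements either operator. It assumes a single numeric attribute (time in seconds) and single-linkage merging; under those assumptions, cutting the hierarchy at a distance threshold gives the same groups as sorting the values and starting a new cluster wherever the gap between neighbours exceeds the threshold. The function names and the sample event times are hypothetical.

```python
def cluster_by_gap(times, max_gap):
    """Label each value; values chained by gaps <= max_gap share a label.

    For one-dimensional data this matches cutting a single-linkage
    hierarchy at distance max_gap (the threshold is treated as inclusive).
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    labels = [0] * len(times)
    current = 0
    for position, index in enumerate(order):
        # A gap larger than max_gap to the previous sorted value starts a new cluster.
        if position > 0 and times[index] - times[order[position - 1]] > max_gap:
            current += 1
        labels[index] = current
    return labels


def hierarchy_node_count(n_examples):
    """n leaves plus n - 1 merges gives 2n - 1 nodes in a full hierarchy."""
    return 2 * n_examples - 1


# Hypothetical event times in seconds.
events = [0, 300, 800, 7200, 7500, 90000]
labels = cluster_by_gap(events, max_gap=15 * 60)
print(labels)                    # [0, 0, 0, 1, 1, 2]
print(hierarchy_node_count(52))  # 103
```

Because this variant only sorts and scans the values once, it never builds the upper levels of the hierarchy, which is the saving the 15-minute limit is meant to provide. It does not carry over to other linkage modes or to more than one attribute.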

[...] last pass, whereas this is the sum of clusters of all passes.

I would find such an option very useful. I'm currently exploring what can be done with RM (and not coded explicitly in a custom application). In the real case, I'll have hundreds of thousands of events spread across months, but I am only concerned with those that are really clustered together. It's not efficient to continue clustering past a limit, and I cannot present RM as a viable option in that case, even though the rest of the application is better.

Actually, one more question. Consider that the examples will be graphed (say, as scatter plots by time or other attributes). Let's assume the stop condition has been implemented, and thus a particular example either belongs to a cluster or to none (it was too far from any other event).

How can I use the output of the clustering operator to colour the dots in the scatter plot differently based on whether or not they belong to a cluster?

Thank you.
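
The thread does not show how to do this inside RapidMiner. As a rough Python sketch of the underlying idea, assuming each example already carries a flat cluster label and that a label used by only one example means "belongs to no cluster", the colour can be derived from the cluster size. The function name, the colour names and the sample labels are hypothetical.

```python
from collections import Counter


def colour_by_membership(labels, clustered="red", isolated="grey"):
    """Return one colour per example: `clustered` if its label is shared
    with at least one other example, otherwise `isolated`."""
    sizes = Counter(labels)
    return [clustered if sizes[label] > 1 else isolated for label in labels]


# Hypothetical flat cluster labels, one per example.
labels = [0, 0, 0, 1, 1, 2]
colours = colour_by_membership(labels)
print(colours)  # ['red', 'red', 'red', 'red', 'red', 'grey']
```

The resulting list has one entry per example, in the same order as the input, so it can be handed to whatever plotting tool draws the scatter plot as the per-point colour.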