
# "Clustering algorithm for grouping records of similar shapes/characteristics?"

I have the following set of records:


-----

*1 2 3 4 5 6*

15 30 45 60 75 90

5 2 6 3 4 5

10 20 30 40 50 60

10 20 30 40 50 60

10 20 30 40 50 60

10 20 30 40 50 60

10 20 30 40 50 60

10 20 30 40 50 60

10 20 30 40 50 60

-----

All the attributes are numeric. I would like to cluster records of similar shape into a single cluster. This means that, for the above data set, all the records except the 3rd should be grouped in the same cluster due to their close similarity, and the 3rd record should go into a 2nd cluster. Which clustering algorithm should I use to get such a clustering? Kindly help!


## Answers

Consider these rows from the data again:

1 2 3 4 5 6

15 30 45 60 75 90

5 2 6 3 4 5

If you consider the k-means algorithm, it measures the distance between attribute values. As far as I am aware, many clustering algorithms cluster the data based on a distance criterion.

I am looking for an algorithm that is more sensible than just looking at the distance.

If you look at the rows above, the 1st and 2nd are far more similar (the values of all attributes are distributed in exactly the same way) than the 1st and 3rd, even though the 1st and 3rd are closer in terms of distance. I am looking for such a clustering algorithm. As I couldn't find any such clustering algorithm, I posted the question here (as I thought that such sensible algorithms should exist).

From my perspective, it is better to understand the algorithm first than to just run it and get puzzled by too many different results afterwards.

Can you help? Thanks!
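To make the distinction concrete, here is a quick check in plain Python on the three rows quoted above (just an illustration outside RapidMiner; the helper names `euclidean` and `cosine` are my own):

```python
import math

def euclidean(a, b):
    # Straight-line distance: sensitive to the scale of the values.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Cosine of the angle between the vectors: insensitive to scale.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

r1 = [1, 2, 3, 4, 5, 6]
r2 = [15, 30, 45, 60, 75, 90]   # r1 scaled by 15: same "shape"
r3 = [5, 2, 6, 3, 4, 5]         # close to r1 in space, different shape

print(euclidean(r1, r2))  # large (~133.5)
print(euclidean(r1, r3))  # small (~5.3)
print(cosine(r1, r2))     # ~1.0: same direction
print(cosine(r1, r3))     # ~0.87: noticeably different direction
```

By distance, r1 and r3 look like the closest pair; by angle, r1 and r2 are identical, which is exactly the "shape" notion described above.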

It's hard to come up with an answer for what you described, since it is not clear what the data means. For example, what does each data element stand for? Is this a sparse data representation, for example, customer 1 bought products 1, 2, 3, 4, 5 and 6? Or does it mean that customer 1 bought product 1 once, product 2 twice, etc?

Overall, what approach you take really depends on what kind of data you have available and what problem you are trying to solve.

1st part of network (21 samples):

1 / 2 / 3 / 4 / 5 / 6

2nd part of network (315 samples):

15 / 30 / 45 / 60 / 75 / 90

3rd part of network (25 samples):

5 / 2 / 6 / 3 / 4 / 5

Given the number of samples, the 1st and 2nd parts of the network behave in exactly the same way, i.e. their delay distributions are the same/similar.

But I fear that clustering algorithms tend to cluster these in a different way. This is why I am asking whether such algorithms are available for the clustering purpose.

Does it make my problem clear? Thanks!

If I understand correctly, you want to group together the examples whose vectors are scaled variants of each other. You could simply use KMedoids with the cosine similarity as the measure. This will group examples together depending on the angle between the example vectors.

Greetings,

Sebastian
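The effect Sebastian describes can be sketched without any clustering library: if each record is normalized to unit length, scaled variants collapse onto (numerically) the same point, so grouping by angle becomes trivial. A minimal Python sketch, not RapidMiner's actual implementation:

```python
import math

def normalize(v):
    # Project the vector onto the unit sphere; scaled variants become identical.
    # Rounding absorbs tiny floating-point differences.
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(round(x / norm, 10) for x in v)

records = [
    [1, 2, 3, 4, 5, 6],
    [15, 30, 45, 60, 75, 90],
    [5, 2, 6, 3, 4, 5],
    [10, 20, 30, 40, 50, 60],
]

# Group records that point in the same direction.
groups = {}
for r in records:
    groups.setdefault(normalize(r), []).append(r)

for direction, members in groups.items():
    print(members)
# Two groups: the three scaled variants of 1..6 together, [5, 2, 6, 3, 4, 5] alone.
```

This is the same geometry KMedoids with cosine similarity relies on; the medoid-based algorithm just does the grouping iteratively instead of by exact match.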

I tried with this data:

1 2 3 4 5 6

15 30 45 60 75 90

5 2 6 3 4 5

10 20 30 40 50 60

5 10 15 20 25 30

20 40 60 80 100 120

11 22 33 44 55 66

2 4 6 8 10 12

100 200 300 400 500 600

50 100 150 200 250 300

10 4 12 6 8 10

3 4 4 6 5 1

The resulting clusters were:

cluster_1 1 2 3 4 5 6

cluster_0 15 30 45 60 75 90

cluster_2 5 2 6 3 4 5

cluster_1 10 20 30 40 50 60

cluster_1 5 10 15 20 25 30

cluster_1 20 40 60 80 100 120

cluster_1 11 22 33 44 55 66

cluster_1 2 4 6 8 10 12

cluster_1 100 200 300 400 500 600

cluster_1 50 100 150 200 250 300

cluster_2 10 4 12 6 8 10

cluster_0 3 4 4 6 5 1

First of all: not everything that isn't the way you wish it to be is a bug, ok? Sometimes reality simply doesn't match the algorithms. And sometimes users simply don't know what they can expect, because they don't know the algorithms at all. So before you state that something is a bug, please at least make sure you know what you are doing.

So why did this behavior occur?

KMedoids, like KMeans, depends heavily on the random initialization. It might be that your second example was selected as the initial medoid of a cluster, and that, because the data set is so small, the algorithm never gets out of this local optimum. So go ahead and try with more data, or change the local random seed. If none of this works, you should google the cosine similarity and check whether the outcome of the distance function is different when you calculate it by hand with a sheet of paper and a calculator. You can compare that with RapidMiner's results using the ExampleSet2Similarity operator. And if those results are unexplainable, then you can complain about a bug and I will immediately go into the code and fix it.

Greetings,

Sebastian
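For anyone wanting to do the suggested check without a sheet of paper, a few lines of Python (a sketch for verification only, independent of RapidMiner) confirm the cosine similarities for the rows in question:

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point in the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rows from the test data above.
base      = [1, 2, 3, 4, 5, 6]
scaled    = [15, 30, 45, 60, 75, 90]    # 15 * base, yet it landed in a different cluster
odd_one   = [5, 2, 6, 3, 4, 5]
odd_twice = [10, 4, 12, 6, 8, 10]       # 2 * odd_one, correctly grouped with it

print(cosine(base, scaled))        # ~1.0
print(cosine(odd_one, odd_twice))  # ~1.0
```

Both pairs have cosine similarity 1 up to rounding, so the split of `base` and `scaled` into different clusters is not a property of the similarity measure itself, which is consistent with the local-optimum explanation above.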

From next time onwards, I will make sure to go into all the details before presuming or claiming that something is a bug.