Options

"Clustering algorithm for grouping records of similar shapes/characteristics?"

ruserruser Member Posts: 40 Maven
edited May 2019 in Help
I have the following set of records:
-----
1 2 3 4 5 6
15 30 45 60 75 90
5 2 6 3 4 5
10 20 30 40 50 60
10 20 30 40 50 60
10 20 30 40 50 60
10 20 30 40 50 60
10 20 30 40 50 60
10 20 30 40 50 60
10 20 30 40 50 60

-----

All the attributes are numeric. Here, I would like to cluster the records which are similar in shapes in a single cluster. This means, for the above data set, all the records (except the 3rd one) should be grouped in the same cluster due to the close similarity and the 3rd record in the 2nd cluster. Which clustering algorithm I should use to get such a clustering? Kindly help!

Answers

  • Options
    haddockhaddock Member Posts: 849 Maven
    G'Day,
    Which clustering algorithm I should use to get such a clustering?
    This is a silly question, because it allows the following silly answer..
    Whichever algorithm gives such a clustering...
    How can anyone answer without running all the available algorithms on that data?
  • Options
    ruserruser Member Posts: 40 Maven
    1  2  3  4  5  6
    15  30  45  60  75  90
    5  2  6  3  4  5


    If you consider the K-means algorithm, it measures the distance between different attribute values. As I am aware, many of the Clustering algorithms Cluster the data based on the distance criteria.
    I am looking for an algorithm which is more sensible than just looking at the distance.

    If you look at the above records, the 1st and 2nd records are completely similar (values of all the attributes are distributed exactly similar) than 1st and 3rd one, eventhough 1st and 3rd ones are closer in terms of distance. I am looking for such a clustering algorithm. As I couldn't come to know of any such Clustering algorithms, I posted it here (as I thought that such sensible algorithms should exist).

    In my perspective, it is better to know about the algorithm first, than just running and getting puzzled about too many different results after that.

    Can you help? Thanks!
  • Options
    bheckelbheckel Member Posts: 2 Contributor I
    Clustering is usually based on a similarity metric that measures how similar pairs of data vectors as well as cluster representations (aka centroids) and data vectors are related.  The similarity measure could operate on numerical values and be based on Euclidean distance - meaning how far apart items are in n-dimensional space - but it does not have to be.  It could be whatever you want it to be to solve a particular problem.  The outcome obviously also depends on the algorithm and how it is configured, for example, how many clusters you choose to generate.

    It's hard to come up with an answer for what you described, since it is not clear what the data means.  For example, what does each data element stand for?  Is this a sparse data representation, for example, customer 1 bought products 1, 2, 3, 4, 5 and 6?  Or does it mean that customer 1 bought product 1 once, product 2 twice, etc?

    Overall, what approach you take really depends on what kind of data you have available and what problem you are trying to solve.



  • Options
    ruserruser Member Posts: 40 Maven
    The problem is related to finding the parts of the network where the traffic delay distribution is similar. Each of the attributes in the example represent the no. of samples which fall in a specific delay group i.e. 1st attribute for 5-10 ms, 2nd for 11-15 ms etc.
    1st part of network (21 samples)  :  1 /  2 /  3 /  4 /    5 /  6
    2nd part of network (315 samples): 15 / 30 / 45 / 60 /  75 /  90
    3rd part of network (25 samples):    5 /  2 /  6 /  3  /  4  / 5

    Given the no. of samples, the 1st and 2nd part of the network behave exactly in a similar way i.e. their delay distribution is same/similar.
    But, I fear that the 'Clustering' algorithms tend to cluster in a different way. This is the reason why I am asking whether we have such algorithms available for the Clustering purpose.

    Does it make my problem clear? Thanks!
  • Options
    landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    if I understand correctly, you are going to group the examples together, which vectors are scaled variants of each other, you simply could use KMedoids with the cosine similarity as measure. This will group examples together, depending on the angle between the example vectors.

    Greetings,
      Sebastian
  • Options
    ruserruser Member Posts: 40 Maven
    Thanks Sebastian! That seems to closely fit for my requirement. I still have some concerns on the result that was produced.
    I tried with this data:
    Delay class-0 Delay class-1 Delay class-2 Delay class-3 Delay class-4 Delay class-5
    1 2 3 4 5 6
    15 30 45 60 75 90
    5 2 6 3 4 5
    10 20 30 40 50 60
    5 10 15 20 25 30
    20 40 60 80 100 120
    11 22 33 44 55 66
    2 4 6 8 10 12
    100 200 300 400 500 600
    50 100 150 200 250 300
    10 4 12 6 8 10
    3 4 4 6 5 1

    and the KMedoids with 'CosineSimilarity' produced this result:
    Delay class-0 Delay class-1 Delay class-2 Delay class-3 Delay class-4 Delay class-5
    cluster_1 1 2 3 4 5 6
    cluster_0 15 30 45 60 75 90
    cluster_2 5 2 6 3 4 5
    cluster_1 10 20 30 40 50 60
    cluster_1 5 10 15 20 25 30
    cluster_1 20 40 60 80 100 120
    cluster_1 11 22 33 44 55 66
    cluster_1 2 4 6 8 10 12
    cluster_1 100 200 300 400 500 600
    cluster_1 50 100 150 200 250 300
    cluster_2 10 4 12 6 8 10
    cluster_0 3 4 4 6 5 1

    I expected that the 2nd row with values
    15 30 45 60 75 90
    is clustered into the Cluster-1. Why did it fail to do so? Do you have some information? I presume that it is a bug with the implementation. Please confirm.
  • Options
    landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    first of all: Not everything not being as you wish it to be it a bug, ok? Sometimes reality simply doesn't match the algorithms. And sometime users simply don't know what they can expect, because they don't know the algorithms at all. So please before you state that something is a bug, you should at least know what you are doing.

    So why did this behavior occur?
    KMedoids as KMeans heavily depends on the random initialization. Might be, that your second example is selected as initial start medoid of the cluster. And might be that due to the fact of the very small data set it never gets out of this local optima. So go ahead and try with more data or change the local random seed. If all of this does not work, you should google for the cosine similarity and test if the outcome of the distance function would be different if you calculate it in your head using a sheet of paper and a calculator. You could compare it with rapidminer results using the ExampleSet2Similiarity operator. And if this results are unexplainable, then you could complain about a bug and I will immediately go into the code and fix it.

    Greetings,
      Sebastian

  • Options
    ruserruser Member Posts: 40 Maven
    understood! Thanks for the clear-cut information.
    From next time onwards, I will ensure that I go into all the details before presuming/claiming something as a bug.
Sign In or Register to comment.