Why does X-means clustering always give k-min as the ideal number of clusters?

fbrandse Member Posts: 4 Contributor I
edited February 2020 in Help
Hello everybody,

I am learning about clustering in RapidMiner. K-means clustering works fine, but you must know the number of clusters you want in advance. Therefore I tried X-means, but it always returns the minimum value of k as the ideal number of clusters. That can't be right.
As a simple test I entered the following 20 rows:

2,3     4
2       3
1,5     4
2       4
1,5     3,5
2       3,5
12      13
11      12,5
10      14
11      14
12      14
11,5    14,3
10      2
10,2    2,2
9,5     2,4
2       14
2,2     14,2
1,8     13,8
1,9     14,3
11,9    13,9

When you plot those points, it is obvious that they form 4 separate clusters. So why does X-means not find 4 as the ideal number of clusters? If I set k_min to 2 it says 2, if I set k_min to 3 it says 3...
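
For reference, here is the same experiment outside RapidMiner: a minimal sketch (assuming Python with scikit-learn; the variable names are illustrative) that runs plain k-means on these 20 points for several values of k and scores each result with the silhouette coefficient. On points as well separated as these, k=4 should come out clearly ahead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# The 20 rows from the question, with decimal commas written as dots.
X = np.array([
    [2.3, 4], [2, 3], [1.5, 4], [2, 4], [1.5, 3.5], [2, 3.5],
    [12, 13], [11, 12.5], [10, 14], [11, 14], [12, 14], [11.5, 14.3],
    [10, 2], [10.2, 2.2], [9.5, 2.4],
    [2, 14], [2.2, 14.2], [1.8, 13.8], [1.9, 14.3], [11.9, 13.9],
])

# Run plain k-means for k = 2..6 and report the silhouette score for each k.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```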


Answers

  • PaulMSimpson Member Posts: 8 Contributor II
    Hello fbrandse,

    I tried X-means, loading in your data set (after converting the decimal commas to dots, as I'm in the US), and I agree with you: the operator did not find the 4 clusters that are obvious in a scatter plot, only the number of clusters that you set k min to. (The image below shows only 2 clusters, which is what happened when I set k min to 2.) Choosing different types of clustering or numerical measures was no help either. I looked at the original paper this operator is based on, and my understanding is that it should have kept splitting until it found 4 clusters. So, good question - I'm stumped on this one, and I await a good answer along with you. (Note: I have attached an Excel file of your values, as two columns, with "," replaced by "." for the decimal points.)


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Hi @fbrandse,
    to add to @PaulMSimpson 's comment: it looks like you did not normalize before applying X-Means? Please remember that distance-based algorithms always need normalization (see the sketch after this post).

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
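
To make the Normalize step concrete outside RapidMiner, here is a minimal sketch (assuming Python with scikit-learn; the toy data and names are made up for illustration) of a z-transformation applied before a distance-based clusterer, so that no single attribute dominates the Euclidean distance:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data: the first attribute has a much larger numeric spread than the
# second, so without scaling it would dominate the Euclidean distance.
X = np.array([
    [310.0, 0.10],
    [305.0, 0.92],
    [298.0, 0.11],
    [303.0, 0.90],
])

# Z-transformation: every attribute is rescaled to mean 0 and standard
# deviation 1 before the distance-based clusterer sees it.
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)
print(labels)
```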
  • fbrandse Member Posts: 4 Contributor I
    PaulMSimpson: thank you for spending some time with me on this problem.
    mschmitz: I tried your suggestion of normalizing before applying X-means, but it still does not work.
    Here is the flow diagram I use in RapidMiner. Test contains the 20 rows I posted above. I apply the Normalize operator to both attributes. k_min is set to 2, and X-means still gives 2 clusters instead of 4. What am I missing?



  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Hi @fbrandse,

    X-Means uses a heuristic to determine k for k-Means. If you look at your data, one may argue that in your case either 2 or 4 clusters are "correct". X-Means decided on 2, which feels okayish.

    That is the point of heuristics: they often work, but sometimes they go "wrong". That's why people still often use plain k-Means and not only X-Means. (A rough sketch of the criterion behind the heuristic follows after this post.)

    Cheers,
    Martin 
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
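
For a rough idea of the heuristic Martin refers to: the original X-means paper (Pelleg & Moore) decides whether to keep splitting clusters by comparing BIC scores of competing models. The sketch below is not RapidMiner's implementation; it only illustrates that kind of BIC comparison on the 20 points from the question, using scikit-learn's GaussianMixture as a stand-in scorer (in scikit-learn's convention, lower BIC is better). On well-separated groups like these, k=4 should score best; when the BIC gap between candidate models is small, a greedy splitting scheme can defensibly stop at a smaller k.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# The 20 points from the original question (decimal commas written as dots).
X = np.array([
    [2.3, 4], [2, 3], [1.5, 4], [2, 4], [1.5, 3.5], [2, 3.5],
    [12, 13], [11, 12.5], [10, 14], [11, 14], [12, 14], [11.5, 14.3],
    [10, 2], [10.2, 2.2], [9.5, 2.4],
    [2, 14], [2.2, 14.2], [1.8, 13.8], [1.9, 14.3], [11.9, 13.9],
])

# Compare BIC for several candidate values of k (lower is better here).
for k in range(2, 7):
    gm = GaussianMixture(n_components=k, covariance_type="spherical",
                         random_state=42).fit(X)
    print(k, round(gm.bic(X), 1))
```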
  • fbrandse Member Posts: 4 Contributor I
    mschmitz: OK, thank you Martin. I looked up the definition of "heuristic" in the dictionary, and apparently it means something like "problem solving by experimental or trial-and-error methods". I thought the X-means operator worked with a well-defined algorithm that produces unambiguous results, but clearly that is not the case. It is good to keep that in mind when using this operator.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    X-Means is definitely a well-rounded operator. Internally, the algorithm uses a method to figure out the right k. This method is very often correct, so it is definitely a good idea to try it. That's usually the point of heuristics: they often work well and you should start with them, but you should also be aware that they sometimes "go off". Heuristics are also behind many operator defaults: 100 trees in a random forest are often a good choice, but sometimes you need more. Same story.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Muhammed_Fatih_ Member Posts: 93 Maven
    Hi @mschmitz,

    I have a question connected to this thread:

    I tried X-Means on my data with the interval k_min=2 to k_max=60, as well as with k_min=20 and k_max=60. X-Means returns the minimal k every time (k=2 in the first case, k=20 in the second). Is it normal that X-Means always picks the minimal number of k?


    The input data consists of TF-IDF values.

    Best regards!   
  • mantanz Member Posts: 8 Contributor II
    @Muhammed_Fatih_

    The situation you describe can happen if you don't have many examples for clustering, or if they are simply too similar to one another, so X-means always resorts to the simplest clustering scheme.
    In such cases it is better to normalize the data beforehand. This ensures that all attributes are on the same scale before the algorithm is applied.
    For example, if attribute1 has the range 0-100 and attribute2 has the range 0-1, attribute1 gets more weight than attribute2 in the distance calculation. If you normalize, both attributes are converted to the 0-1 scale (see the sketch after this post).

    RapidMiner operator to use: "Normalize"
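
Outside RapidMiner, the 0-1 rescaling mantanz describes looks roughly like this (a sketch assuming Python with scikit-learn; the toy values simply mirror the 0-100 vs. 0-1 example above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# attribute1 spans roughly 0-100, attribute2 roughly 0-1,
# mirroring the example in the post above.
X = np.array([
    [10.0, 0.1],
    [50.0, 0.4],
    [80.0, 0.7],
    [100.0, 1.0],
])

# Range transformation: both attributes are rescaled to [0, 1],
# so neither dominates the distance calculation.
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
print(X_scaled)
```

This mirrors the 0-1 range rescaling that the Normalize operator provides; apply it before X-Means so the TF-IDF attributes contribute on comparable scales.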