Why does Xmeans clustering always give kmin as the ideal number of clusters?
Hello everybody,
I am learning about clustering in RapidMiner. Kmeans clustering works fine, but you must know the number of clusters you want in advance. Therefore I tried Xmeans, but that always gives the minimum value of k as the ideal number of clusters. That can't be right.
As a simple test I entered the following 20 rows:
2,3 4
2 3
1,5 4
2 4
1,5 3,5
2 3,5
12 13
11 12,5
10 14
11 14
12 14
11,5 14,3
10 2
10,2 2,2
9,5 2,4
2 14
2,2 14,2
1,8 13,8
1,9 14,3
11,9 13,9
When you plot those points it is obvious that they form 4 separate clusters. Then why does Xmeans not find 4 as the ideal number of clusters? If I set k_min to 2 it says 2, if I set k_min to 3 it says 3...
Best Answer

PaulMSimpson wrote: Thank you, Martin. From this, we learn that it still pays to take time to experiment. At least, with RapidMiner, we have quick access to a visualization to tell us we need to go back and make some adjustments.
Answers
I tried Xmeans on your data set (after converting the decimal commas to dots, as I'm in the US), and I agree with you: the operator did not find the 4 clusters that are obvious in a scatter plot, only as many clusters as k_min was set to (the image below shows only 2 clusters, which is what happened when I set k_min to 2). Choosing different clustering types or numerical measures did not help either. I looked at the original paper on which the operator is based, and I get the idea that it should have split further to find 4 clusters. So, good question. I'm stumped on this one, and I await a good answer along with you. (Note: I have attached an Excel file of your values as two columns, with "," replaced by "." as the decimal separator.)
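For anyone who wants to reproduce the decimal-comma conversion mentioned above outside of Excel, here is a minimal Python sketch. The names `RAW_ROWS` and `parse_rows` are made up for illustration; the rows are the 20 rows from the question.

```python
# The 20 rows from the question, with decimal commas as originally posted.
RAW_ROWS = """\
2,3 4
2 3
1,5 4
2 4
1,5 3,5
2 3,5
12 13
11 12,5
10 14
11 14
12 14
11,5 14,3
10 2
10,2 2,2
9,5 2,4
2 14
2,2 14,2
1,8 13,8
1,9 14,3
11,9 13,9
"""


def parse_rows(text):
    """Parse whitespace-separated value pairs that use ',' as the decimal mark."""
    points = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        x_str, y_str = line.split()
        # Replace the decimal comma with a dot so float() accepts the value.
        points.append((float(x_str.replace(",", ".")),
                       float(y_str.replace(",", "."))))
    return points


points = parse_rows(RAW_ROWS)
```

The resulting list of (x, y) tuples can be written to a CSV file or plotted to check the four groups visually.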
To add to @PaulMSimpson's comment: it looks like you did not normalize before applying XMeans. Please remember that distance-based algorithms always need normalisation.
Cheers,
Martin
Dortmund, Germany
mschmitz: I tried your suggestion of normalizing before applying Xmeans, but it still does not work.
Here is the flow diagram I use in RapidMiner. Test contains the 20 rows I posted above. I apply the Normalize operator to both attributes. k_min is set to 2, and Xmeans still returns 2 clusters instead of 4. What am I missing?
XMeans uses a heuristic to determine k for kMeans. If you look at your data, one may argue that either 2 or 4 clusters are "correct" in your case. XMeans decided on 2, which feels okay-ish.
That is the nature of heuristics: they often work but sometimes go "wrong". That's why people often use kMeans directly rather than relying only on XMeans.
Cheers,
Martin
Dortmund, Germany
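To make the splitting heuristic a little more concrete: X-means, as described in the original paper mentioned earlier in the thread, keeps a split only if a model-selection score (BIC) improves. The sketch below computes one spherical-Gaussian form of such a score in plain Python for ten of the posted points. It is not RapidMiner's code. The function name `bic`, the exact constants in the score, and the assumption that a k=2 run placed the lower-left and upper-left groups in a single cluster are all illustrative.

```python
import math


def bic(clusters):
    """BIC-style score for a hard clustering under one shared spherical
    Gaussian variance (higher is better). Illustrative form only."""
    k = len(clusters)                      # number of centroids
    r = sum(len(c) for c in clusters)      # total number of points
    m = len(clusters[0][0])                # number of dimensions

    # Within-cluster sum of squared distances to each centroid.
    ss = 0.0
    for c in clusters:
        centroid = [sum(p[d] for p in c) / len(c) for d in range(m)]
        ss += sum((p[d] - centroid[d]) ** 2 for p in c for d in range(m))

    variance = ss / (r - k)  # pooled variance estimate; must be > 0

    # Log-likelihood: cluster-membership term plus Gaussian density term.
    log_lik = (
        sum(len(c) * math.log(len(c) / r) for c in clusters)
        - 0.5 * r * m * math.log(2 * math.pi * variance)
        - 0.5 * (r - k)
    )
    # Free parameters: k-1 cluster weights, m*k coordinates, 1 variance.
    n_params = (k - 1) + m * k + 1
    return log_lik - 0.5 * n_params * math.log(r)


# Ten of the posted points (decimal commas converted to dots).
lower_left = [(2.3, 4.0), (2.0, 3.0), (1.5, 4.0),
              (2.0, 4.0), (1.5, 3.5), (2.0, 3.5)]
upper_left = [(2.0, 14.0), (2.2, 14.2), (1.8, 13.8), (1.9, 14.3)]

score_unsplit = bic([lower_left + upper_left])  # one centroid for all ten
score_split = bic([lower_left, upper_left])     # two centroids
```

Under these assumptions `score_split` comes out higher than `score_unsplit`, so a score of this kind would favour splitting that cluster. The sketch cannot say why the operator itself stopped at k_min.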
I have a question in connection with this thread:
I tried XMeans on my data with kmin=2 and kmax=60, and again with kmin=20 and kmax=60. Each time the XMeans model returns the minimal k (k=2 in the first run and k=20 in the second). Is it normal that XMeans always picks the minimal k?
The input data consists of TF-IDF values.
Best regards!
The situation you describe can happen if you have too few examples for clustering, or if they are too similar to one another, so that Xmeans always falls back to the simplest clustering scheme.
In such a case it is better to normalize the data beforehand. This ensures that all attributes are on the same scale before the algorithm is applied.
For example, say attribute1 has a range of 0-100 and attribute2 has a range of 0-1. In this case attribute1 gets more weight than attribute2 in the distance calculation. If you normalize, both attributes are converted to a 0-1 scale.
RapidMiner operator to use: "Normalize"
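As a rough illustration of what rescaling to a 0-1 range does, here is a small Python sketch. The function `range_normalize` and the sample values are hypothetical and only mimic the idea; they are not the Normalize operator's implementation.

```python
def range_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min
    and the largest maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no distance information.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min)
            for v in values]


# Hypothetical attributes on very different scales.
attribute1 = [0.0, 25.0, 50.0, 100.0]   # range 0-100
attribute2 = [0.0, 0.2, 0.9, 1.0]       # range 0-1

scaled1 = range_normalize(attribute1)
scaled2 = range_normalize(attribute2)
```

After rescaling, a difference of 50 in attribute1 and a difference of 0.5 in attribute2 contribute equally to a Euclidean distance, instead of attribute1 dominating.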