"Kmeans clustering"
I have data in the attached CSV file and have to group it with k-means. I have to make a graph and say how many groups are formed. We also have to comment on the performance of k-means and suggest a better solution. Any ideas?
Answers
It is difficult to visualize your dataset if you work in a high-dimensional space (number of attributes = N).
But you can always plot attribute i against attribute j, color the points by the class of the label, and see whether some groups appear...
For example, here are 2 attributes of the Iris dataset (we see that there are 3 groups):
If you don't know the number of groups (number of clusters) a priori, you can try the following 2 models:
- DBSCAN
- X-Means
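To illustrate why a density-based method needs no preset number of clusters, here is a minimal pure-Python DBSCAN sketch. The data is synthetic (a small inner group plus an outer ring, standing in for the attached CSV, which is not shown here), and the `ring` helper and the values `eps=1.5` and `min_pts=3` are assumptions picked for this toy data, not recommended settings.

```python
import math

def ring(radius, n):
    """n evenly spaced points on a circle around the origin (toy data)."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    labels = [None] * n  # None = not visited yet

    def neighbors(i):
        # A point counts as its own neighbor, as in the usual definition.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned, nothing to expand
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nb)
    return labels

# Hypothetical stand-in for the CSV: 8 inner points, 24 ring points.
points = ring(0.5, 8) + ring(5.0, 24)
labels = dbscan(points, eps=1.5, min_pts=3)
print("clusters found:", len(set(labels) - {-1}))
```

On this toy data the inner group and the ring come out as two separate clusters, because neighboring ring points are closer than `eps` to each other while the gap between ring and center is larger than `eps`.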
Hope it helps,
Regards,
Lionel
If you really have no idea where to start, you might want to try the X-Means operator, which uses the k-means approach, tries many different values for k, and chooses the one that best satisfies some statistical measures of fit. At least you could use that as a starting point.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
1. I played with your data and humbly propose an example presentation using the k-Means algorithm:
When you plot your data, you can conclude "visually" that there are 2 clusters (the center and the circle):
To find a "better solution", you first have to define a performance metric for your clusters. We can take the Davies Bouldin index, which measures the "quality" of your clusters. This is an internal evaluation scheme, where the validation of how well the clustering has been done is made using quantities and features inherent to the dataset.
In this first case (k = 2), we obtain Davies Bouldin = -0,836
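For readers who want to see what this metric computes, here is a minimal sketch of the textbook Davies Bouldin index in plain Python. In this textbook form the index is non-negative and lower is better; the values quoted in this thread come from RapidMiner and may follow a different sign convention. The function names and the four-point example are assumptions made for illustration.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def davies_bouldin(points, labels):
    """Textbook Davies Bouldin index: non-negative, lower is better."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    cents = {lab: centroid(pts) for lab, pts in clusters.items()}
    # Scatter: mean distance of a cluster's points to its own centroid.
    scatter = {lab: sum(math.dist(p, cents[lab]) for p in pts) / len(pts)
               for lab, pts in clusters.items()}
    total = 0.0
    for i in clusters:
        # Worst-case "similarity" of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in clusters if j != i)
    return total / len(clusters)

# Hypothetical example: two tight pairs of points far apart.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = davies_bouldin(pts, [0, 0, 1, 1])  # pairs grouped sensibly
bad = davies_bouldin(pts, [0, 1, 0, 1])   # pairs mixed across clusters
print(good, bad)
```

The sensible grouping gets a far lower score than the mixed one, which is the behavior the metric is meant to capture: compact clusters that lie far from each other.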
Now, to find a better solution, you can look for another "k". You can find this "better value" by using the Optimize Parameters operator (with a search range for k of [2,8]):
RM concludes k = 6 and Davies Bouldin = 0,570 => That's much better...! :
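The search over k can be sketched outside RapidMiner as a simple loop: run k-means for every k in [2,8], score each result with the Davies Bouldin index, and keep the k with the lowest score. This is only an illustration of the idea, not what the Optimize Parameters operator does internally. The farthest-first seeding, the function names and the three-blob toy data are all assumptions; the index is used in its textbook form (lower is better).

```python
import math

def farthest_first(points, k):
    """Deterministic seeding (an assumption): start at the first point,
    then keep adding the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def davies_bouldin(points, labels):
    """Textbook Davies Bouldin index: non-negative, lower is better."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    cents = {l: tuple(sum(v) / len(ps) for v in zip(*ps)) for l, ps in clusters.items()}
    scatter = {l: sum(math.dist(p, cents[l]) for p in ps) / len(ps) for l, ps in clusters.items()}
    return sum(max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                   for j in clusters if j != i) for i in clusters) / len(clusters)

def sweep_k(points, k_values):
    """Score every k and return the k with the lowest index plus all scores."""
    scores = {k: davies_bouldin(points, kmeans(points, k)) for k in k_values}
    return min(scores, key=scores.get), scores

# Hypothetical toy data: three small square blobs.
blobs = [(0, 0), (10, 0), (0, 12)]
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]
points = [(cx + dx, cy + dy) for cx, cy in blobs for dx, dy in offsets]
best_k, scores = sweep_k(points, range(2, 9))
for k in sorted(scores):
    print(k, round(scores[k], 3))
print("lowest index at k =", best_k)
```

Whatever k such a sweep selects, it is worth checking it against the plot, since an internal index alone does not guarantee that the clusters are meaningful.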
Now, to go further, a "better solution" may also mean "better data preparation".
We can, for example, generate the attributes X and Y with:
X = x*x
Y = y*y
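One way to see why squaring can help with a "center and circle" layout: in the new space, X + Y equals the squared distance from the origin, so all points of a ring centered at the origin fall on one straight line, far away from the points near the center. The sketch below shows this on synthetic data; the `ring` helper, the radii and the assumption that the ring is centered at the origin are illustrative, since the attached CSV is not shown here.

```python
import math

def ring(radius, n):
    """n evenly spaced points on a circle around the origin (toy data)."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

def square_features(points):
    """Generate the new attributes X = x*x and Y = y*y."""
    return [(x * x, y * y) for x, y in points]

center = ring(0.5, 8)    # hypothetical inner group
circle = ring(5.0, 24)   # hypothetical outer ring

# In the squared space, X + Y is the squared radius of each point.
inner_r2 = [X + Y for X, Y in square_features(center)]
outer_r2 = [X + Y for X, Y in square_features(circle)]
print("inner group, max X + Y:", round(max(inner_r2), 3))
print("outer ring,  min X + Y:", round(min(outer_r2), 3))
```

In the original coordinates no straight line separates the center from the surrounding ring, whereas in the squared space a single threshold on X + Y does.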
We can rerun the optimization process with these new features and we obtain:
k = 3 and Davies Bouldin = 0,457 => That's better...! :
and if we plot these new features, we obtain:
Hope these elements help...
The process :
2. I take this opportunity to report that the DBSCAN bug inside an Optimize Parameters loop still raises an error.
I described this bug one year ago in this thread:
https://community.rapidminer.com/discussion/45555/normal-bug-log-all-criteria-optimization-of-cluster-model
The process :
Lionel
As for the DBSCAN issue, I have pinged @jczogalla in hopes that he can provide an update.
Scott
Thank you lionelderkrikor for your answer! It is very helpful to me!