K-means cluster with text data
joen841030
Member Posts: 8 Contributor I
Hello experts!
I'd like to run k-means clustering on text data. My data is saved in one Excel file that has only one column, with one word in each cell. I'm not sure whether I am doing it correctly (picture attached), because the output looks like the following, with cluster 3 holding 4889 items:
Cluster 0: 20 items
Cluster 1: 18 items
Cluster 2: 20 items
Cluster 3: 4889 items
Cluster 4: 20 items
Cluster 5: 10 items
Cluster 6: 10 items
Cluster 7: 10 items
Total number of items: 4997
Also, I wonder whether it is possible to use something like silhouette scores to determine the ideal number of clusters? Thank you!
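For readers unfamiliar with the silhouette score mentioned in the question, here is a minimal sketch of how it is computed. It is plain Python on a tiny hypothetical 2-D data set, not a RapidMiner process, and the function name `silhouette_score` and the sample points are assumptions made for illustration only. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the point's score is `(b - a) / max(a, b)`, which always lies between -1 and +1.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # common convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical data: two obvious groups, labeled well and labeled badly.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

To choose k with this measure, one would cluster the data for several values of k and keep the k with the highest mean silhouette. Higher is better, and values near or below 0 suggest overlapping or misassigned clusters.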
Best Answer
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @joen841030,
No, the average within centroid distance_cluster_i is not limited to the range -1 to +1.
It is a measure of distance (for example, the Euclidean distance for numeric attributes) between the points of cluster i and the centroid of cluster i, so it quantifies how "compact" or "dense" a cluster is. As a distance, this metric lies between 0 and +infinity. In RapidMiner, however, it lies between -infinity and 0, because the value is multiplied by minus one so that RapidMiner can maximize it.
Here is a resource about the average within cluster distance:
https://rapidminernotes.blogspot.com/2011/04/how-average-within-cluster-distance-is.html
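To make the sign convention concrete, here is a minimal Python sketch (not RapidMiner code). It uses the plain Euclidean distance described above; the exact formula RapidMiner applies may differ, so see the linked resource for details. The function names and the two sample clusters are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def avg_within_centroid_distance(points):
    """Mean Euclidean distance from each point to its cluster centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

# Two hypothetical clusters: one tight, one spread out.
tight = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
wide = [(0.0, 0.0), (0.0, 20.0), (20.0, 0.0), (20.0, 20.0)]

for name, cluster in (("tight", tight), ("wide", wide)):
    d = avg_within_centroid_distance(cluster)
    # The second number is the negated value, mirroring the convention above.
    print(name, round(d, 3), round(-d, 3))
# prints:
# tight 1.414 -1.414
# wide 14.142 -14.142
```

The value scales with the units of the data, so figures in the hundreds are not a sign of an error by themselves. After negation, the value closer to 0 belongs to the more compact cluster.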
Hope this helps,
Regards,
Lionel
Answers
Here you can find a method for determining the optimal number of clusters k, based on calculating the average within centroid distance as a function of k (the number of clusters):
https://community.rapidminer.com/discussion/comment/61654#Comment_61654
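The idea behind that method can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the linked RapidMiner process: the data are three synthetic 2-D blobs, the distance is plain Euclidean, and the deterministic "farthest-first" seeding is a choice made here so the sketch is reproducible. All function names are hypothetical.

```python
import math
import random

def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns the final list of centroids."""
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centroids[j])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return centroids

def avg_within_centroid_distance(points, centroids):
    """Mean distance from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)

# Synthetic data: three well-separated blobs of 30 points each.
rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in centers for _ in range(30)]

scores = {k: avg_within_centroid_distance(points, kmeans(points, k))
          for k in range(1, 7)}
for k, s in scores.items():
    print(k, round(s, 3))
```

On data like this, the distance drops sharply up to k = 3 and then flattens. The k at that "elbow" is the usual candidate for the number of clusters. RapidMiner reports the same quantity with a negative sign, so there the curve rises toward 0 instead.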
Hope this helps,
Regards,
Lionel
Thanks for the reply! Hmm... but now I get the results below, and they don't look correct to me:
PerformanceVector:
Avg. within centroid distance: -385.889
Avg. within centroid distance_cluster_0: -393.196
Avg. within centroid distance_cluster_1: -351.386
Avg. within centroid distance_cluster_2: -410.075
Avg. within centroid distance_cluster_3: -384.852
Avg. within centroid distance_cluster_4: -403.787
Avg. within centroid distance_cluster_5: -371.171
Avg. within centroid distance_cluster_6: -366.001
Avg. within centroid distance_cluster_7: -402.358
Davies Bouldin: -0.500
I have now also included "Nominal to Numerical". Am I actually doing it correctly? I was just following different online tutorials and trying to figure out how to do it...
Thanks so much in advance!
Why do you think that these results are incorrect?
Regards,
Lionel
Hmm, because I presumed the value should be somewhere between -1 and +1? Sorry that I don't understand those figures... It would be nice if you could kindly explain them. Thanks!