
# K-means cluster with text data

Member Posts: 8 Contributor II
edited November 2019 in Help
Hello experts!

I'd like to run k-means clustering on text data. My data is saved in a single Excel file with one column and one word in each cell. I'm not sure whether I'm doing it correctly (picture attached), because the output looks like the following, with cluster 3 containing 4889 items:

Cluster 0: 20 items
Cluster 1: 18 items
Cluster 2: 20 items
Cluster 3: 4889 items
Cluster 4: 20 items
Cluster 5: 10 items
Cluster 6: 10 items
Cluster 7: 10 items
Total number of items: 4997

Also, is it possible to use something like silhouette scores to determine the ideal number of clusters? Thank you!
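
For readers unfamiliar with the silhouette score mentioned here, a minimal pure-Python sketch follows. It assumes Euclidean distance and numeric feature vectors; the function name `silhouette_score`, the toy points and both labelings are made up for illustration and are not taken from the poster's data or from RapidMiner.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data: two well-separated groups of three points each
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # labeling that matches the groups
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # labeling that mixes the groups
print(f"good labeling: {good:.3f}, mixed labeling: {bad:.3f}")
```

To pick a number of clusters this way, one would run the clustering for several values of k and compare the mean silhouette of each result; higher values indicate tighter, better-separated clusters.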

Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @joen841030,

Here is a method for finding the optimal number of clusters k, based on calculating the average within-centroid distance as a function of k (the number of clusters):

https://community.rapidminer.com/discussion/comment/61654#Comment_61654
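
The general idea of comparing the average within-centroid distance across values of k can be sketched in plain Python. This is only an illustration under stated assumptions, not the process from the linked comment: the `kmeans` helper, its deterministic farthest-point initialisation, the use of plain (unsquared) Euclidean distance, and the toy data are all choices made here, and RapidMiner's exact definition of the measure may differ.

```python
import math

def kmeans(points, k, iterations=50):
    """Basic Lloyd's algorithm with a deterministic farthest-point start."""
    centroids = [points[0]]
    while len(centroids) < k:
        # next centroid: the point farthest from all centroids chosen so far
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return labels, centroids

def avg_within_centroid_distance(points, labels, centroids):
    """Mean distance from each point to the centroid of its assigned cluster."""
    return sum(math.dist(p, centroids[lab]) for p, lab in zip(points, labels)) / len(points)

# Toy data: three compact blobs of four points each
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0), (21.0, 1.0)]

scores = {}
for k in range(1, 6):
    labels, centroids = kmeans(points, k)
    scores[k] = avg_within_centroid_distance(points, labels, centroids)
    print(k, round(scores[k], 3))
```

On data like this the value drops sharply until k reaches the true number of groups and then flattens; that "elbow" is what one looks for when plotting the measure against k.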

Hope this helps,

Regards,

Lionel
Member Posts: 8 Contributor II
Hi @lionelderkrikor
Thanks for the reply! Hmm... but now I get the results below, and they don't look correct to me:

PerformanceVector:
Avg. within centroid distance: -385.889
Avg. within centroid distance_cluster_0: -393.196
Avg. within centroid distance_cluster_1: -351.386
Avg. within centroid distance_cluster_2: -410.075
Avg. within centroid distance_cluster_3: -384.852
Avg. within centroid distance_cluster_4: -403.787
Avg. within centroid distance_cluster_5: -371.171
Avg. within centroid distance_cluster_6: -366.001
Avg. within centroid distance_cluster_7: -402.358
Davies Bouldin: -0.500

I have now also included "Nominal to Numerical"... am I actually doing it correctly? I was just following different online tutorials and trying to figure out how to do it.
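
One thing that may be worth checking with single-word data: if the nominal-to-numerical conversion uses dummy (one-hot) coding, each distinct word becomes its own 0/1 column, so any two different words end up exactly the same distance apart and k-means has little structure to work with. The sketch below illustrates that effect in Python. The helper `one_hot_encode` and the word list are hypothetical, and whether this matches the coding type configured in the actual process is an assumption.

```python
import math

def one_hot_encode(words):
    """Map each word to a 0/1 vector with one column per distinct word."""
    vocabulary = sorted(set(words))
    column = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for word in words:
        vector = [0.0] * len(vocabulary)
        vector[column[word]] = 1.0
        vectors.append(vector)
    return vocabulary, vectors

words = ["apple", "banana", "cherry", "apple"]  # made-up example rows
vocabulary, vectors = one_hot_encode(words)

# Pairwise Euclidean distances between the encoded rows
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        d = math.dist(vectors[i], vectors[j])
        print(f"{words[i]} vs {words[j]}: {d:.3f}")
```

Identical words are at distance 0 and every pair of different words is at distance sqrt(2), regardless of how similar the words are in meaning or spelling.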

Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @joen841030,

Why do you think that these results are incorrect?

Regards,

Lionel
Member Posts: 8 Contributor II
Hi @lionelderkrikor,
Hmm, because I presumed the value should be somewhere between -1 and +1? Sorry, I don't understand those figures. It would be nice if you could explain them. Thanks!
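
On the question of range: a silhouette coefficient is bounded between -1 and +1 by construction, whereas an average distance to a centroid is expressed in the units of the data and has no fixed range. The small Python sketch below shows the second point using made-up coordinates. The final sign flip is an assumption about how some tools report distance-based measures so that "larger is better" during optimization; whether RapidMiner does this by default should be checked in the operator's documentation.

```python
import math

def avg_distance_to_centroid(points):
    """Mean Euclidean distance from each point to the cluster centroid."""
    centroid = tuple(sum(v) / len(points) for v in zip(*points))
    return sum(math.dist(p, centroid) for p in points) / len(points)

cluster = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]  # hypothetical cluster
scaled = [(100 * x, 100 * y) for x, y in cluster]           # same shape, larger units

small = avg_distance_to_centroid(cluster)
large = avg_distance_to_centroid(scaled)
reported = -large  # hypothetical sign-flipped reporting, see the note above

print(f"original units: {small:.3f}")
print(f"scaled by 100:  {large:.3f}")
print(f"sign-flipped:   {reported:.3f}")
```

The same cluster shape gives a value 100 times larger after rescaling, so a magnitude in the hundreds says something about the scale and dimensionality of the data rather than indicating an error by itself.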