K-means cluster with text data

joen841030 · November 2019

Hello experts!

I'd like to run k-means clustering on text data. My data is saved in one Excel file with a single column and one word per cell. I'm not sure whether I am doing it correctly (picture attached), because the output looks like the one below, with cluster 3 containing 4889 of the 4997 items??

Cluster 0: 20 items
Cluster 1: 18 items
Cluster 2: 20 items
Cluster 3: 4889 items
Cluster 4: 20 items
Cluster 5: 10 items
Cluster 6: 10 items
Cluster 7: 10 items
Total number of items: 4997

Image: https://us.v-cdn.net/6030995/uploads/editor/89/5rhn66xvmsgn.png

Also, I wonder whether it is possible to use something like silhouette scores to determine the ideal number of clusters? Thank you!!!
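For readers unfamiliar with the silhouette score mentioned above, here is a minimal sketch of how it is usually defined, written in plain Python rather than as a RapidMiner process. The function names and the toy points are made up for illustration. For each point, a is the mean distance to the other members of its own cluster and b is the lowest mean distance to the members of any other cluster; the score is (b - a) / max(a, b), which always lies between -1 and +1.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_scores(points, labels):
    """Return one silhouette coefficient per point.

    Requires at least two distinct cluster labels.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other points of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def mean(values):
    return sum(values) / len(values)


# Toy data (hypothetical): two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_scores(points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_scores(points, [0, 1, 0, 1])   # groups mixed up

print("good clustering:", round(mean(good), 3))
print("bad clustering: ", round(mean(bad), 3))
```

To pick a number of clusters this way, one would compute the mean silhouette for several values of k and prefer the k with the highest mean. Note that this brute-force version compares every pair of points, so it gets slow on large data sets.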

lionelderkrikor · November 2019

Hi @joen841030,

No, the "Avg. within centroid distance_cluster_i" value is not limited to the range -1 to +1.
It is a measure of distance (for example, the Euclidean distance for numeric attributes) between the points of cluster i and the centroid of cluster i, so it quantifies how "compact"/"dense" a cluster is. A metric like this normally lies between 0 and +infinity, but in RapidMiner it lies between -infinity and 0: the metric is multiplied by minus one because RapidMiner tries to maximize it.
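As a rough illustration of the description above, here is a small Python sketch (not RapidMiner code). It assumes plain Euclidean distance, as mentioned in the post; the exact formula RapidMiner applies may differ, so treat the linked resource as the reference for that. The function names and the toy points are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def avg_within_centroid_distance(points, labels):
    """Return (overall average, {cluster label: average}) of the
    distance between each point and the centroid of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    per_cluster = {}
    total = 0.0
    for label, members in clusters.items():
        center = centroid(members)
        distances = [euclidean(p, center) for p in members]
        per_cluster[label] = sum(distances) / len(distances)
        total += sum(distances)
    return total / len(points), per_cluster


# Toy data (hypothetical): cluster 0 is tighter than cluster 1.
points = [(0, 0), (0, 2), (10, 0), (10, 4)]
labels = [0, 0, 1, 1]

overall, per_cluster = avg_within_centroid_distance(points, labels)
print(overall)      # 1.5
print(per_cluster)  # {0: 1.0, 1: 2.0}

# Sign flip as described in the post: the value is negated so that
# "larger is better" when the metric is maximized.
reported = -overall
print(reported)     # -1.5
```

The values are never bounded by 1: they scale with the units and the dimensionality of the data, which is why figures in the hundreds are possible.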

Here is a resource about the average within-cluster distance:

https://rapidminernotes.blogspot.com/2011/04/how-average-within-cluster-distance-is.html

Hope this helps,

Regards,

Lionel

lionelderkrikor · November 2019

Hi @joen841030,

You can find here a method for finding the optimal number of clusters k, based on calculating the Average within Centroid Distance as a function of k (the number of clusters):

https://community.rapidminer.com/discussion/comment/61654#Comment_61654
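The general idea of scanning k can be sketched in Python as follows. This is only an illustration of the common "elbow" heuristic and is not necessarily the exact procedure from the linked comment. It uses a deliberately simple k-means with a deterministic farthest-first initialization (a choice made here for reproducibility, not what RapidMiner does); all function names and the toy data are hypothetical.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def farthest_first_init(points, k):
    """Start from the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns (labels, centroids)."""
    centroids = farthest_first_init(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(mean_point(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids


def scan_k(points, k_values):
    """Average point-to-centroid distance for each candidate k."""
    results = {}
    for k in k_values:
        labels, centroids = kmeans(points, k)
        total = sum(dist(p, centroids[l]) for p, l in zip(points, labels))
        results[k] = total / len(points)
    return results


# Toy data (hypothetical): three well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

results = scan_k(points, [1, 2, 3, 4])
for k, value in results.items():
    print(k, round(value, 3))
```

On this toy data the average distance drops sharply up to k = 3 and only marginally after that; that bend is the "elbow" one looks for when plotting the metric against k.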

Hope this helps,

Regards,

Lionel

joen841030 · November 2019

Hi @lionelderkrikor,
thanks for the reply! Hmm... but now I got the results below. They don't look correct to me though...

PerformanceVector:
Avg. within centroid distance: -385.889
Avg. within centroid distance_cluster_0: -393.196
Avg. within centroid distance_cluster_1: -351.386
Avg. within centroid distance_cluster_2: -410.075
Avg. within centroid distance_cluster_3: -384.852
Avg. within centroid distance_cluster_4: -403.787
Avg. within centroid distance_cluster_5: -371.171
Avg. within centroid distance_cluster_6: -366.001
Avg. within centroid distance_cluster_7: -402.358
Davies Bouldin: -0.500

I have now also included "Nominal to Numerical"... am I actually doing it correctly? I was just following different online tutorials and trying to figure out how to do it...

Thanksss so much in advance!

Image: https://us.v-cdn.net/6030995/uploads/editor/70/5aquo37sbqok.png

lionelderkrikor · November 2019

Hi @joen841030,

Why do you think that these results are incorrect?

Regards,

Lionel

joen841030 · November 2019

Hi @lionelderkrikor,
Hmm, because I presumed the value should be somewhere between -1 and +1? Sorry that I don't understand those figures... It would be nice if you could kindly explain them. Thanks!

Altair RapidMiner Community
