K-means cluster with text data

joen841030 Member Posts: 8 Contributor I
edited November 2019 in Help
Hello experts! 

I'd like to run k-means clustering on text data. My data is stored in a single Excel file with only one column and one word in each cell. I'm not sure whether I'm doing it correctly (picture attached), because the output looks like the following, with cluster 3 containing 4889 items:

Cluster 0: 20 items
Cluster 1: 18 items
Cluster 2: 20 items
Cluster 3: 4889 items
Cluster 4: 20 items
Cluster 5: 10 items
Cluster 6: 10 items
Cluster 7: 10 items
Total number of items: 4997



Also, is it possible to use something like silhouette scores to determine the ideal number of clusters? Thank you!!!
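
On the silhouette question: the silhouette coefficient of an item compares its mean distance a to the other members of its own cluster with its mean distance b to the members of the nearest other cluster, s = (b - a) / max(a, b). Every score therefore lies between -1 and +1, and a higher average suggests a better-separated clustering. Below is a minimal Python sketch of the idea, outside RapidMiner. The toy 1-D points, the two candidate labelings, and the function name are all made up for illustration and are not taken from the data in this thread.

```python
from math import fsum

def silhouette_score(points, labels):
    """Mean silhouette coefficient for 1-D points (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the same cluster
        a = fsum(abs(p - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            fsum(abs(p - q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return fsum(scores) / len(scores)

# Hypothetical toy data: three visually obvious groups on a line.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
labels_k2 = [0, 0, 0, 1, 1, 1, 1, 1, 1]  # candidate clustering with k = 2
labels_k3 = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # candidate clustering with k = 3

s2 = silhouette_score(points, labels_k2)
s3 = silhouette_score(points, labels_k3)
print(f"k=2: {s2:.3f}  k=3: {s3:.3f}")
```

On this toy data the three-cluster labeling gets the higher average score, so k = 3 would be the preferred choice. The same comparison can be repeated over any range of k.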

Answers

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @joen841030,

    Here is a method for finding the optimal number of clusters k, based on how the average within-centroid distance changes with k (the number of clusters):

    https://community.rapidminer.com/discussion/comment/61654#Comment_61654

    Hope this helps,

    Regards,

    Lionel
  • joen841030 Member Posts: 8 Contributor I
    Hi @lionelderkrikor,
    Thanks for the reply! Hmm... but now I get the results below, and they don't look correct to me...

    PerformanceVector:
    Avg. within centroid distance: -385.889
    Avg. within centroid distance_cluster_0: -393.196
    Avg. within centroid distance_cluster_1: -351.386
    Avg. within centroid distance_cluster_2: -410.075
    Avg. within centroid distance_cluster_3: -384.852
    Avg. within centroid distance_cluster_4: -403.787
    Avg. within centroid distance_cluster_5: -371.171
    Avg. within centroid distance_cluster_6: -366.001
    Avg. within centroid distance_cluster_7: -402.358
    Davies Bouldin: -0.500

    I have also now included "Nominal to Numerical"... am I actually doing it correctly? I have just been following different online tutorials and trying to figure out how to do it...

    Thanksss so much in advance!




  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @joen841030,

    Why do you think that these results are incorrect?

    Regards,

    Lionel
  • joen841030 Member Posts: 8 Contributor I
    Hi @lionelderkrikor,
    Hmm, because I presumed the value should be somewhere between -1 and +1? Sorry that I don't understand those figures... It would be nice if you could kindly explain them. Thanks!
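
A note on the figures above: the -1 to +1 range belongs to the silhouette coefficient, not to the average within-centroid distance. The latter is a plain distance in the units of the data, so it has no fixed upper bound and tends to shrink as k grows. As far as I know, RapidMiner's Cluster Distance Performance operator multiplies these criteria by -1 unless its maximize option is enabled, which would explain the negative signs, but please check the operator documentation. The sketch below illustrates the idea in Python, outside RapidMiner. The toy 2-D points, the simple seeding rule, the use of Euclidean distance, and all function names are assumptions made for illustration.

```python
from math import dist, fsum

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; the first k points seed the centroids (toy choice)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # new centroid = coordinate-wise mean of the cluster members
                updated.append(tuple(fsum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

def avg_within_centroid_distance(points, centroids, labels):
    """Mean Euclidean distance from each point to its own cluster centroid."""
    return fsum(dist(p, centroids[lab]) for p, lab in zip(points, labels)) / len(points)

# Hypothetical toy data: three small blobs, interleaved so the seeds differ.
points = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0),
          (1.2, 0.9), (5.1, 4.8), (8.9, 1.2),
          (0.8, 1.1), (4.9, 5.2), (9.1, 0.8)]

distances = {}
for k in range(1, 5):
    centroids, labels = kmeans(points, k)
    distances[k] = avg_within_centroid_distance(points, centroids, labels)
    # The raw value is non-negative; the negated value mimics a "higher is better" report.
    print(f"k={k}: distance={distances[k]:.3f}  negated={-distances[k]:.3f}")
```

Plotting the distance against k and looking for the point where the curve flattens (the "elbow") is the usual way to read such output. On this toy data the drop from k = 2 to k = 3 is large and the gain beyond k = 3 is small.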