
Text Clustering using K-Medoids Algorithm

puteri_prameswa Member Posts: 3 Contributor I
edited November 2018 in Help

Hi All!

 

I'm new to RapidMiner. I have 1000+ online reviews collected from Tripadvisor.com, and I want to apply the K-Medoids algorithm to group those reviews into clusters. I chose K-Medoids because I want to find the "medoid" of each cluster, which I believe can represent the contents of the entire cluster. I have already applied the following operators:

- Read Excel

- Select Attributes

- Nominal to Text

- Process Documents from Data (Tokenization, Stemming, Stopwords Removal)

- and the Clustering node itself

 

But I can't seem to get proportional clusters. With 1000+ reviews and k = 2, the ratio of members in clusters 1 and 2 is 99 : 1.

 

 

Please help me!

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,528 RM Data Scientist

    Hi,

     

    Have you tried using:

     

    a) TF-IDF

    b) cosine similarity as distance measure
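
To make the two suggestions concrete outside of RapidMiner, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine distance. It is only an illustration under stated assumptions: the four reviews are made-up toy data, the helper names `tfidf_vectors` and `cosine_distance` are hypothetical, and the weighting (relative term frequency times log inverse document frequency) is one common variant that may differ in detail from what the Process Documents operator computes.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Relative term frequency times log inverse document frequency.
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 1.0 if either vector is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Made-up toy reviews, already lowercased; splitting on spaces stands in for tokenization.
reviews = [
    "great hotel friendly staff",
    "friendly staff great location",
    "dirty room rude staff",
    "rude staff dirty bathroom",
]
docs = [r.split() for r in reviews]
vecs = tfidf_vectors(docs)

# Pairwise cosine distance matrix, the kind of input a k-medoids step can work on.
dist = [[cosine_distance(a, b) for b in vecs] for a in vecs]
for row in dist:
    print([round(d, 3) for d in row])
```

Note that "staff" occurs in every toy review, so its TF-IDF weight is zero and it no longer makes all reviews look alike. Cosine distance then compares the direction of the vectors rather than their length, so long and short reviews are treated on equal terms.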

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I agree with @mschmitz's suggestions. However, none of the k-means family of clustering algorithms guarantees that the clusters will be of equal size. The algorithm does not look at cluster sizes directly, but at intra-cluster similarity versus inter-cluster similarity. You may want to try X-Means, which tests a range of possible k values and suggests the best one based on BIC (the Bayesian information criterion).
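
To illustrate the point about cluster sizes, here is a small pure-Python sketch of k-medoids on a precomputed distance matrix. Everything in it is an assumption made for illustration: the six 1-D points are toy data with one deliberate outlier, the function names `init_medoids` and `k_medoids` are hypothetical, and the simple alternating update with a farthest-first start is not necessarily what RapidMiner's K-Medoids operator does. It also does not implement X-Means or BIC.

```python
def init_medoids(dist, k):
    """Farthest-first start: the most central point, then points far from chosen medoids."""
    n = len(dist)
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))
    return medoids

def k_medoids(dist, k, max_iter=100):
    """Alternate between assigning points to the nearest medoid and re-picking medoids.

    Returns (medoids, labels): medoids are row indices into dist, and labels[i]
    is the position in `medoids` of the medoid nearest to point i.
    """
    n = len(dist)
    medoids = init_medoids(dist, k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster ends up empty
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to its cluster.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
    return medoids, labels

# Toy data: five nearby points and one far outlier.
points = [0, 1, 2, 3, 4, 100]
dist = [[abs(a - b) for b in points] for a in points]

medoids, labels = k_medoids(dist, k=2)
sizes = [labels.count(c) for c in range(2)]
print("medoids:", [points[m] for m in medoids])
print("cluster sizes:", sizes)
```

On this toy data the split is 5 : 1, because isolating the outlier gives the tightest clusters. Nothing in the procedure rewards balanced sizes, so a lopsided result like 99 : 1 can be a legitimate outcome when a few documents sit far from everything else under the chosen distance measure.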

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts