"KMeans clustering on text data"

ratheesan Member Posts: 68 Maven
edited June 2019 in Help
Hi,

After applying the string tokenizer, stopword filter, and token length filter to text data with "Binary occurrence" selected, we get every word as a numerical attribute with binary values. My question is whether we can apply KMeans clustering directly to these numerical attributes. I tried this on my data and got meaningful clusters, but I don't know whether it is a good method for text data. Moreover, compared with KMedoids, it takes much less time.

Thanks
Ratheesan.
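
For readers outside RapidMiner, the preprocessing described above can be sketched in plain Python. This is only an illustration, not the RapidMiner operators themselves: the stopword list, the length limits, and the helper names `tokenize` and `binary_occurrence` are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "is", "of", "and", "on", "in"}

def tokenize(text, min_len=3, max_len=25):
    """Lowercase, split on non-letters, drop stopwords and tokens outside the length limits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens
            if t not in STOPWORDS and min_len <= len(t) <= max_len]

def binary_occurrence(docs):
    """Return (vocabulary, vectors); each vector holds 1 if the word occurs in the document, else 0."""
    token_sets = [set(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*token_sets))
    vectors = [[1 if term in tokens else 0 for term in vocab]
               for tokens in token_sets]
    return vocab, vectors

docs = ["The cat sat on the mat", "A dog sat in the garden", "Cat and dog"]
vocab, vectors = binary_occurrence(docs)
print(vocab)  # ['cat', 'dog', 'garden', 'mat', 'sat']
for v in vectors:
    print(v)
```

Each word becomes one numerical attribute, so a clustering algorithm sees every document as a point in a space with as many dimensions as there are distinct words.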

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    KMeans uses a property of the Euclidean distance to simplify the KMedoids algorithm: the point that minimizes the summed squared Euclidean distance to all members of a cluster is simply their mean, so there is no need to search the data points for the best medoid. This speeds up the calculation, but restricts the distance measure to Euclidean. Euclidean distance is normally not the best choice for high-dimensional text data; cosine similarity is the usual choice. But if you get meaningful results, everything should be fine and you can go ahead with KMeans.

    Greetings,
      Sebastian
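
To make the Euclidean versus cosine point concrete, here is a small plain-Python sketch on made-up binary occurrence vectors. The three example vectors and the helper names are assumptions for illustration, and the helpers assume no document vector is all zeros. It also checks a standard identity: for vectors scaled to unit length, the squared Euclidean distance equals 2 · (1 − cosine similarity), which is why normalizing the vectors before KMeans is a common way to get closer to cosine-based behaviour.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Made-up binary occurrence vectors over a six-word vocabulary.
short_doc = [1, 1, 0, 0, 0, 0]
long_doc = [1, 1, 1, 1, 1, 1]    # contains both words of short_doc plus four more
other_doc = [0, 0, 0, 0, 1, 0]   # shares no word with short_doc

# Euclidean distance ranks the unrelated document as the closer one ...
print(euclidean(short_doc, long_doc), euclidean(short_doc, other_doc))

# ... while cosine similarity ranks the overlapping document as more similar.
print(cosine_similarity(short_doc, long_doc), cosine_similarity(short_doc, other_doc))

# On unit-length vectors: squared Euclidean distance == 2 * (1 - cosine similarity).
a, b = l2_normalize(short_doc), l2_normalize(long_doc)
print(euclidean(a, b) ** 2, 2 * (1 - cosine_similarity(short_doc, long_doc)))
```

In this toy case the raw Euclidean distance is dominated by document length rather than by shared vocabulary, which is the usual reason cosine similarity is preferred for text.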