"Kmeans clustering in Text data"

ratheesan · December 2009

Hi,

After applying the string tokenizer, stopword filter, and token length filter to text data with "Binary occurrence" selected, we get every word as a numerical attribute with binary values. My question is: can we apply KMeans clustering directly to these numerical attributes? I tried this on my data and got meaningful clusters, but I don't know whether it is a good method for text data. Moreover, it takes far less time than KMedoids.
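For readers outside RapidMiner, the preprocessing described above can be sketched roughly in plain Python. This is only an illustration of the idea, not what the RapidMiner operators do internally: the stopword list, the length bounds, and the function names are all assumptions made up for the example.

```python
import re

# Hypothetical stopword list and token length bounds, chosen only for illustration.
STOPWORDS = {"the", "a", "is", "and", "of", "on"}
MIN_LEN, MAX_LEN = 3, 25


def tokenize(text):
    """Lowercase the text and split it on any run of non-letter characters."""
    return [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]


def preprocess(text):
    """Tokenize, then drop stopwords and tokens outside the length bounds."""
    return [
        t for t in tokenize(text)
        if t not in STOPWORDS and MIN_LEN <= len(t) <= MAX_LEN
    ]


def binary_occurrence(docs):
    """Return (vocabulary, matrix) where matrix[i][j] is 1 if word j occurs in doc i."""
    token_sets = [set(preprocess(d)) for d in docs]
    vocab = sorted(set().union(*token_sets))
    matrix = [[1 if w in tokens else 0 for w in vocab] for tokens in token_sets]
    return vocab, matrix


docs = ["The cat sat on the mat", "A dog chased the cat", "Stock prices fell"]
vocab, matrix = binary_occurrence(docs)
for row in matrix:
    print(row)
```

Each row is one document and each column one word, which is the kind of numerical table a clustering operator would then receive.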

Thanks
Ratheesan.

land · December 2009

Hi,
KMeans uses some properties of the Euclidean distance to simplify the KMedoids algorithm. This speeds up the calculation, but restricts the distance measure to Euclidean. Euclidean distance is normally not the best choice for high-dimensional text data; cosine similarity is the usual one. But if you get meaningful results, everything should be fine and you can go ahead with KMeans.
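To make the difference between the two measures concrete, here is a small plain-Python sketch on made-up binary occurrence vectors (the vectors and function names are assumptions for illustration, not RapidMiner output). It also checks a general identity: for unit-length vectors, the squared Euclidean distance equals 2 - 2 * cosine similarity, so normalizing the document vectors first is one possible way to make a Euclidean-based method behave more like a cosine-based one.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    """Scale a vector to unit Euclidean length."""
    n = norm(a)
    return [x / n for x in a]


# Made-up binary occurrence vectors over a six-word vocabulary.
short_doc = [1, 1, 0, 0, 0, 0]  # two words
long_doc = [1, 1, 1, 1, 1, 0]   # the same two words plus three more
other_doc = [0, 0, 0, 0, 0, 1]  # no words in common with short_doc

# Euclidean distance puts both documents equally far from short_doc ...
print(euclidean(short_doc, long_doc), euclidean(short_doc, other_doc))
# ... while cosine similarity separates the overlapping pair from the disjoint one.
print(cosine_similarity(short_doc, long_doc), cosine_similarity(short_doc, other_doc))

# On unit-length vectors: squared Euclidean distance == 2 - 2 * cosine similarity.
u, v = normalize(short_doc), normalize(long_doc)
print(euclidean(u, v) ** 2, 2 - 2 * cosine_similarity(short_doc, long_doc))
```

Document length dominates the Euclidean distance here, which is one reason it tends to suit raw text vectors less well than cosine similarity.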

Greetings,
Sebastian
