"Kmeans clustering in Text data"

ratheesan Member Posts: 68  Maven
edited June 2019 in Help
Hi,

After applying the string tokenizer, stopword filter, and token length filter to text data and selecting "Binary occurrence", we get all words as numerical attributes with binary values. My question is whether we can apply KMeans clustering directly to these numerical attributes. I tried this on my data and got meaningful clusters, but I don't know whether it is a good method for text data. Moreover, compared with KMedoids, it takes much less time.

Thanks
Ratheesan.

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531   Unicorn
    Hi,
    KMeans exploits properties of the Euclidean distance to simplify the KMedoids algorithm. This speeds up the calculation but restricts the distance measure to Euclidean. Euclidean distance is normally not the best choice for high-dimensional text data; cosine similarity is usually used instead. But if you get meaningful results, everything should be fine and you can go ahead with KMeans.

    Greetings,
    Sebastian
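
To illustrate the point about Euclidean distance versus cosine similarity on binary occurrence vectors, here is a minimal sketch in plain Python. It is not RapidMiner code, and the toy documents and the helper names (`binary_vector`, `l2_normalize`, `cosine_similarity`, `squared_euclidean`) are made up for this example. It shows that raw Euclidean distance can rank a short document as closer to an unrelated document than to a longer one on the same topic, while cosine similarity does not. It also shows one common workaround, assuming you want to stay with KMeans: scale each vector to unit length first, since for unit vectors the squared Euclidean distance equals 2 - 2 * cosine similarity.

```python
import math


def binary_vector(tokens, vocabulary):
    """Binary occurrence: 1.0 if the word appears in the document, else 0.0."""
    present = set(tokens)
    return [1.0 if word in present else 0.0 for word in vocabulary]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (all-zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    """Squared Euclidean distance, the quantity KMeans minimises."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical, already tokenized documents (stopword filtering skipped for brevity).
docs = [
    "cluster text data".split(),
    "cluster text data with many extra words about other things".split(),
    "stock market prices".split(),
]
vocabulary = sorted({word for doc in docs for word in doc})
short, long_, other = (binary_vector(doc, vocabulary) for doc in docs)

# Raw binary vectors: document length dominates the Euclidean distance.
raw_same_topic = squared_euclidean(short, long_)
raw_unrelated = squared_euclidean(short, other)

# Cosine similarity ignores length and only looks at shared words.
cos_same_topic = cosine_similarity(short, long_)
cos_unrelated = cosine_similarity(short, other)

# After L2 normalisation, Euclidean distance orders pairs the same way cosine does.
n_short, n_long, n_other = (l2_normalize(v) for v in (short, long_, other))
norm_same_topic = squared_euclidean(n_short, n_long)
norm_unrelated = squared_euclidean(n_short, n_other)

print("raw squared Euclidean:", raw_same_topic, raw_unrelated)
print("cosine similarity:", cos_same_topic, cos_unrelated)
print("normalised squared Euclidean:", norm_same_topic, norm_unrelated)
```

On these toy documents the raw distance puts the short document nearer to the unrelated one, and the normalised distance reverses that. Note that running KMeans on unit-length vectors only approximates cosine-based clustering, because the cluster centroids themselves are not renormalised.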