KMeans Clustering for Text
Hi RM Team! I have a quick question about applying KMeans clustering to text.
I have a set of ~2000 comments. Once I'm done with text processing (using TF-IDF), I have a word vector matrix of ~30 terms.
I then apply the KMeans operator, but I wonder what actually serves as the input for clustering. Is it the word vector matrix? If so, does the clustering algorithm use the values from the TF-IDF word vectors or some other values?
Telcontar120 (Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert)
Exactly, it is the word vector matrix that is used. So if you created the vectors using TF-IDF, it will use those values. You also have the option of using other methods to create the vectors, such as binary term occurrences or term frequency percentage.
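To illustrate the difference between weighting schemes outside of RapidMiner, here is a minimal Python sketch that builds a small document-by-term matrix two ways. It is an assumption-laden illustration, not RapidMiner's implementation: the `vectorize` function, the three sample comments, and the specific TF-IDF formula (relative term frequency times log(N / document frequency), with no vector normalization) are all hypothetical choices, and RapidMiner's exact weighting may differ.

```python
import math
from collections import Counter

def vectorize(docs, scheme="tfidf"):
    """Build a document-by-term matrix (hypothetical helper, not a RapidMiner API).

    scheme="tfidf"  -> relative term frequency * log(N / document frequency)
    scheme="binary" -> 1.0 if the term occurs in the document, else 0.0
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for term in vocab:
            if scheme == "binary":
                row.append(1.0 if counts[term] else 0.0)
            else:
                tf = counts[term] / len(tokens) if tokens else 0.0
                idf = math.log(n_docs / doc_freq[term])
                row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix

# Made-up comments, for illustration only.
comments = [
    "slow network slow support",
    "great network",
    "great support great price",
]
vocab, tfidf = vectorize(comments, scheme="tfidf")
_, binary = vectorize(comments, scheme="binary")

for name, matrix in (("tfidf", tfidf), ("binary", binary)):
    print(name)
    for row in matrix:
        print("  ", [round(value, 3) for value in row])
```

Each row of either matrix is one comment and each column is one term. Whichever matrix you pass on is what the clustering step sees, so switching the scheme changes the distances between comments and can change the resulting clusters.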
Thanks much!
Your clusters will be based on the values in the word vector that remain after pruning. If you are interested in the details, you can review the actual values for each cluster in the centroid table output of the KMeans operator.
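For anyone curious what a centroid table represents, here is a rough Python sketch of k-means run directly on a small word vector matrix. Everything in it is an assumption for illustration: the `kmeans` function, the three term names, and the weights are made up, and the initialization (first k rows as starting centroids) is chosen only to keep the example deterministic. It is not how the RapidMiner operator is implemented.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, n_iter=20):
    """Plain k-means (Lloyd's algorithm) with a deterministic start.

    Assumption: the first k rows are used as initial centroids.
    """
    centroids = [list(rows[i]) for i in range(k)]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assignment step: each row goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: each centroid becomes the mean of its member rows.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical pruned word vector matrix: one row per comment, one column per term.
terms = ["network", "price", "support"]
word_vectors = [
    [0.9, 0.0, 0.1],
    [0.1, 0.8, 0.0],
    [0.8, 0.1, 0.0],
    [0.0, 0.9, 0.2],
]

labels, centroids = kmeans(word_vectors, k=2)

# A simple "centroid table": the average weight of each term per cluster.
print("term".ljust(10) + "".join(f"cluster_{c}".rjust(12) for c in range(len(centroids))))
for j, term in enumerate(terms):
    print(term.ljust(10) + "".join(f"{centroid[j]:12.3f}" for centroid in centroids))
```

Reading the table column by column shows which terms carry the most weight in each cluster, which is the same kind of information the centroid table gives you for your ~30 terms.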
