Text clustering and labeling
I'm using RapidMiner for text clustering (k-means) and then labeling the clusters. We usually have around 2000 documents, and the texts are in German. The texts are short (the title and a short description of news items or articles), and so far RapidMiner is working nicely. In the text processing phase I use term frequency vectors instead of the more commonly used TF-IDF, as term frequency seems to work better in our case.
I now have some questions:
- How can we label the clusters nicely, i.e. with human-readable titles?
- After running k-means, how can I see the most relevant document in a cluster? (As a first attempt, I simply want to use the title of this document as the title of the whole cluster.)
- I have trained a classification model (k-NN) to first put the documents into some known groups (politics, sport, etc.) and then run the main clustering process on the documents in each group, to achieve a nicer two-level clustering. But I don't know how to connect the result of the classification process to the clustering process so that the whole process is automated (instead of running the classification, manually marking the documents of each group, and then running the clustering on those documents).
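To make the questions above more concrete, here is a rough Python sketch of what I am after, written outside the tool and not using any of its operators. Everything in it is my own assumption for illustration: the function names (`label_clusters`, `cluster_within_groups`), the whitespace tokenization, and the choice of cosine similarity to the cluster centroid as the measure of the "most relevant" document. `classify` and `cluster` are hypothetical placeholders for the trained classifier and the k-means step.

```python
from collections import Counter
import math


def tf_vector(text):
    """Raw term-frequency vector: lowercase token -> count (naive whitespace split)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Mean of a list of sparse term-frequency vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_clusters(titles, texts, assignments, n_terms=3):
    """Per cluster id: title of the document most similar to the centroid,
    plus the highest-weighted centroid terms (ties broken alphabetically)."""
    clusters = {}
    for idx, cluster_id in enumerate(assignments):
        clusters.setdefault(cluster_id, []).append(idx)

    labels = {}
    for cluster_id, members in clusters.items():
        vectors = [tf_vector(texts[i]) for i in members]
        center = centroid(vectors)
        best = max(range(len(members)), key=lambda k: cosine(vectors[k], center))
        top_terms = sorted(center, key=lambda t: (-center[t], t))[:n_terms]
        labels[cluster_id] = {"title": titles[members[best]], "terms": top_terms}
    return labels


def cluster_within_groups(texts, classify, cluster):
    """Two-level step: group documents by predicted category, then cluster
    each group separately. Returns document index -> (category, cluster id)."""
    groups = {}
    for idx, text in enumerate(texts):
        groups.setdefault(classify(text), []).append(idx)

    result = {}
    for category, members in groups.items():
        local_ids = cluster([texts[i] for i in members])
        for idx, cluster_id in zip(members, local_ids):
            result[idx] = (category, cluster_id)
    return result
```

Here `assignments` would be the list of cluster ids that k-means produces, one per document, and `label_clusters(titles, texts, assignments)` would give one candidate title and a few keywords per cluster. What I don't know is how to build the equivalent of these two steps as one automated process.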
Thank you in advance for your help.