Hi! I am trying to run a cluster analysis on a dataset of about 35,000 tweets to find clusters that discuss similar sub-topics, and I am not entirely sure how to approach it. So far I have tried DBSCAN, but it essentially returned one giant cluster. Do I need a different clustering method, or should I somehow pre-define the number of clusters using the most common words? This is planned as an extension to my thesis, and I am new to both clustering and RapidMiner, so any help would be greatly appreciated.
K-means clustering lets you specify the number of clusters (k) in advance, so depending on what you would like to see, you could try that. Alternatively, X-means tests a range of values of k and returns the one it considers "best" (see the operator documentation for details). I would start with those two and see what you get.
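If it helps to see the idea outside RapidMiner, here is a minimal Python sketch of k-means on tweet text. It is only an illustration under stated assumptions, not how the RapidMiner operator is implemented: the six example tweets are made up, the text is vectorised as plain normalised word counts, and the centroids are seeded with a simple farthest-first rule so the run is repeatable.

```python
import math
import re
from collections import Counter


def vectorize(texts):
    """Turn each text into an L2-normalised word-count vector."""
    docs = [Counter(re.findall(r"[a-z]+", t.lower())) for t in texts]
    vocab = sorted(set().union(*docs))
    vectors = []
    for doc in docs:
        vec = [float(doc[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; returns one cluster index per input vector."""
    # Assumption: farthest-first seeding, chosen here only for determinism.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = []
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical example tweets, invented for this sketch.
tweets = [
    "smart traffic lights cut congestion",
    "traffic sensors reduce congestion downtown",
    "new traffic app eases congestion",
    "solar panels power street lamps",
    "solar grid saves power costs",
    "power from solar rooftops grows",
]
vocab, vectors = vectorize(tweets)
labels = kmeans(vectors, k=2)
for label, tweet in zip(labels, tweets):
    print(label, tweet)
```

On real tweets you would want stop-word removal and TF-IDF weighting before clustering, and k becomes the parameter you experiment with.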
Labeling some examples based on keywords or concepts can be fruitful, but it takes you out of unsupervised clustering and into semi-supervised or fully supervised learning. That gives you many other options in terms of predictive models, but it is fundamentally a different type of problem.
Thank you very much!
I am just not very sure how to arrive at a proper number of clusters for a k-means analysis. What I am trying to do is to determine the most talked about sub-concepts of my topic (smart cities). Would you say it would make sense to, for example, look at the top keywords and see which of these pertain to different concepts and then set that as a value for k? If I wanted to do a more supervised run, how would I label examples?
Again, thank you for your help.
I would recommend first running X-means and seeing what it recommends as the "optimal" number of clusters.
Then you could turn those clusters into labels and see what characteristics define them using a predictive modeling approach like a Decision Tree. (See this thread for an example process that will help you with this step: http://community.rapidminer.com/t5/RapidMiner-Studio-Forum/How-to-find-the-traits-of-each-cluster/m-...).
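To make the "clusters as labels" idea concrete, here is a rough Python sketch. It is not the process from the linked thread and not a full Decision Tree: it only computes the first split such a tree might choose, i.e. the single word whose presence or absence best separates the cluster labels by information gain. The tweets and their cluster assignments are hypothetical.

```python
import math
import re


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def best_split_word(texts, labels):
    """Return (word, gain) for the word with the highest information gain."""
    token_sets = [set(re.findall(r"[a-z]+", t.lower())) for t in texts]
    base = entropy(labels)
    best_word, best_gain = None, -1.0
    for word in sorted(set().union(*token_sets)):
        present = [lab for toks, lab in zip(token_sets, labels) if word in toks]
        absent = [lab for toks, lab in zip(token_sets, labels) if word not in toks]
        # Weighted entropy left over after splitting on this word.
        remainder = sum(
            len(part) / len(labels) * entropy(part)
            for part in (present, absent)
            if part
        )
        gain = base - remainder
        if gain > best_gain:
            best_word, best_gain = word, gain
    return best_word, best_gain


# Hypothetical tweets with cluster ids assumed to come from an earlier clustering run.
tweets = [
    "smart traffic lights cut congestion",
    "traffic sensors reduce congestion downtown",
    "new traffic app eases congestion",
    "solar panels power street lamps",
    "solar grid saves power costs",
    "power from solar rooftops grows",
]
cluster_labels = [0, 0, 0, 1, 1, 1]

word, gain = best_split_word(tweets, cluster_labels)
print(word, round(gain, 3))
```

A real tree repeats this kind of split recursively on each branch, which is what gives you a readable description of each cluster.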
If you wanted to go the entirely supervised route, you would start by coming up with your own categories (I would encourage you to keep to a relatively small number, like five or fewer) and then hand-label enough cases (probably several dozen per category to start) to build a predictive model, which would be a polynominal classification model (RapidMiner's term for a nominal label with more than two values). Then you could apply that model to the larger set of data and do some additional profiling.
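As a rough illustration of that supervised workflow, here is a small Python sketch that trains on a handful of hand-labelled tweets and then labels new ones. Everything specific in it is an assumption for the example: the category names "mobility" and "energy", the tweets themselves, and the choice of a multinomial naive Bayes classifier with add-one smoothing, which is just one simple option among many.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled):
    """Fit a multinomial naive Bayes model from (text, category) pairs."""
    class_counts = Counter(cat for _, cat in labelled)
    word_counts = defaultdict(Counter)
    for text, cat in labelled:
        word_counts[cat].update(tokens(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the category with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for cat, n_docs in class_counts.items():
        total_words = sum(word_counts[cat].values())
        score = math.log(n_docs / total_docs)
        for w in tokens(text):
            if w in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[cat][w] + 1) / (total_words + len(vocab)))
        scores[cat] = score
    return max(scores, key=scores.get)


# Hypothetical hand-labelled examples; a real set would need several dozen per category.
labelled = [
    ("traffic lights cut congestion", "mobility"),
    ("bus lanes ease traffic", "mobility"),
    ("solar panels power lamps", "energy"),
    ("wind and solar power grid", "energy"),
]
model = train(labelled)

# Apply the model to unlabelled tweets from the larger data set.
for tweet in ["more solar power for the city", "traffic jam near the bus stop"]:
    print(predict(model, tweet), "|", tweet)
```

With only a few examples per category the predictions will be shaky, which is why hand-labelling several dozen cases per category is a sensible starting point.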