Cluster basketball players based on their performance...
I have a dataset with many players and their performance for the season.
My goal is to cluster them into 3 or more groups based on their performance, e.g. high, average and low performance.
The attributes include position, average points, steals, mistakes, blocks, running distance, etc.
I guess the analysis will probably involve k-means, but I don't think I will need all the attributes for the clustering. The other task is to find out which few attributes can be used to split the players.
I am still very new to RapidMiner, so thanks in advance for all your help.
If anyone can point me in the right direction, that would be great. I am open to using any extensions.
Thanks.
Best Answers
jacobcybulski Member, University Professor Posts: 391 Unicorn
If you use k-means, you need numerical attributes. Make sure the attributes you select are independent of each other. Although k-means is not a linear model, you can use Correlation Matrix to establish independence of attributes: ignore the matrix itself and look at the weights. The higher the weight, the more (linearly) independent the attribute is of the others, and vice versa. While there are many other ways of weighting attributes, one great thing about doing it this way is that you do not need to define a label in the process (we are not predicting anything).
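For anyone who wants to see the idea outside RapidMiner, here is a minimal Python sketch of ranking attributes by how linearly independent they are of the rest. It is not the formula the Correlation Matrix operator uses for its weights; the score (one minus the mean absolute Pearson correlation with the other attributes) is an assumption made for illustration, and the attribute names and numbers are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def independence_scores(columns):
    """Hypothetical score: 1 - mean |r| against every other attribute.

    Higher means the attribute is less linearly related to the others.
    This is an illustrative stand-in, not RapidMiner's weighting.
    """
    scores = {}
    for name, values in columns.items():
        others = [abs(pearson(values, v)) for other, v in columns.items() if other != name]
        scores[name] = 1.0 - sum(others) / len(others)
    return scores

# Made-up season statistics for six players (no label column is needed).
players = {
    "points":  [25.0, 18.0, 8.0, 12.0, 30.0, 5.0],
    "minutes": [35.0, 30.0, 15.0, 22.0, 38.0, 10.0],  # rises and falls with points
    "blocks":  [0.5, 2.1, 0.3, 1.8, 0.9, 1.2],
}

scores = independence_scores(players)
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

In this toy data, points and minutes carry nearly the same information, so you would keep only one of them, while blocks adds something new.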
jacobcybulski Member, University Professor Posts: 391 Unicorn
The best use of cluster performance measures is in optimisation, when searching for the best cluster parameters, e.g. using grid optimisation. A single performance measure on its own will not be useful. Davies-Bouldin is very tricky and I have never had much success with it. For DB to work, your data must be smooth and convex: smooth as in continuous, and convex in multidimensional space, which is hard to imagine and hard to achieve on real data. If you use DB, decide on the range of k that is acceptable for you and pick the DB value closest to zero in that range, while avoiding peaks and troughs near the minimum (so go for the flat areas around it). I often use Item Distribution Performance and select sumofsquares as the measure (a sort of cluster error from its centre). Then plot SSE vs k and look for the elbow, i.e. the point where increasing the clustering complexity, as given by k, no longer gives any significant gain in performance.
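The SSE-versus-k elbow search can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the k-means below uses a deterministic farthest-first initialisation (a simplification, not what RapidMiner does), the data are three made-up, well-separated groups, and the 10% threshold used to flag the elbow is an arbitrary choice rather than a standard rule.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic initial centres: start at the first point, then
    repeatedly add the point farthest from all centres chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_sse(points, k, max_iter=100):
    """Run Lloyd's k-means and return the sum of squared errors (SSE)."""
    centers = farthest_first(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Each new centre is the mean of its members; empty clusters keep their centre.
        new_centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Made-up 2-D data: three tight groups of four points each.
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5)]
group_centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [(cx + dx, cy + dy) for cx, cy in group_centers for dx, dy in offsets]

sse = {k: kmeans_sse(points, k) for k in range(1, 7)}
for k, value in sse.items():
    print(f"k={k}: SSE={value:.2f}")

# Arbitrary elbow rule (an assumption): the first k where adding one more
# cluster improves SSE by less than 10% of the SSE at k=1.
threshold = 0.10 * sse[1]
elbow = next(k for k in range(1, 6) if sse[k] - sse[k + 1] < threshold)
print("elbow at k =", elbow)
```

On real player data you would normalise the attributes first and read the elbow off the plot rather than trust a fixed threshold.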
Answers
Or does the result need to be below 1 for the clustering to count as a 'good' one?