Data Mining and KMeans

Silence Member Posts: 7 Contributor II

I'm new to data mining, and I'm currently interested in learning kMeans... and I've got some questions for you guys.

My sample dataset consists of 49 records, each having 60 attributes/values.
I want to learn how the computation and assignment of the means/centroids are done.

I would also like to ask if my operators for this clustering algorithm are correct:
|__AccessSampleSource (I chose this one because my database format is MS Access 2003)
|__MissingValueReplenishment (set to zero)

For the visualization, I always choose Scatter Multiple, with the cluster on the x-axis and some of the attributes (usually 15 attributes) on the y-axis.

Am I doing it right?

I hope someone could enlighten me soon!

Thank you, and more power to RapidMiner! =)


  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    For understanding how K-Means works, I would suggest reading the respective Wikipedia entry.

    Anyway, your process setup seems quite reasonable for this setting, but I would change the missing value replenishment method to use the mean value. This way the missing values will differ least from the other values during distance calculation. Otherwise, examples with missing values could all be assigned to a single cluster just because they have missing values.
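
To illustrate the point about distances, here is a minimal sketch in plain Python (not RapidMiner code; the toy data and the helper names `column_means`, `fill_missing` and `euclidean` are made up for illustration). It fills one missing value first with zero and then with the attribute mean, and compares how far the filled example ends up from a complete one.

```python
import math

def column_means(rows):
    """Mean of each attribute, ignoring missing values (None)."""
    means = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present))
    return means

def fill_missing(rows, fill_values):
    """Replace None in each column with the given per-column fill value."""
    return [[fill_values[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def euclidean(a, b):
    """Euclidean distance between two examples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy data: 4 examples, 2 attributes, one missing value.
data = [
    [10.0, 20.0],
    [12.0, 22.0],
    [11.0, None],
    [13.0, 24.0],
]

zero_filled = fill_missing(data, [0.0, 0.0])
mean_filled = fill_missing(data, column_means(data))

# Distance from the example that had the missing value to the first example.
dist_zero = euclidean(zero_filled[2], zero_filled[0])
dist_mean = euclidean(mean_filled[2], mean_filled[0])
print(dist_zero, dist_mean)
```

With zero filling, the example lands far away from all complete examples, while with mean filling it stays close to them. The sketch assumes every attribute has at least one non-missing value.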

    If you find this visualization helpful, go on, but I guess using two attributes for the x- and y-axes and using the color for the cluster assignment is much more intuitive.

    As a general hint, I would suggest upgrading to RapidMiner 5, which has a lot more power :)

  • Silence Member Posts: 7 Contributor II
    I'm happy to know that my process setup is correct! :D

    With regard to K-Means, most examples on web sites have only two attributes, so it's easy to visualize and learn how the process is done (from the selection of centroids to the grouping of records).
    I really want to simulate the K-Means process manually with my own data set, but I don't know how (or where) to start because of its 60 attributes. That alone leaves me confused. How will I evaluate this kind of data set?

    P.S.

    Thanks for the answers on my first post, Sebastian. :)


    Finally found a website whose example has multiple attributes.
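
    For anyone else stuck at the same point: the steps do not change with the number of attributes, because the distance is simply summed over all of them. Below is a minimal sketch of the standard K-Means loop in plain Python, assuming Euclidean distance. It is not RapidMiner's implementation; the function names and the toy data (3 attributes instead of 60) are made up for illustration.

```python
import math

def euclidean(a, b):
    """Distance over all attributes, however many there are."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(records, centroids):
    """Assignment step: index of the nearest centroid for each record."""
    return [min(range(len(centroids)),
                key=lambda c: euclidean(rec, centroids[c]))
            for rec in records]

def update(records, labels, centroids):
    """Update step: each centroid becomes the attribute-wise mean of its members."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [rec for rec, lab in zip(records, labels) if lab == c]
        if not members:  # keep the centroid of an empty cluster unchanged
            new_centroids.append(list(old))
            continue
        new_centroids.append([sum(col) / len(members) for col in zip(*members)])
    return new_centroids

def kmeans(records, centroids, max_iter=100):
    """Repeat assignment and update until the assignments stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(records, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update(records, labels, centroids)
    return labels, centroids

# Hypothetical toy data: 4 records with 3 attributes each.
records = [
    [1.0, 2.0, 1.0],
    [2.0, 1.0, 2.0],
    [9.0, 8.0, 9.0],
    [8.0, 9.0, 8.0],
]
# Start from the first and third record as initial centroids.
labels, centroids = kmeans(records, [records[0], records[2]])
print(labels)     # cluster index per record
print(centroids)  # final centroid per cluster
```

    To simulate it by hand, do one `assign` and one `update` on paper, then compare the result with what the code prints after the same step.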
  • Silence Member Posts: 7 Contributor II
    A follow up question.

    Is the SVD Reduction operator necessary for every k-Means process?

  • Silence Member Posts: 7 Contributor II
    Another question, folks~

    In RapidMiner's k-Means algorithm, are the centroids randomly selected per iteration?
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Of course the SVDReduction is not obligatory for clustering! It's just a method for reducing the dimensionality of your data set. There are many more methods for this, like PCA, ICA and so on, but you don't need one at all. It might be useful, depending on your data, but it might also hurt. It will be extremely hard to draw conclusions from a clustering of reduced data, because you don't have any of the original attributes left.

    To your second question:
    Yes, RapidMiner initializes a KMeans run with random centroids. To avoid having a centroid lie outside the boundaries of the data, one example is chosen as the initial centroid for each cluster.
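
    A minimal sketch of that initialization idea in plain Python (this is not RapidMiner's actual source; `init_centroids` and its `seed` parameter are hypothetical names). In standard K-Means the random draw happens once, before the first iteration; later iterations recompute each centroid as the mean of its cluster rather than drawing new ones.

```python
import random

def init_centroids(records, k, seed=None):
    """Pick k distinct examples from the data as the starting centroids."""
    rng = random.Random(seed)
    # Copy each chosen example so later centroid updates never touch the data.
    return [list(rec) for rec in rng.sample(records, k)]

# Hypothetical toy data: 5 examples with 2 attributes each.
records = [[1.0, 2.0], [2.0, 1.0], [9.0, 8.0], [8.0, 9.0], [5.0, 5.0]]
centroids = init_centroids(records, k=2, seed=42)
print(centroids)
```

    Because every starting centroid is an actual example, none of them can fall outside the range of the data. Fixing the seed makes a run repeatable; different seeds can lead to different final clusters.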
