Data Mining and KMeans

Silence · March 2010

Greetings!

I'm new to data mining, and I'm currently interested in learning kMeans... and I've got some questions for you guys.

My sample dataset consists of 49 records, each having 60 attributes/values.
I want to learn how the computation and assignment for the means/centroids is done.

I would also like to ask if my operators for this clustering algorithm are correct:
Root
|__AccessSampleSource (I chose this one because my database format is MS Access 2003)
|__MissingValueReplenishment (set to zero)
|__KMeans

For the visualization, I always choose Scatter Multiple, with the cluster on the x-axis and some of the attributes (usually 15 attributes) on the y-axis.

Am I doing it right?

I hope someone could enlighten me soon!

Thank you, and more power to RapidMiner!

land · March 2010

Hi,
to understand how K-Means works, I would suggest reading the respective Wikipedia entry: http://en.wikipedia.org/wiki/Kmeans

Anyway, your process setup seems quite suitable for this setting, but I would change the missing value replenishment method to use the mean value. That way the replaced values differ least from the other values during distance calculation. Otherwise, examples with missing values could all be assigned to a single cluster just because they have missing values.
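
To illustrate the effect on the distance calculation, here is a small sketch in plain Python. It is only an illustration under assumptions, not what the RapidMiner operator does internally: the helper names (column_means, replace_missing, euclidean) and the tiny data set are made up for this example, and missing values are assumed to be stored as None.

```python
def column_means(rows):
    """Mean of each attribute, ignoring missing values (None)."""
    means = []
    for j in range(len(rows[0])):
        present = [row[j] for row in rows if row[j] is not None]
        # Assumption: an attribute with no observed values falls back to 0.0.
        means.append(sum(present) / len(present) if present else 0.0)
    return means


def replace_missing(rows, fill):
    """Replace None in attribute j with fill[j]; other values are kept."""
    return [[fill[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def euclidean(a, b):
    """Euclidean distance between two examples of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical data: three examples, two attributes, one missing value.
rows = [
    [10.0, 200.0],
    [12.0, None],
    [11.0, 210.0],
]

mean_filled = replace_missing(rows, column_means(rows))
zero_filled = replace_missing(rows, [0.0, 0.0])

# Distance between the first example and the one that had a missing value.
print("mean replacement:", euclidean(mean_filled[0], mean_filled[1]))
print("zero replacement:", euclidean(zero_filled[0], zero_filled[1]))
```

With zero replacement, the second example ends up far away from the others only because of the filled-in value, which is the effect described above.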

If you find this visualization helpful, go on with it, but I think using two attributes for the x and y axes and using color for the cluster assignment is much more intuitive.

As a general hint, I would suggest upgrading to RapidMiner 5, which has a lot more power.

Greetings,
Sebastian

Silence · March 2010

I'm happy to know that my process setup is correct!

With regard to K-Means, most examples on web sites have only two attributes, so it's easy to visualize and follow the process (from the selection of centroids to the grouping of records).
I really want to simulate the K-Means process manually with my own data set, but I don't know how (or where) to start because of its 60 attributes. That alone leaves me confused. How would I evaluate this kind of data set?

P.S.

Thanks for the answers on my first post, Sebastian.

edit:

Finally found a website whose example has multiple attributes.

Silence · March 2010

A follow-up question.

Is the SVD Reduction operator necessary for every k-Means process?

Silence · March 2010

Another question folks~

In RapidMiner's k-Means algorithm, are the centroids randomly selected per iteration?

land · March 2010

Hi,
of course the SVDReduction is not obligatory for clustering! It's just a method for reducing the dimensionality of your data set. There are many more methods for this, like PCA, ICA and so on, but you don't need one at all. It might be useful, depending on your data, but it might also hurt. It will be extremely hard to draw conclusions from a clustering of reduced data, because none of the original attributes are left.

To your second question:
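Yes, but only at the start: RapidMiner initializes a KMeans run with random centroids. To avoid having them lie outside the boundaries of the data, one example is chosen as the initial centroid for each cluster. In the following iterations, each centroid is recomputed as the mean of the examples assigned to it.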
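
If it helps with simulating the process by hand, here is a rough sketch of the idea in plain Python. It is not RapidMiner's actual code, just textbook K-Means with the initialization described above (k examples picked at random as starting centroids). The function names, the seed parameter and the small data set are assumptions made for this example. Nothing in it depends on the number of attributes, so it works the same way for 60 attributes as for two.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance; works for any number of attributes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Attribute-wise mean of a non-empty list of examples."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(rows, k, max_iterations=100, seed=0):
    """Textbook K-Means; returns (centroids, cluster index per example)."""
    rng = random.Random(seed)
    # Initialization: k distinct examples are used as the first centroids.
    centroids = [list(row) for row in rng.sample(rows, k)]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each example goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # nothing changed, so the centroids are stable
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = mean_point(members)
    return centroids, assignment


# Hypothetical data: six examples with three attributes each.
rows = [
    [1.0, 1.0, 1.0], [1.2, 0.8, 1.1], [0.9, 1.1, 1.0],
    [8.0, 8.0, 8.0], [8.2, 7.9, 8.1], [7.9, 8.1, 8.0],
]

centroids, assignment = kmeans(rows, k=2)
print(assignment)
print(centroids)
```

Because the starting centroids are random, different seeds can number the clusters differently, and on harder data they can also lead to different final clusters.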

Greetings,
Sebastian
