
"which is better for k-means ? CorrelationSimilarity or EuclideanDistance?"

kimfeng Member Posts: 5 Contributor II
edited June 2019 in Help
The k-means algorithm in version 5.0 uses EuclideanDistance as its distance measure, but I think CorrelationSimilarity is better. What do you think?

Answers

  • wessel Member Posts: 537 Maven
    Depends on your problem.

    According to the no free lunch theorem they are equally good.

    According to Tom Mitchell, the choice you make influences the bias of the k-NN learner.
    You should make the choice such that the bias matches your problem best.
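    To make that concrete, here is a minimal sketch (Python/NumPy, not RapidMiner operators; all names are illustrative) of how the two measures can rank the same neighbours differently: Euclidean distance cares about magnitude, correlation only about the shape of the profile.

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def correlation_similarity(a, b):
    # Pearson correlation: invariant to shifting and scaling each vector
    return float(np.corrcoef(a, b)[0, 1])

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # same "shape" as x, larger magnitude
z = np.array([3.0, 1.5, 2.5])   # closer to x in Euclidean terms

# y is perfectly correlated with x yet far from it in Euclidean terms,
# while z is Euclidean-near but negatively correlated with x.
print(euclidean(x, y), correlation_similarity(x, y))
print(euclidean(x, z), correlation_similarity(x, z))
```

    So whether y or z is the "nearer" neighbour of x depends entirely on which measure you pick, which is exactly the bias choice mentioned above.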
  • kimfeng Member Posts: 5 Contributor II
    That's right. Thank you, Wessel! :)
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    another note, since you asked about k-means: k-means always uses the Euclidean distance; there is no other option, because this distance measure is directly connected to the fitness function that k-means optimizes. For this reason it can run in O(n log n) time. If you want to use a different similarity measure (there are dozens available in RapidMiner), you have to use k-medoids, which is slower and has quadratic runtime.
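    The connection can be checked numerically: the coordinate-wise mean is the point minimizing the sum of squared Euclidean distances, which is exactly the objective k-means optimizes, so the centroid update and the distance measure come as a package. A quick sketch (Python/NumPy, not RapidMiner):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 2))

def sse(center):
    # Sum of squared Euclidean distances: the k-means objective per cluster
    return float(((points - center) ** 2).sum())

mean = points.mean(axis=0)

# Any perturbation of the mean can only increase the objective,
# which is why the centroid update uses the mean with Euclidean distance.
for _ in range(100):
    assert sse(mean + rng.normal(scale=0.5, size=2)) >= sse(mean)
```

    Swap in a different measure and the mean is no longer the optimal representative, which is why k-medoids searches over actual data points instead.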

    Cheers,
    Ingo
  • nguyenxuanhau Member Posts: 22 Contributor II
    Hi,
    The running time of k-means is O(nkt), where n is the number of objects, k the number of clusters, and t the number of iterations, not O(n log n).
    Thanks
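    The O(nkt) count is easy to see in a plain Lloyd-style implementation: every one of the t iterations builds an n-by-k distance table. A minimal sketch (Python/NumPy; `init_idx` and the other names are illustrative):

```python
import numpy as np

def kmeans(X, k, iters, init_idx):
    """Plain Lloyd iteration: each pass computes an n x k distance
    table, so the total work is O(n * k * t) distance evaluations."""
    centers = X[np.asarray(init_idx)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # n x k squared Euclidean distances, then nearest-center assignment
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2))])
centers, labels = kmeans(X, k=2, iters=5, init_idx=[0, 39])
```

    The debate below is essentially about whether t can be treated as a constant or grows with n.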
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    I am actually no expert in runtime analyses at all - and I certainly do not want to reopen this old thread - but I think the question is whether the number of iterations "t" is itself a function of "n", right? As far as I remember, this was on average indeed the case:

    http://doc.utwente.nl/70194/1/FOCS2009_ArthurEA_kMeans.pdf

    For K-Medoids, all similarities have to be calculated at least once (hence the quadratic term), which is not necessary for the fitness function in K-Means derived from the Euclidean distance. In fact, I would expect the runtime of K-Medoids to be even slower than quadratic, something like O(n * n * log(n)), in analogy to the runtime analysis in the paper above.

    However, I did not look into the details of the paper and just remembered that I had heard about it before. But maybe this helps somebody...

    Cheers,
    Ingo
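    The quadratic term above comes from the full pairwise distance table that K-Medoids needs, since a medoid must be an actual data point rather than a mean. A small sketch (Python/NumPy, illustrative names):

```python
import numpy as np

def pairwise_sq_dists(X):
    # Full n x n squared-distance table: the quadratic cost K-Medoids pays,
    # which K-Means (mean update, Euclidean only) never has to materialize.
    sq = (X ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * X @ X.T

X = np.random.default_rng(0).normal(size=(100, 3))
D = pairwise_sq_dists(X)

# A medoid must be one of the data points: pick the member with the
# smallest total distance to everything else by scanning rows of D.
medoid = int(D.sum(axis=1).argmin())
```

    Because D only needs the distances themselves, any similarity measure can be plugged in here, which is exactly the flexibility K-Medoids buys at quadratic cost.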