Distance to cluster centre for every data point

kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn
edited November 2018 in Help

Hi guys

 

I'm not a big expert in clustering and couldn't find a suitable solution on the forum, so here's the question.

 

When I perform clustering, is there a simple RapidMiner way to obtain the exact distances to each cluster centre for each and every example in the dataset?

 

For example, if I have cluster1 and cluster2, and cluster1 contains examples v1, v2, v3, how could I find out which of v1, v2, v3 is closest to the cluster1 centre (the most representative example) and which is farthest from it (the least representative example)?

 

Thank you :)

Best Answer

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Solution Accepted

    Hi,

    Can't you do Extract Cluster Centroids + Cross Distance?

     

    BR,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
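
For readers outside RapidMiner, the idea behind combining centroids with a cross-distance step can be sketched in plain Python. This is only an illustration, not the operator's implementation: the function name `cross_distances`, the sample points, and the choice of Euclidean distance are assumptions, and the `request` / `document` keys merely mirror the column names mentioned later in this thread.

```python
import math

def cross_distances(examples, centroids):
    """Return one row per (centroid, example) pair with the Euclidean distance.

    Hypothetical helper: the 'request' and 'document' keys imitate the
    column names mentioned in the thread, not a documented schema.
    """
    rows = []
    for c_idx, centroid in enumerate(centroids):
        for e_idx, example in enumerate(examples):
            rows.append({
                "request": c_idx,    # index of the cluster centroid
                "document": e_idx,   # position of the example in the input list
                "distance": math.dist(example, centroid),
            })
    return rows

# Made-up sample data for illustration only.
examples = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
table = cross_distances(examples, centroids)
```

Because `document` is just the position of the example in the input list, `examples[row["document"]]` recovers the original example for any row of `table`.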

Answers

  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @mschmitz

     

    Yes, I can :) This seems to be a solution, though not a very obvious one.

    But this way I guess I am getting the indexes of examples (document column) for each cluster number (request column), correct?
    So I will then need to somehow match these indexes with the original examples if I want individual distances and not only the min / max?

     

    [Attached screenshots: Screenshot 2018-06-21 11.16.28.png, Screenshot 2018-06-21 11.11.49.png]

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    Well, you get the distance to each centroid, so you would need to add an Aggregate afterwards to figure out the closest cluster centroid.

     

    Cheers,

    Martin
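
The aggregation step described in this reply amounts to taking, for each example, the minimum over its distances to all centroids. A minimal Python sketch, assuming Euclidean distance; the function name `nearest_centroid` and the sample points are hypothetical:

```python
import math

def nearest_centroid(examples, centroids):
    """For each example, return (index of closest centroid, distance to it)."""
    result = []
    for example in examples:
        # Distance from this example to every centroid.
        distances = [math.dist(example, c) for c in centroids]
        # Aggregate: keep the centroid with the minimum distance.
        best = min(range(len(centroids)), key=distances.__getitem__)
        result.append((best, distances[best]))
    return result

# Made-up sample data for illustration only.
closest = nearest_centroid([(1.0, 0.0), (9.0, 10.0)],
                           [(0.0, 0.0), (10.0, 10.0)])
```

Here `closest[i]` pairs the i-th example with its nearest centroid and the corresponding distance.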

  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Clear @mschmitz

     

    But is there a reason these distances were not included in the default output example set for clustering operators?

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    @kypexin,

    Do you mean all distances or only the lowest?

    Storing all distances would increase memory use quite a lot. I can see some reason to output the distance to the assigned cluster as a kind of "confidence". Is that what you are asking for?

     

    BR,

    Martin

  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    @mschmitz not ALL distances but, as you said, for each example only the distance to its 'parent' cluster. And yes, this can serve as an analogue of a confidence value.
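
What is requested here, one distance per example to its own cluster centre, also answers the original question about most and least representative examples. A hedged Python sketch, assuming cluster labels are already assigned and Euclidean distance is used; `rank_within_clusters` and the sample data are hypothetical:

```python
import math
from collections import defaultdict

def rank_within_clusters(examples, labels, centroids):
    """Group examples by assigned cluster and sort them by distance to that centroid.

    Returns {cluster label: [(distance, example index), ...]} sorted ascending,
    so the first entry is the most representative example and the last the least.
    """
    by_cluster = defaultdict(list)
    for idx, (example, label) in enumerate(zip(examples, labels)):
        # Only the distance to the example's own ('parent') centroid is stored.
        by_cluster[label].append((math.dist(example, centroids[label]), idx))
    return {label: sorted(pairs) for label, pairs in by_cluster.items()}

# Made-up sample data for illustration only.
examples = [(1.0, 0.0), (0.0, 3.0), (0.0, 0.5), (10.0, 12.0)]
labels = [0, 0, 0, 1]
centroids = [(0.0, 0.0), (10.0, 10.0)]
ranked = rank_within_clusters(examples, labels, centroids)
most_representative = ranked[0][0]    # (distance, example index) closest to centroid 0
least_representative = ranked[0][-1]  # farthest member of cluster 0
```

Storing a single number per example keeps the memory cost linear in the number of examples, in contrast to keeping all example-to-centroid distances.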

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    @kypexin

    Good question, especially because k-means, at least, explicitly calculates this number... @sebastian_land wrote it, so maybe he knows?

     

    And maybe @sgenzer can make a ticket out of this :)

     

    BR,

    Martin

  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    @mschmitz

    OK, nice. It seems I have just thrown in a little idea :)

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    I certainly can. This is a feature request, not a bug - correct?
