Clustering with GPS coordinates, but now additionally with the population?

CausalityvsCorr Member Posts: 17 Contributor II
edited November 2018 in Help

Earlier I posted a question about how to cluster buildings by their GPS coordinates. Based on the feedback, I managed to get clustering outputs that also make sense in practice. Of the many methods available in RapidMiner, k-means produced the most useful results.

 

Now I would like to extend the clustering by taking the population of those buildings into account. Not as a clustering attribute, but by defining a minimum and maximum population for the clusters, so that, for example, the average population per cluster is between 300 and 500.

Is there any way to define this kind of process?


Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    Have you considered using the population as a weight? This should yield something very similar.

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • CausalityvsCorr Member Posts: 17 Contributor II

    Thank you for the reply.

     

    I have not considered it, but I will test how weighting with population works in this specific situation.

     

    To be exact, how should I proceed with "population as a weight", meaning what operator(s) should I use?

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    Use Set Role and set your population attribute to the role weight. Afterwards, make sure that the clustering algorithm you use supports weights.
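
To make concrete what an example weight does in a k-means style algorithm, here is a minimal pure-Python sketch. It is not RapidMiner's implementation; the function name `weighted_kmeans`, the sample buildings, and the use of plain Euclidean distance on already projected coordinates are all assumptions made for illustration.

```python
import math

def weighted_kmeans(points, weights, centroids, iterations=20):
    """Lloyd-style k-means where each point counts `weight` times in the centroid update."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid. The weights play no role here.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: weighted mean of the members of each cluster.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == j]
            total = sum(w for _, w in members)
            if total == 0:
                new_centroids.append(old)  # keep the centroid of an empty cluster
                continue
            new_centroids.append(tuple(
                sum(p[d] * w for p, w in members) / total for d in range(len(old))
            ))
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical buildings as (x, y) in projected metres, with their populations.
buildings = [(0.0, 0.0), (10.0, 0.0), (100.0, 0.0), (110.0, 0.0)]
population = [90, 10, 50, 50]
labels, centroids = weighted_kmeans(
    buildings, population, centroids=[(0.0, 0.0), (110.0, 0.0)]
)
cluster_population = [
    sum(w for w, l in zip(population, labels) if l == j) for j in range(2)
]
```

In this sketch the weights only pull each centroid toward the heavily populated buildings. Nothing in the assignment step limits how much population ends up in a cluster.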

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    K-means clustering does support weights.  However, I don't think k-means by itself will do what the original request was asking for because based on my understanding of k-means, it does not do anything to ensure that the resulting clusters are the same size (whether weighted or unweighted). @mschmitz am I missing something about the algorithm?

    So if you want to constrain each cluster to have a minimum and maximum weighted size, how would you implement those constraints with k-means?

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • CausalityvsCorr Member Posts: 17 Contributor II

    Thank you for the feedback.

    Regarding the weighting, I did not see any difference in the clustering results, with weighting versus without it. I tested with k-means and k-means (kernel).  I think we are talking in this case about sample weighting, not attribute weighting?

  • CausalityvsCorr Member Posts: 17 Contributor II

    I tend to agree with Telcontar120 that sample weighting in connection with k-means clustering is not a fruitful way to "regulate" the clustering results. At least not in my special case, where the buildings are clustered based on their GPS coordinates, but in such a way that the population in each cluster is, on average, say 300.

     

    Any proposals on how to proceed, or should I give up?

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I am not a clustering expert, so comments from others are welcome here @mschmitz

    If you know the total population you have represented, and you know you want each cluster to be between 300-500, what does that imply about the total number of clusters you are looking for?
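
As a rough worked example of that question: the average cluster population is the total population divided by k, so a target range for the average translates directly into a range for k. The function name and the total of 12,000 below are made up for illustration.

```python
import math

def cluster_count_range(total_population, min_size, max_size):
    """Values of k for which total_population / k lies in [min_size, max_size]."""
    k_min = math.ceil(total_population / max_size)   # fewer clusters would exceed max_size on average
    k_max = math.floor(total_population / min_size)  # more clusters would fall below min_size on average
    return k_min, k_max

# Hypothetical total; replace it with the sum of your population attribute.
k_min, k_max = cluster_count_range(total_population=12_000, min_size=300, max_size=500)
print(k_min, k_max)  # 24 40
```

If `k_min` comes out larger than `k_max`, no number of clusters can give an average in the requested range. Note that this only bounds the average; individual clusters can still be far smaller or larger.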

    It is possible that you may be able to approximate a solution using "DBSCAN with weights" from the Mannheim toolbox extension. That operator will at least interpret the weights as instance counts, although you will need to play with the epsilon parameter values to see whether you can get it to produce a set of clusters that satisfy your conditions.

    Otherwise, you may need to program your own routine to do this (perhaps in Python?) since it is not really a conventional clustering problem.  In this case, you basically want to impose constraints on weighted cluster size, iteratively generating clusters based on proximity that are neither too small nor too large.  In theory, that means doing something like k-means but letting the size constraints override the proximity metrics.
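
One possible shape for such a routine, as a greedy sketch only: the assignment step of k-means is replaced by "nearest centroid that still has room", so the size cap overrides proximity. It enforces only the upper bound; the lower bound has to be checked afterwards. All names, the sample data, and the Euclidean distance on projected coordinates are assumptions, and the result is not guaranteed to be optimal.

```python
import math

def assign_with_cap(points, weights, centroids, max_size):
    """Give each point to the nearest centroid whose load stays within max_size."""
    load = [0] * len(centroids)
    labels = [None] * len(points)  # None means no cluster had room
    # Place heavy buildings first so they are not left without a cluster.
    for i in sorted(range(len(points)), key=lambda i: -weights[i]):
        nearest_first = sorted(
            range(len(centroids)),
            key=lambda j: math.dist(points[i], centroids[j]),
        )
        for j in nearest_first:
            if load[j] + weights[i] <= max_size:
                labels[i] = j
                load[j] += weights[i]
                break
    return labels, load

def constrained_kmeans(points, weights, centroids, max_size, iterations=10):
    """Alternate capped assignment and weighted centroid updates."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        labels, _ = assign_with_cap(points, weights, centroids, max_size)
        for j in range(len(centroids)):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == j]
            total = sum(w for _, w in members)
            if total > 0:
                centroids[j] = tuple(
                    sum(p[d] * w for p, w in members) / total
                    for d in range(len(centroids[j]))
                )
    labels, load = assign_with_cap(points, weights, centroids, max_size)
    return labels, load, centroids

# Hypothetical buildings on a line; the left group alone would hold 600 people.
buildings = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
population = [200, 200, 200, 150, 150]
labels, load, centroids = constrained_kmeans(
    buildings, population, centroids=[(1.0, 0.0), (10.5, 0.0)], max_size=500
)
too_small = [j for j, total in enumerate(load) if total < 300]  # clusters under the minimum
```

In this toy case the third building is pushed into the right-hand cluster because the left one is full. Clusters listed in `too_small` would need a further step, for example merging them with a neighbour or rerunning with fewer centroids.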

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • CausalityvsCorr Member Posts: 17 Contributor II

    Excellent thinking!

    Thanks
