create clusters of the same/similar size

dparaskevop Member Posts: 11 Contributor II
edited June 2019 in Help

Hello all, 


How can I keep my clusters balanced, e.g. groups of 10 people with similar characteristics? At the moment, when I use k-means, I get clusters with 18 people and clusters with 3 people on the same data set. Can I somehow restrict the number of objects per cluster?


Many thanks, 




  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    This requirement isn't a classic application of clustering based on machine learning algorithms. While there are some constrained clustering algorithms out there that can do what you want, I am not aware of any that are implemented in RapidMiner clustering operators (although I'd love to see one, because this question does get asked from time to time). You might be able to find something in R or Python that could be used within RapidMiner, though.
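    As an illustration of what such a constrained approach could look like in Python, here is a minimal sketch of a size-balanced k-means variant. It is not a RapidMiner operator and not taken from any particular library: the helper name balanced_kmeans and its parameters are made up for this example. The idea is to replace the usual "nearest center" step with an optimal assignment of points to a fixed number of slots per cluster, using scipy.optimize.linear_sum_assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def balanced_kmeans(X, n_clusters, n_iter=20, seed=0):
    """Hypothetical helper: k-means-style clustering with (almost) equal cluster sizes."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct data points as centers.
    centers = X[rng.choice(n, size=n_clusters, replace=False)]
    # One "slot" per point; slot j belongs to cluster j % n_clusters,
    # so cluster sizes can differ by at most one.
    slot_cluster = np.arange(n) % n_clusters
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Cost of putting point i into slot j: squared distance to that slot's center.
        cost = cdist(X, centers, metric="sqeuclidean")[:, slot_cluster]
        rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
        new_labels = np.empty(n, dtype=int)
        new_labels[rows] = slot_cluster[cols]
        if np.array_equal(new_labels, labels):
            break  # assignment is stable
        labels = new_labels
        # Recompute each center as the mean of its assigned points.
        centers = np.vstack([X[labels == k].mean(axis=0) for k in range(n_clusters)])
    return labels


# Example: 100 random 2-D points split into 10 groups of 10.
X = np.random.default_rng(42).random((100, 2))
labels = balanced_kmeans(X, n_clusters=10)
sizes = np.bincount(labels, minlength=10)
```

    The assignment step works on an N x N cost matrix, so this sketch is only practical for data sets of up to a few thousand rows.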

    Alternatively, there are of course operators for simple binning by frequency, so you could come up with some kind of synthesized attribute combining the values of other attributes and then create groups based on that.
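    To make the binning idea concrete, here is a rough Python sketch (the same steps could be built from RapidMiner operators). Everything specific in it is an assumption for illustration: the column names are invented, and the synthesized attribute is simply the first principal component of the standardized attributes, which is only one of many ways to combine them.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 people described by three numeric attributes.
rng = np.random.default_rng(0)
people = pd.DataFrame(rng.random((100, 3)), columns=["age", "income", "score"])

# Synthesize one attribute: standardize, then project onto the first principal component.
Z = ((people - people.mean()) / people.std()).to_numpy()
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
people["synth"] = Z @ Vt[0]

# Equal-frequency binning. Ranking first breaks ties, so every bin gets the same count.
people["group"] = pd.qcut(people["synth"].rank(method="first"), q=10, labels=False)
group_sizes = people["group"].value_counts().sort_index()
```

    Each group then holds exactly 10 people, but people in the same group are only similar along the synthesized attribute, not necessarily in the full attribute space.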

    @mschmitz any other thoughts on this one?


    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @dparaskevop,


    Interesting topic.

    As suggested by @Telcontar120, there are resources on the internet. So I did not reinvent the wheel, and you can find here a process using a Python script (via the Execute Python operator):

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
      <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="8.1.003" expanded="true" height="68" name="Generate Data" width="90" x="112" y="34">
            <parameter key="number_of_attributes" value="2"/>
            <parameter key="attributes_lower_bound" value="0.0"/>
            <parameter key="attributes_upper_bound" value="1.0"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="8.1.003" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="label"/>
            <parameter key="invert_selection" value="true"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="103" name="Execute Python" width="90" x="380" y="34">
            <parameter key="script" value="import numpy&#10;import pandas as pd&#10;from scipy.spatial.distance import pdist, squareform&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10;    # Parameters: set exactly one of K or G, leave the other as None&#10;    K = None  # number of clusters&#10;    G = 30  # group size&#10;&#10;    def error(K, m, D):&#10;        &quot;&quot;&quot;return average distances between data in one cluster, averaged over all clusters&quot;&quot;&quot;&#10;        E = 0&#10;        for k in range(K):&#10;            i = numpy.where(m == k)[0]  # indices of datapoints belonging to class k&#10;            E += numpy.mean(D[numpy.ix_(i, i)])  # submatrix of within-cluster distances&#10;        return E / K&#10;&#10;    #numpy.random.seed(0)  # repeatability&#10;    N, n = data.shape&#10;&#10;    if G is None and K is not None:&#10;        G = N // K  # group size&#10;    elif K is None and G is not None:&#10;        K = N // G  # number of clusters&#10;    else:&#10;        raise Exception('must specify either K or G')&#10;    D = squareform(pdist(data))  # distance matrix&#10;    m = numpy.random.permutation(N) % K  # initial membership&#10;    E = error(K, m, D)&#10;&#10;    t = 1&#10;    while True:&#10;        E_p = E&#10;        for a in range(N):  # systematically&#10;            for b in range(a):&#10;                m[a], m[b] = m[b], m[a]  # exchange membership&#10;                E_t = error(K, m, D)&#10;                if E_t &lt; E:&#10;                    E = E_t&#10;                else:&#10;                    m[a], m[b] = m[b], m[a]  # put them back&#10;&#10;        if E_p == E:&#10;            break&#10;        t += 1&#10;&#10;    cluster = []&#10;&#10;    for i in range(N):&#10;        cluster.append(m[i])&#10;&#10;    Cluster = pd.DataFrame(data = cluster, columns = ['cluster'])&#10;    data = data.join(Cluster)&#10;&#10;    # connect 1 output port to see the results&#10;    return data"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Execute Python" to_port="input 1"/>
          <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
          <connect from_op="Execute Python" from_port="output 2" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>


    For example, the process above clusters a simple "textbook" dataset with 2 attributes:

     - 100 examples chosen at random in range [0,1].

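     - Group size G = 30, so the script computes K = 100 // 30 = 3 clusters; the 10 leftover examples are spread across them, giving clusters of 33 or 34 examples.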



    In practice, this script generalizes to a space of dimension n, so it can be applied to your project on people's characteristics.
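    To try the same swap heuristic outside RapidMiner on data with more than 2 attributes, a standalone sketch could look like the following. The function name equal_size_clusters and the random 5-attribute test data are assumptions made for this example; the logic follows the script in the process (random balanced start, then pairwise membership swaps that are kept only if they reduce the average within-cluster distance).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def equal_size_clusters(X, group_size, seed=0):
    """Hypothetical standalone variant of the swap heuristic for an (N, n) array."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    K = N // group_size                # number of clusters
    D = squareform(pdist(X))           # pairwise Euclidean distances, any dimension n
    rng = np.random.default_rng(seed)
    m = rng.permutation(N) % K         # balanced random initial membership

    def error(m):
        # Mean within-cluster distance, averaged over all clusters.
        members = (np.where(m == k)[0] for k in range(K))
        return np.mean([D[np.ix_(idx, idx)].mean() for idx in members])

    E = error(m)
    improved = True
    while improved:
        improved = False
        for a in range(N):
            for b in range(a):
                if m[a] == m[b]:
                    continue           # swapping inside one cluster changes nothing
                m[a], m[b] = m[b], m[a]          # exchange membership
                E_t = error(m)
                if E_t < E:
                    E, improved = E_t, True      # keep the swap
                else:
                    m[a], m[b] = m[b], m[a]      # put them back
    return m


# Example: 40 random points with 5 attributes, groups of 10.
X = np.random.default_rng(1).random((40, 5))
labels = equal_size_clusters(X, group_size=10)
sizes = np.bincount(labels)
```

    Because swaps never change how many points a cluster holds, the sizes are fixed by the initial assignment. When N is not a multiple of the group size, the leftover points are spread across the clusters, so some groups end up larger than requested. Every candidate swap recomputes the error, so this is slow for large N.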


    I hope that these elements will be useful to you.










  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Very slick!  When RapidMiner native operators fail you can always count on @lionelderkrikor to come to the rescue with a clever Python script!
