"placing new instances in clusters using cluster model"

pep Member Posts: 7 Contributor II
edited May 2019 in Help

Hi guys,
In addition to clustering a dataset, RapidMiner can produce cluster models, store them in repositories, and write them to files. But how can an already built cluster model be applied to a distinct but compatible dataset? I presume this is possible, given that cluster models exist. For instance, if one wanted to place each new instance in the appropriate cluster, how could this be done in a process? Cheers!

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    I am not sure whether all cluster models support this, but at least for the centroid-based models (K-Means, K-Medoids) you can simply use the "Apply Model" operator. This is perfectly analogous to how supervised models are applied. You will find a simple example below.

    Cheers,
    Ingo

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
        <process expanded="true" height="224" width="614">
          <operator activated="true" class="generate_data" compatibility="5.1.008" expanded="true" height="60" name="Generate Data" width="90" x="45" y="120">
            <parameter key="target_function" value="gaussian mixture clusters"/>
            <parameter key="number_examples" value="1000"/>
            <parameter key="number_of_attributes" value="2"/>
          </operator>
          <operator activated="true" class="split_data" compatibility="5.1.008" expanded="true" height="94" name="Split Data" width="90" x="179" y="120">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.7"/>
              <parameter key="ratio" value="0.3"/>
            </enumeration>
          </operator>
          <operator activated="true" class="k_means" compatibility="5.1.008" expanded="true" height="76" name="Clustering" width="90" x="313" y="30"/>
          <operator activated="true" class="apply_model" compatibility="5.1.008" expanded="true" height="76" name="Apply Model" width="90" x="447" y="120">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_op="Clustering" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Clustering" from_port="cluster model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="90"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
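
As a rough illustration of what applying a centroid-based model to new data amounts to, here is a minimal Python sketch. It assumes the model assigns each new example to its closest centroid by Euclidean distance, which is the usual way k-means style models are applied; the centroid values, the example points, and the function name are all made up for illustration and are not taken from RapidMiner.

```python
import math

def assign_to_nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical centroids, as if learned by k-means on the 70% training partition.
centroids = [(0.0, 0.0), (10.0, 10.0)]

# New, previously unseen examples (the role of the 30% partition in the process).
new_examples = [(1.0, -0.5), (9.0, 11.0), (4.0, 4.5)]

# Each new example gets the index of its nearest centroid as its cluster.
assignments = [assign_to_nearest_centroid(p, centroids) for p in new_examples]
print(assignments)  # [0, 1, 0]
```

No re-clustering happens here: the centroids stay fixed and only the new examples are labelled, which is why this works in the same way as applying a supervised model.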
  • pep Member Posts: 7 Contributor II
    Thank you, that makes sense.
    For other schemes such as DBSCAN, Apply Model works too, but it asks for an id attribute, which it seems to match against the ids of the examples in the originally clustered dataset to look up their cluster. That makes little sense for new data, so in this case it is sounder to cluster the original dataset, train a 1- or 3-nearest-neighbour learner with the cluster attribute as the label, and then apply that model to the second dataset so that its examples are placed in clusters via classification.
    Cheers.
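
The nearest-neighbour workaround can be sketched in plain Python as follows. This is only a sketch under stated assumptions: the clustered examples and their labels are invented stand-ins for the output of a density-based clustering, and `knn_cluster` is a hypothetical helper, not a RapidMiner function.

```python
import math
from collections import Counter

def knn_cluster(point, clustered_examples, k=1):
    """Predict a cluster for `point` by majority vote among its k nearest
    already-clustered examples.

    `clustered_examples` is a list of (coordinates, cluster_label) pairs.
    """
    nearest = sorted(clustered_examples, key=lambda ex: math.dist(point, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical result of clustering the original dataset:
# the cluster attribute now plays the role of the label.
clustered = [
    ((0.0, 0.0), "cluster_0"),
    ((0.5, 0.2), "cluster_0"),
    ((0.1, 0.6), "cluster_0"),
    ((5.0, 5.0), "cluster_1"),
    ((5.3, 4.8), "cluster_1"),
    ((4.9, 5.4), "cluster_1"),
]

# Examples from the second dataset are placed in clusters via classification.
new_examples = [(0.3, 0.3), (5.1, 5.1)]
predicted = [knn_cluster(p, clustered, k=3) for p in new_examples]
print(predicted)  # ['cluster_0', 'cluster_1']
```

Using an odd k such as 1 or 3 keeps ties unlikely when only a few clusters compete for the vote.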

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Yep, I totally agree. For clustering schemes like DBSCAN, agglomerative clustering, and others, it is probably much better to learn a supervised model from the clustered data and apply that model to the new data.

    Cheers,
    Ingo