"Similarity Measure into Clustering"

B_Miner Member Posts: 72  Maven
edited June 9 in Help
Hi Guys,

Is it possible to use RM to create a distance matrix (say, Jaccard similarity) and feed this matrix into a cluster analysis? If so, are there any examples?

Thanks!

Brian

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,527   Unicorn
    Hi Brian,
    Both are possible. You can create a distance matrix with the Data to Similarity operator by selecting Jaccard Similarity as the distance function. You can also run a clustering with the same distance function, for example with kMedoids.

    Greetings,
      Sebastian
  • B_Miner Member Posts: 72  Maven
    Hi Sebastian,

    I tried to connect a Data to Similarity operator to k-means and got an error. Is kMedoids the only clustering operator that can take a distance matrix as input? Here is an example process that fails because of the type of input passed to k-means:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="296" width="280">
          <operator activated="true" class="generate_nominal_data" expanded="true" height="60" name="Generate Nominal Data" width="90" x="45" y="165"/>
          <operator activated="true" class="data_to_similarity" expanded="true" height="76" name="Data to Similarity" width="90" x="112" y="30">
            <parameter key="measure_types" value="NominalMeasures"/>
            <parameter key="nominal_measure" value="JaccardSimilarity"/>
          </operator>
          <operator activated="true" class="k_means" expanded="true" height="76" name="Clustering" width="90" x="179" y="165"/>
          <connect from_op="Generate Nominal Data" from_port="output" to_op="Data to Similarity" to_port="example set"/>
          <connect from_op="Data to Similarity" from_port="similarity" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>


  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,527   Unicorn
    Hi,
    K-Means always uses Euclidean distance; it is simply part of the algorithm. In kMedoids you can select the distance function, but you cannot pass in a similarity matrix. It calculates the similarities from the given example set as it needs them.

    Greetings,
      Sebastian
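
The setup Sebastian describes might look roughly like the process below: the example set goes straight into the clustering operator, which is configured with the Jaccard measure and computes the similarities itself. This is only a sketch adapted from the process posted above. The operator class `k_medoids` is an assumption, and so is the idea that it accepts the same `measure_types` and `nominal_measure` parameter keys as Data to Similarity; please check both against your RapidMiner version.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="296" width="280">
      <operator activated="true" class="generate_nominal_data" expanded="true" height="60" name="Generate Nominal Data" width="90" x="45" y="165"/>
      <!-- Assumed operator class: k_medoids. No Data to Similarity operator is used here;
           the clustering operator derives the Jaccard similarities from the example set. -->
      <operator activated="true" class="k_medoids" expanded="true" height="76" name="Clustering" width="90" x="179" y="165">
        <!-- Assumed to mirror the parameter keys used by Data to Similarity above -->
        <parameter key="measure_types" value="NominalMeasures"/>
        <parameter key="nominal_measure" value="JaccardSimilarity"/>
      </operator>
      <!-- The example set, not a similarity object, is connected to the clustering input -->
      <connect from_op="Generate Nominal Data" from_port="output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
```

The only changes from the failing process are dropping the Data to Similarity operator and moving the measure parameters onto the clustering operator, so the clustering input port receives the example set it expects.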