Using Map Clustering on Labels to align cluster predictions with ground-truth

nicolas_richard Member Posts: 5 Contributor II
edited October 1 in Help
Hello everybody,

My name is Nicolas. I'm a French student working on a research study.

First, a little background: my topic is the discovery of communities in social networks. I do this by clustering the graph of friends of a given user.

Once I have clustered the graph, I need to align the discovered clusters with the ground-truth communities in order to evaluate my method. In simpler words: I want to compare the results of a clustering function to the real clusters. To do so, I need to find which discovered cluster matches which real cluster.
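To make the alignment step concrete, here is a minimal Python sketch of one common way to do it: pick the one-to-one cluster-to-label assignment that maximizes the number of agreeing examples. This is only an illustration, not a description of what "Map Clustering on Labels" does internally. The function name `map_clusters_to_labels` and the toy data are hypothetical, and the brute-force search is only practical for a handful of clusters.

```python
from collections import Counter
from itertools import permutations


def map_clusters_to_labels(cluster_ids, true_labels):
    """Return the one-to-one cluster -> label mapping with the most agreements.

    Brute force over label permutations, so only suitable for a few clusters.
    Assumes there are at least as many distinct labels as clusters;
    otherwise no mapping exists and None is returned.
    """
    clusters = sorted(set(cluster_ids))
    labels = sorted(set(true_labels))
    # counts[(c, l)] = number of examples in cluster c whose true label is l
    counts = Counter(zip(cluster_ids, true_labels))
    best_mapping, best_hits = None, -1
    for chosen in permutations(labels, len(clusters)):
        hits = sum(counts[(c, l)] for c, l in zip(clusters, chosen))
        if hits > best_hits:
            best_mapping, best_hits = dict(zip(clusters, chosen)), hits
    return best_mapping


# Hypothetical toy data: cluster ids from a clustering run and known communities.
cluster_ids = ["cluster_0", "cluster_0", "cluster_1", "cluster_1",
               "cluster_2", "cluster_2", "cluster_2"]
true_labels = ["family", "family", "work", "school",
               "school", "school", "work"]

mapping = map_clusters_to_labels(cluster_ids, true_labels)
# Translate each cluster id into its mapped label to get comparable predictions.
predicted = [mapping[c] for c in cluster_ids]
print(mapping)
print(predicted)
```

Once every cluster id has been translated into a label, the result can be compared with the ground truth like any classifier output.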

I'm trying to use the "Map Clustering on Labels" function.

I've created a simple example to get to know this function.
1. I generate some data
2. I multiply this data source
3. I apply two different clustering algorithms to it (k-Means and k-Medoids, each searching for 3 clusters).
=> I want to compare the results (ideally, find the precision and recall for each cluster as well as the Balanced Error Rate).
4. I use "Extract Cluster Prototypes" to convert the result of my second clustering from a "model" to an "example set".
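For reference, the measures mentioned in step 3 can be sketched in a few lines of Python once clusters have been mapped to labels. This assumes the usual definitions: per-class precision and recall, and the Balanced Error Rate taken as one minus the mean recall over the ground-truth classes. The function names and the toy data are hypothetical.

```python
def per_class_metrics(true_labels, predicted_labels):
    """Return {class: (precision, recall)} for every class seen in either list."""
    pairs = list(zip(true_labels, predicted_labels))
    metrics = {}
    for cls in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        # Report 0.0 instead of dividing by zero when a class is never predicted
        # (precision) or never occurs in the ground truth (recall).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[cls] = (precision, recall)
    return metrics


def balanced_error_rate(true_labels, predicted_labels):
    """One minus the mean recall over the classes present in the ground truth."""
    metrics = per_class_metrics(true_labels, predicted_labels)
    recalls = [metrics[cls][1] for cls in sorted(set(true_labels))]
    return 1.0 - sum(recalls) / len(recalls)


# Hypothetical toy data: ground truth and cluster predictions already mapped to labels.
true_labels = ["family", "family", "work", "school", "school", "school", "work"]
predicted_labels = ["family", "family", "work", "work", "school", "school", "school"]

metrics = per_class_metrics(true_labels, predicted_labels)
ber = balanced_error_rate(true_labels, predicted_labels)
for cls, (precision, recall) in metrics.items():
    print(f"{cls}: precision={precision:.2f} recall={recall:.2f}")
print(f"BER={ber:.3f}")
```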

My problem is that it doesn't work as expected and I don't understand the error I get.

This makes me realize that I may not really understand how to use "Map Clustering on Labels" properly. I think it's the most appropriate function for what I want to do, but maybe it's not. All your remarks will be greatly appreciated.


Here is my process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="431" width="1016">
      <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="165">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|a4|a3|a2|a1"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.2.008" expanded="true" height="76" name="Clustering" width="90" x="313" y="30">
        <parameter key="k" value="3"/>
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
      </operator>
      <operator activated="true" class="replace" compatibility="5.2.008" expanded="true" height="76" name="Replace" width="90" x="313" y="300">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="id"/>
        <parameter key="include_special_attributes" value="true"/>
        <parameter key="replace_what" value="id_(.*)"/>
        <parameter key="replace_by" value="$1"/>
      </operator>
      <operator activated="true" class="guess_types" compatibility="5.2.008" expanded="true" height="76" name="Guess Types" width="90" x="447" y="300">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="id"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="guess_types" compatibility="5.2.008" expanded="true" height="76" name="Guess Types (2)" width="90" x="447" y="165">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="id"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="join" compatibility="5.2.008" expanded="true" height="76" name="Join" width="90" x="581" y="120">
        <list key="key_attributes"/>
      </operator>
      <operator activated="true" class="map_clustering_on_labels" compatibility="5.2.008" expanded="true" height="76" name="Map Clustering on Labels" width="90" x="715" y="30"/>
      <operator activated="true" class="performance" compatibility="5.2.008" expanded="true" height="76" name="Performance" width="90" x="849" y="30"/>
      <connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Select Attributes" from_port="original" to_op="Replace" to_port="example set input"/>
      <connect from_op="Clustering" from_port="cluster model" to_op="Map Clustering on Labels" to_port="cluster model"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Guess Types (2)" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_op="Guess Types" to_port="example set input"/>
      <connect from_op="Guess Types" from_port="example set output" to_op="Join" to_port="right"/>
      <connect from_op="Guess Types (2)" from_port="example set output" to_op="Join" to_port="left"/>
      <connect from_op="Guess Types (2)" from_port="original" to_port="result 2"/>
      <connect from_op="Join" from_port="join" to_op="Map Clustering on Labels" to_port="example set"/>
      <connect from_op="Map Clustering on Labels" from_port="example set" to_op="Performance" to_port="labelled data"/>
      <connect from_op="Performance" from_port="performance" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>


This is the error I get (from the log file):

"
Feb 20, 2013 12:20:15 PM SEVERE: Process failed: operator cannot be executed. Check the log messages...
Feb 20, 2013 12:20:15 PM SEVERE: Here:          Process[1] (Process)
          subprocess 'Main Process'
            +- Generate Data[1] (Generate Data)
            +- Multiply[1] (Multiply)
            +- Clustering (2)[1] (k-Medoids)
            +- Extract Cluster Prototypes[1] (Extract Cluster Prototypes)
            +- Clustering[1] (k-Means)
      ==>  +- Map Clustering on Labels[1] (Map Clustering on Labels)
Feb 20, 2013 12:20:15 PM SEVERE: java.lang.NullPointerException
"

Thanks a lot for your help.

Nicolas

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458   Unicorn
    Hello

    It looks like you are doing exactly the right thing. I simplified your process: the special attributes have no effect on the clustering, so they can be retained, which avoids the multiplying and joining.

    I've attached it below:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.005">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.005" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="5.3.005" expanded="true" height="60" name="Retrieve" width="90" x="112" y="30">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="k_means" compatibility="5.3.005" expanded="true" height="76" name="Clustering" width="90" x="313" y="30">
            <parameter key="k" value="3"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="numerical_measure" value="CosineSimilarity"/>
          </operator>
          <operator activated="true" class="map_clustering_on_labels" compatibility="5.3.005" expanded="true" height="76" name="Map Clustering on Labels" width="90" x="514" y="30"/>
          <operator activated="true" class="performance" compatibility="5.3.005" expanded="true" height="76" name="Performance" width="90" x="715" y="30"/>
          <connect from_op="Retrieve" from_port="output" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_op="Map Clustering on Labels" to_port="cluster model"/>
          <connect from_op="Clustering" from_port="clustered set" to_op="Map Clustering on Labels" to_port="example set"/>
          <connect from_op="Map Clustering on Labels" from_port="example set" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="result 1"/>
          <connect from_op="Performance" from_port="example set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    Regards

    Andrew
  • nicolas_richard Member Posts: 5 Contributor II
    Thanks a lot, I'll give this a try tomorrow.

    Actually, while searching for an answer earlier today, I came across your blog through a post on Stack Overflow. Your code helped me, but not enough :D

    I hope your process will work with my setup.

    Nicolas.
  • nicolas_richard Member Posts: 5 Contributor II
    Hi,

    I tried to work from your example. Of course, your example works fine, but applying the same approach to my own process didn't work. Nonetheless, your example really helped me narrow down the problem. I think the root cause is the way my data is labeled.

    I infer this from examining the data, as you can see in these two screenshots.

    How the data is labeled in your example: http://imageshack.us/photo/my-images/703/irisvc.png/

    How the data is labeled in my example: http://imageshack.us/photo/my-images/543/myproblem.png/

    I tried to change this using "Exchange Roles" and "Set Role". That didn't work either.

    I think I need a better understanding of how data and labels are handled.

    What do you think?

    Thanks again.

  • awchisholm RapidMiner Certified Expert, Member Posts: 458   Unicorn
    Hello,

    It's probably because the label is of type real. Cluster values are nominal, so a real-valued label cannot be mapped to them.

    Regards

    Andrew
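
The diagnosis above suggests turning the real-valued label into a nominal one before mapping clusters onto it. In RapidMiner this would presumably be done with a type-conversion operator such as "Numerical to Polynominal" (an assumption; the thread does not name one). The idea itself can be sketched in plain Python. The helper `to_nominal`, its `max_categories` guard, and the toy values are hypothetical.

```python
def to_nominal(label_values, max_categories=10):
    """Convert numeric label values into nominal (string) categories.

    Each distinct numeric value becomes one category. A column with more
    distinct values than max_categories is rejected, since it looks like a
    genuinely continuous target rather than a class label.
    """
    distinct = sorted(set(label_values))
    if len(distinct) > max_categories:
        raise ValueError(
            f"{len(distinct)} distinct values: label looks continuous, not categorical"
        )
    names = {value: f"class_{index}" for index, value in enumerate(distinct)}
    return [names[value] for value in label_values]


# Hypothetical label column stored as real numbers.
real_labels = [0.0, 0.0, 1.0, 2.0, 2.0]
nominal_labels = to_nominal(real_labels)
print(nominal_labels)
```

If the label is truly continuous (many distinct values), there is no fixed set of classes to map clusters onto, and the values would first have to be binned into categories.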