Where did Average Cluster Distance go?

Unperterbed Member Posts: 3 Contributor I
Hello! I'm teaching intro to ML and it seems like something in RM changed between quarters.

In 9.8.001 the Average Cluster Distance was output to the results window, but that doesn't seem to be the case in 9.9.000.
This was with a simple data set (3 integer features), z-transformation normalization, k-means clustering (k=4), and the cluster model visualizer. Other than k=4, the default parameters were used.


I know I can use the Cluster Distance Performance operator, but it was *so* convenient to get that info from the model visualizer.

Here's my XML, in case you can point out something I'm missing!
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.9.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.9.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.9.000" expanded="true" height="68" name="Retrieve Session 8 -- Student Data" width="90" x="447" y="85">
        <parameter key="repository_entry" value="//Local Repository/Class Demonstrations/Session 8 -- Student Data"/>
      </operator>
      <operator activated="true" class="normalize" compatibility="9.9.000" expanded="true" height="103" name="Normalize" width="90" x="648" y="85">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="method" value="Z-transformation"/>
        <parameter key="min" value="0.0"/>
        <parameter key="max" value="1.0"/>
        <parameter key="allow_negative_values" value="false"/>
      </operator>
      <operator activated="true" class="concurrency:k_means" compatibility="9.9.000" expanded="true" height="82" name="Clustering" width="90" x="782" y="85">
        <parameter key="add_cluster_attribute" value="true"/>
        <parameter key="add_as_label" value="false"/>
        <parameter key="remove_unlabeled" value="false"/>
        <parameter key="k" value="4"/>
        <parameter key="max_runs" value="10"/>
        <parameter key="determine_good_start_values" value="true"/>
        <parameter key="measure_types" value="BregmanDivergences"/>
        <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
        <parameter key="nominal_measure" value="NominalDistance"/>
        <parameter key="numerical_measure" value="EuclideanDistance"/>
        <parameter key="divergence" value="SquaredEuclideanDistance"/>
        <parameter key="kernel_type" value="radial"/>
        <parameter key="kernel_gamma" value="1.0"/>
        <parameter key="kernel_sigma1" value="1.0"/>
        <parameter key="kernel_sigma2" value="0.0"/>
        <parameter key="kernel_sigma3" value="2.0"/>
        <parameter key="kernel_degree" value="3.0"/>
        <parameter key="kernel_shift" value="1.0"/>
        <parameter key="kernel_a" value="1.0"/>
        <parameter key="kernel_b" value="0.0"/>
        <parameter key="max_optimization_steps" value="100"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="true" class="cluster_distance_performance" compatibility="9.9.000" expanded="true" height="103" name="Performance" width="90" x="983" y="85">
        <parameter key="main_criterion" value="Davies Bouldin"/>
        <parameter key="main_criterion_only" value="false"/>
        <parameter key="normalize" value="false"/>
        <parameter key="maximize" value="false"/>
      </operator>
      <operator activated="true" class="model_simulator:cluster_model_visualizer" compatibility="9.9.000" expanded="true" height="103" name="Cluster Model Visualizer (2)" width="90" x="1184" y="85"/>
      <connect from_op="Retrieve Session 8 -- Student Data" from_port="output" to_op="Normalize" to_port="example set input"/>
      <connect from_op="Normalize" from_port="example set output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
      <connect from_op="Performance" from_port="example set" to_op="Cluster Model Visualizer (2)" to_port="clustered data"/>
      <connect from_op="Performance" from_port="cluster model" to_op="Cluster Model Visualizer (2)" to_port="model"/>
      <connect from_op="Cluster Model Visualizer (2)" from_port="visualizer output" to_port="result 1"/>
      <connect from_op="Cluster Model Visualizer (2)" from_port="model output" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>


Answers

  • ceaperez Member Posts: 525 Unicorn
    Hi @Unperterbed

    To obtain the average centroid distance, just connect the per output port of your Performance operator to a res port.

    Best 
  • Unperterbed Member Posts: 3 Contributor I
    edited April 2021
    Hey @ceaperez,

    Yes, I'm aware of this. My question was whether average cluster distance can still be output by the cluster model visualizer (i.e., am I just missing a setting somewhere?) or whether it has been intentionally removed (in which case, please bring it back!).

    From my perspective teaching intro to ML with RapidMiner, every additional step or operator means another 3-5 students get lost at that point. And, in my opinion, cluster distance performance is a particularly confusing operator because of its inconsistent node labels and the need to criss-cross the connectors from the k-means operator.

    Perhaps it'd be better for me to make a feature request, but I wanted to post here just in case I'm overlooking something obvious.
  • ceaperez Member Posts: 525 Unicorn
    Hi @Unperterbed

    You are right: from a teaching point of view, each new operator increases the chance that students get lost. I run into the same problem with some activities.
    Best
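
As a rough illustration of the first answer's suggestion, the extra connection might look like this in the process XML posted above. It assumes the operator keeps the name "Performance", that its per port is called "performance" in the XML, and that "result 3" is the next free res port; none of these is confirmed in the thread.

```xml
<!-- Sketch only: the port name "performance" and the target "result 3" are assumptions -->
<connect from_op="Performance" from_port="performance" to_port="result 3"/>
```

The line would sit alongside the other connect elements inside the inner process element. It delivers the Cluster Distance Performance results as a separate result; it does not restore the value in the Cluster Model Visualizer output, which is what the original question asks about.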