RapidMiner


Understanding mixed euclidean distance calculation for polynomial and nominal attributes

Status: Investigating

Hi!

I'm aware of some previous posts about how the mixed Euclidean distance is calculated. My understanding is that numeric attributes use the standard Euclidean calculation, whereas each nominal attribute contributes a distance of 1 if the two values differ and 0 if they match.
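To make that understanding concrete, here is a minimal Python sketch of the rule as described in this post. It is not RapidMiner's implementation; the function name `mixed_euclidean` and the sample records are hypothetical.

```python
import math
from numbers import Number


def mixed_euclidean(a, b):
    """Distance between two records given as equal-length sequences.

    Assumed rule (from the description above, not from RapidMiner's source):
    numeric attributes add their squared difference, nominal attributes
    add 1 when the two values differ and 0 when they match.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, Number) and isinstance(y, Number):
            total += (x - y) ** 2  # numeric: squared difference
        elif x != y:
            total += 1.0  # nominal: mismatch counts as 1
    return math.sqrt(total)


# Hypothetical records: the two leading nominal attributes differ,
# the three binominal ones (kept as strings) match.
request = ["a", "x", "FALSE", "FALSE", "FALSE"]
reference = ["b", "y", "FALSE", "FALSE", "FALSE"]
print(mixed_euclidean(request, reference))  # 1.4142135623730951
```

Under this rule, a distance of zero is only possible when every attribute value matches.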

However, I cannot make sense of the results I am getting for a simple example with polynominal and binominal attributes (which I expected to be treated the same way).

 

The data is as follows:

 

REQUEST EXAMPLE

11075FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE

 

REFERENCE EXAMPLES

1128FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE
21545FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE
31545FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE

 

The first column is the row id, the second column is a class attribute (ignored in the calculation), the third and fourth columns are polynominal, and the rest are binominal.

 

The output is:

 

request  reference  distance
1.0      1.0        0.0
1.0      2.0        1.4142135623730951
1.0      3.0        1.4142135623730951

 

How can the distance between the request example and the first of the reference examples be zero? Most likely, it is a very obvious calculation but I cannot see it...

 

I would appreciate some help!

My thanks!

 

 

Comments
Unicorn

Can you post your process XML? It is hard to see how you have your operator configured, and the cause could be a parameter setting (e.g., only looking at nominal and not numerical attributes).

 

Learner II

Sure! Many thanks for the prompt response.

Unicorn

So I can't see your original data here, but I created a simple test process along the lines you explained, and everything seems to be working normally here. Take a look at this process:

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.002" expanded="true" name="Process">
    <parameter key="random_seed" value="2001"/>
    <process expanded="true">
      <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.5.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="187">
        <parameter key="generator_type" value="comma_separated_text"/>
        <list key="function_descriptions"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="input_csv_text" value="id,att1,att2,att3,att4,att5,att6&#10;1,10,7,5,FALSE,FALSE,FALSE"/>
      </operator>
      <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.5.000" expanded="true" height="68" name="Create ExampleSet (2)" width="90" x="112" y="34">
        <parameter key="generator_type" value="comma_separated_text"/>
        <list key="function_descriptions"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="input_csv_text" value="id,att1,att2,att3,att4,att5,att6&#10;1,10,7,5,FALSE,FALSE,FALSE&#10;2,1,2,8,FALSE,FALSE,FALSE&#10;3,15,4,5,FALSE,FALSE,FALSE&#10;4,15,4,5,FALSE,FALSE,FALSE&#10;5,10,7,5,TRUE,TRUE,TRUE"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.0.002" expanded="true" height="82" name="Set Role" width="90" x="246" y="34">
        <parameter key="attribute_name" value="id"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.0.002" expanded="true" height="82" name="Set Role (2)" width="90" x="246" y="136">
        <parameter key="attribute_name" value="id"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="cross_distances" compatibility="9.0.002" expanded="true" height="103" name="Cross Distances" width="90" x="514" y="85">
        <parameter key="nominal_measure" value="DiceSimilarity"/>
        <parameter key="k" value="3"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Create ExampleSet (2)" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Cross Distances" to_port="request set"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Cross Distances" to_port="reference set"/>
      <connect from_op="Cross Distances" from_port="result set" to_port="result 3"/>
      <connect from_op="Cross Distances" from_port="request set" to_port="result 1"/>
      <connect from_op="Cross Distances" from_port="reference set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

This seems to be working as expected.  The record that is a duplicate shows a distance of zero.  The ones that have differences in the 3 numerical attributes are being calculated in the expected way. And the one record that has the same numerical values but 3 different categorical attributes has a distance value of sqrt(3) as expected.
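As a check, these values follow from the mixed Euclidean rule if att1 to att3 are read as numeric and att4 to att6 as nominal (an assumption about how the CSV columns are typed). Comparing the single row (10, 7, 5, FALSE, FALSE, FALSE) against rows 1 to 5 of the other set:

```latex
\begin{align*}
d_1 &= \sqrt{0^2 + 0^2 + 0^2 + 0 + 0 + 0} = 0 \\
d_2 &= \sqrt{(10-1)^2 + (7-2)^2 + (5-8)^2 + 0 + 0 + 0} = \sqrt{115} \approx 10.72 \\
d_3 = d_4 &= \sqrt{(10-15)^2 + (7-4)^2 + (5-5)^2 + 0 + 0 + 0} = \sqrt{34} \approx 5.83 \\
d_5 &= \sqrt{0^2 + 0^2 + 0^2 + 1 + 1 + 1} = \sqrt{3} \approx 1.73
\end{align*}
```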

So here are a few ideas for you to troubleshoot in your own setup:

  • Are you sure you have set the role of the id field to ID in RapidMiner?  If not, that will affect the outcome.
  • Are you sure that the fields that are being included in the comparison have the same attribute names?  That would also affect the outcome.
  • Make sure all data types are correct (numericals are numeric and categoricals are polynominal).

 

Learner II

There are no duplicate examples nor numeric attributes. I am attaching the data to this post.

I am sure that both sets of examples have exactly the same number of attributes and that the attributes are named the same, have the same type, and are in the same order. The id is labelled as id, the class is a label, imput and grav are nominal attributes, and the rest of the attributes are binominal.

Many thanks!

Unicorn

I am confused: in the datasets you supplied, none of the conditions you specified appear to be true!

  • They do not contain the same number of attributes: "small-request" has 26 attributes but "small-test10" has 28 attributes (code and age are extras)
  • In "small-test10" all attributes are of type integer, while in "small-reference" all attributes are nominal or binominal
  • There is only one example in each dataset and other than the extra attributes they do appear to be duplicates
  • There is no id field present, only a label

These discrepancies would certainly explain why you are not getting the expected results.  You should harmonize your datasets in terms of number of attributes and data types, correct discrepancies as needed, and try the operator again. 
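A quick way to spot such discrepancies outside RapidMiner is to compare the two schemas directly. The sketch below is a hypothetical helper in plain Python; the function name `diff_schemas` and the schema dictionaries are illustrative, with only a few attribute names borrowed from this thread and the type names assumed.

```python
def diff_schemas(a, b):
    """Compare two {attribute_name: type_name} mappings.

    Returns the attributes present on only one side and the shared
    attributes whose declared types disagree.
    """
    shared = set(a) & set(b)
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "type_mismatch": {
            name: (a[name], b[name])
            for name in sorted(shared)
            if a[name] != b[name]
        },
    }


# Hypothetical, shortened schemas (the real datasets have 26 and 28 attributes).
request_schema = {"label": "nominal", "imput": "nominal", "grav": "nominal"}
other_schema = {
    "label": "integer",
    "imput": "integer",
    "grav": "integer",
    "code": "integer",
    "age": "integer",
}

report = diff_schemas(request_schema, other_schema)
print(report["only_in_b"])      # ['age', 'code']
print(report["type_mismatch"])  # every shared attribute differs in type
```

An empty report on all three keys would mean the two sets agree on attribute names and types; attribute order is not checked here.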

 

 

 

Learner II

So sorry, I included three datasets instead of two, hence your confusion. I'm attaching the data again (also as CSV files) along with some screenshots of the data and the statistics as presented in RM.

In short, I have one example (small request) that I want to compare against three examples (small reference).

 

You are right that the examples have the same values for all the binominal attributes. However, the values for imput and grav (the two polynominal attributes) are not always the same.

How can the distance between the request and the reference #1 be zero if they have different values for these attributes?

 

 

 

 

Unicorn

Yep, I agree, these results are fishy.

@mschmitz might know something more about what is going on with this cross-distance calculation.  It doesn't seem to like those initial polynominal attributes (not the binominal ones).  Is this a bug in the implementation of cross-distance? Or is there some other weird effect going on here that is not obvious?

@sgenzer you might also remember, there was a related problem with cross-distance earlier in the year.  Do you know what ever happened with this thread: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Cross-Distances-operator-Weird-results/m...

It looks like it was simply abandoned, but combined with this thread, it makes me think there is likely a problem with this operator...

  

RM Staff

Hi @alourenco, @Telcontar120,

 

I've run a few tests and it looks like a bug. I will file a ticket.

 

BR,

Martin

 

CC: @sgenzer
