Understanding mixed euclidean distance calculation for polynomial and nominal attributes

alourenco Member Posts: 5 Contributor II
edited December 2018 in Product Feedback - Resolved

Hi!

I'm aware of some previous posts about how the mixed Euclidean distance is calculated. My understanding is that for numeric attributes it is the standard Euclidean calculation, whereas for nominal attributes a distance of 1 is counted if the two values are not the same.
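
Written out as a formula, this understanding would read as follows. This is a sketch of the rule as described in this post, not taken from RapidMiner's source; N (the numeric attributes) and C (the nominal attributes) are notation introduced here, and no normalization is assumed.

```latex
d(x, y) = \sqrt{\sum_{i \in N} (x_i - y_i)^2 \;+\; \sum_{j \in C} \mathbf{1}[x_j \neq y_j]}
```

Under this reading, two examples that differ in exactly k nominal attributes and in nothing else would be at distance sqrt(k).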

However, I cannot make sense of the results I am getting for a simple example where I have polynominal and binominal attributes (which I expected would be handled the same way).

 

The data is as follows:

 

REQUEST EXAMPLE

11075FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE

 

REFERENCE EXAMPLES

1128FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE
21545FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE
31545FALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSEFALSE

 

The first column is the row id, the second column is a class attribute (ignored in the calculation), the third and fourth columns are polynominal, and the rest are binominal.

 

The output is:

 

1.0    1.0    0.0
1.0    2.0    1.4142135623730951
1.0    3.0    1.4142135623730951

 

How can the distance between the request example and the first of the reference examples be zero? Most likely, it is a very obvious calculation but I cannot see it...

 

I would appreciate some help!

My thanks!

 

 


Declined · Last Updated

No activity or votes since Oct 2018. Please comment and cc sgenzer if this should be reopened. RM-3793

Comments

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,558 Unicorn

    Can you post your XML? It is hard to see how you have your operator configured, and it could be something in the parameter settings (e.g., only looking at nominal and not numerical attributes).

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • alourenco Member Posts: 5 Contributor II

    Sure! Many thanks for the prompt response.

    Mosyafa
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,558 Unicorn

    So I can't see your original data here, but I created a simple test process along the lines you explained, and everything seems to be working normally.  Take a look at this process:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.002" expanded="true" name="Process">
    <parameter key="random_seed" value="2001"/>
    <process expanded="true">
    <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.5.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="187">
    <parameter key="generator_type" value="comma_separated_text"/>
    <list key="function_descriptions"/>
    <list key="numeric_series_configuration"/>
    <list key="date_series_configuration"/>
    <list key="date_series_configuration (interval)"/>
    <parameter key="input_csv_text" value="id,att1,att2,att3,att4,att5,att6&#10;1,10,7,5,FALSE,FALSE,FALSE"/>
    </operator>
    <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.5.000" expanded="true" height="68" name="Create ExampleSet (2)" width="90" x="112" y="34">
    <parameter key="generator_type" value="comma_separated_text"/>
    <list key="function_descriptions"/>
    <list key="numeric_series_configuration"/>
    <list key="date_series_configuration"/>
    <list key="date_series_configuration (interval)"/>
    <parameter key="input_csv_text" value="id,att1,att2,att3,att4,att5,att6&#10;1,10,7,5,FALSE,FALSE,FALSE&#10;2,1,2,8,FALSE,FALSE,FALSE&#10;3,15,4,5,FALSE,FALSE,FALSE&#10;4,15,4,5,FALSE,FALSE,FALSE&#10;5,10,7,5,TRUE,TRUE,TRUE"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="9.0.002" expanded="true" height="82" name="Set Role" width="90" x="246" y="34">
    <parameter key="attribute_name" value="id"/>
    <parameter key="target_role" value="id"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="9.0.002" expanded="true" height="82" name="Set Role (2)" width="90" x="246" y="136">
    <parameter key="attribute_name" value="id"/>
    <parameter key="target_role" value="id"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="cross_distances" compatibility="9.0.002" expanded="true" height="103" name="Cross Distances" width="90" x="514" y="85">
    <parameter key="nominal_measure" value="DiceSimilarity"/>
    <parameter key="k" value="3"/>
    </operator>
    <connect from_op="Create ExampleSet" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
    <connect from_op="Create ExampleSet (2)" from_port="output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Cross Distances" to_port="request set"/>
    <connect from_op="Set Role (2)" from_port="example set output" to_op="Cross Distances" to_port="reference set"/>
    <connect from_op="Cross Distances" from_port="result set" to_port="result 3"/>
    <connect from_op="Cross Distances" from_port="request set" to_port="result 1"/>
    <connect from_op="Cross Distances" from_port="reference set" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>

    This seems to be working as expected.  The record that is a duplicate shows a distance of zero.  The ones that have differences in the 3 numerical attributes are being calculated in the expected way. And the one record that has the same numerical values but 3 different categorical attributes has a distance value of sqrt(3) as expected.
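
The expected values for this test process can be checked by hand. Below is a minimal Python sketch, assuming numeric attributes contribute squared differences, nominal attributes contribute 1 on a mismatch, and no normalization is applied. `mixed_euclidean` is a hypothetical helper written for illustration, not part of RapidMiner, and the rows are copied from the two `input_csv_text` parameters in the process above.

```python
import math

def mixed_euclidean(a, b):
    """Mixed Euclidean distance as described in this thread (an assumption,
    not RapidMiner's actual implementation): strings are treated as nominal
    values, everything else as numeric."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, str) or isinstance(y, str):
            total += 0.0 if x == y else 1.0   # nominal: 0 if equal, 1 otherwise
        else:
            total += (x - y) ** 2             # numeric: squared difference
    return math.sqrt(total)

# Single row from "Create ExampleSet" (wired to the reference set port); id column omitted.
reference = (10, 7, 5, "FALSE", "FALSE", "FALSE")

# Five rows from "Create ExampleSet (2)" (wired to the request set port), keyed by id.
requests = {
    1: (10, 7, 5, "FALSE", "FALSE", "FALSE"),
    2: (1, 2, 8, "FALSE", "FALSE", "FALSE"),
    3: (15, 4, 5, "FALSE", "FALSE", "FALSE"),
    4: (15, 4, 5, "FALSE", "FALSE", "FALSE"),
    5: (10, 7, 5, "TRUE", "TRUE", "TRUE"),
}

distances = {rid: mixed_euclidean(row, reference) for rid, row in requests.items()}
for rid, dist in distances.items():
    print(rid, dist)
```

Under these assumptions, id 1 (the duplicate) comes out at 0.0 and id 5 (three differing categorical values) at sqrt(3), in line with the description above.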

    So here are a few ideas for you to troubleshoot in your own setup:

    • Are you sure you have set the role of the id field to ID in RapidMiner?  If not, that will affect the outcome.
    • Are you sure that the fields that are being included in the comparison have the same attribute names?  That would also affect the outcome.
    • Make sure all data types are correct (numericals are numeric and categoricals are polynominal).
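
The name and type checks in the list above can also be scripted outside RapidMiner on CSV exports of the two sets. The following is only a rough sketch using Python's standard `csv` module; `infer_type`, `schema`, and `schema_mismatches` are hypothetical helpers, the type guess is a simplification (a column counts as numeric if every value parses as a float), and the two CSV strings are made-up illustrations rather than the data from this thread. Attribute roles such as ID are not stored in a CSV file, so that check still has to be done in RapidMiner itself.

```python
import csv
import io

def infer_type(values):
    """Guess 'numeric' if every value parses as a float, otherwise 'nominal'."""
    try:
        for v in values:
            float(v)
        return "numeric"
    except ValueError:
        return "nominal"

def schema(csv_text):
    """Map each column name in the CSV text to its guessed type."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {name: infer_type([r[name] for r in rows]) for name in reader.fieldnames}

def schema_mismatches(request_csv, reference_csv):
    """List columns that are missing from one set or typed differently."""
    req, ref = schema(request_csv), schema(reference_csv)
    problems = []
    for name in sorted(set(req) | set(ref)):
        if name not in req:
            problems.append(f"{name}: missing from request set")
        elif name not in ref:
            problems.append(f"{name}: missing from reference set")
        elif req[name] != ref[name]:
            problems.append(f"{name}: {req[name]} in request set vs {ref[name]} in reference set")
    return problems

# Made-up example data: an extra column and a type mismatch.
request_csv = "id,imput,grav,flag\n1,7,5,FALSE\n"
reference_csv = "id,imput,grav,flag,age\n1,2,8,0,30\n"

problems = schema_mismatches(request_csv, reference_csv)
for p in problems:
    print(p)
```

An empty result only means the names and guessed types line up; it does not say anything about how the distance operator itself treats the attributes.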

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • alourenco Member Posts: 5 Contributor II

    There are no duplicate examples nor numeric attributes. I am attaching the data to this post.

    I am sure that both sets of examples have exactly the same number of attributes and that the attributes are named the same, have the same type, and are in the same order. The id is labelled as id, the class is a label, imput and grav are nominal attributes, and the rest of the attributes are binominal.

    Many thanks!

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,558 Unicorn

    I am confused: in the dataset you supplied, none of the conditions you specified appear to be true!

    • They do not contain the same number of attributes: "small-request" has 26 attributes but "small-test10" has 28 attributes (code and age are extras)
    • In "small-test10" all attributes are of type integer, while in "small-reference" all attributes are nominal or binominal
    • There is only one example in each dataset and other than the extra attributes they do appear to be duplicates
    • There is no id field present, only a label

    These discrepancies would certainly explain why you are not getting the expected results.  You should harmonize your datasets in terms of number of attributes and data types, correct discrepancies as needed, and try the operator again. 

     

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • alourenco Member Posts: 5 Contributor II

    So sorry, I included three datasets instead of two, hence your confusion. I'm attaching the data again (also as CSV files) along with some screenshots of the data and the statistics as presented in RM.

    In short, I have one example (small request) that I want to compare against three examples (small reference).

     

    You are right that the examples have the same values for all the binominal attributes. However, the values for imput and grav (the two polynominal attributes) are not always the same.

    How can the distance between the request and the reference #1 be zero if they have different values for these attributes?

     

     

     

     

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,558 Unicorn

    Yep, I agree, these results are fishy.

    @mschmitz might know something more about what is going on with this cross-distance calculation.  It doesn't seem to like those initial polynominal attributes (not the binominal ones).  Is this a bug in the implementation of cross-distance? Or is there some other weird effect going on here that is not obvious?

    @sgenzer you might also remember, there was a related problem with cross-distance earlier in the year.  Do you know what ever happened with this thread: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Cross-Distances-operator-Weird-results/m-p/46161

    It looks like it was simply abandoned, but combined with this thread, it makes me think there is likely a problem with this operator...

      

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,752 RM Data Scientist

    Hi @alourenco, @Telcontar120,

     

    I've run a few tests and it looks like a bug. I will file a ticket.

     

    BR,

    Martin

     

    CC: @sgenzer

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,957 Community Manager