Re: Cross Distances operator : Weird results

land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
edited December 2018 in Product Feedback - Resolved

Hi Guys,

 

The fact that it works when the data comes from two different sources strongly suggests that there is a problem with the internal representation of nominal values as numerical ids. As this is a big problem and we need to make sure that distance calculation works as expected, I took a look at the code of version 7.5, as I have this at hand, but I can confirm that the problem still persists in 8.0.

The bottom line is: the numerical distance measures are broken because they are no longer initialized correctly. Their init method is never called any more, so they treat every single attribute as numerical. As a result, they also calculate a cosine similarity on the nominal attributes, using the internal ids of the nominal values.

As this id is arbitrary and, in particular, can change when another data set is loaded, the results can be arbitrary as well. The original init method checked that there are no nominal attributes and otherwise raised a UserError, aborting the process. This check was lost when a new init method was written in a superclass that no longer calls this part of the original code.
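To make the effect concrete, here is a minimal Java sketch. It is not RapidMiner code: the class and method names are made up for illustration, and the id assignment (one map per column, ids handed out in order of first appearance) is only an assumption about how such a mapping may behave. It shows that the cosine similarity of the same two rows changes when the ids are assigned from a different data set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: none of these names exist in RapidMiner.
final class NominalIdCosineDemo {

    // One value-to-id map per column; ids are handed out in order of first appearance.
    static List<Map<String, Integer>> buildMappings(List<List<String>> rows) {
        List<Map<String, Integer>> mappings = new ArrayList<>();
        for (int c = 0; c < rows.get(0).size(); c++) {
            mappings.add(new HashMap<>());
        }
        for (List<String> row : rows) {
            for (int c = 0; c < row.size(); c++) {
                Map<String, Integer> m = mappings.get(c);
                m.putIfAbsent(row.get(c), m.size());
            }
        }
        return mappings;
    }

    // Replaces each nominal value by its id, i.e. treats the nominal column as numerical.
    static double[] toIds(List<String> row, List<Map<String, Integer>> mappings) {
        double[] ids = new double[row.size()];
        for (int c = 0; c < row.size(); c++) {
            ids[c] = mappings.get(c).get(row.get(c));
        }
        return ids;
    }

    // Plain cosine similarity; returns NaN if one of the vectors is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Cosine similarity of rows x and y, with ids assigned while "loading" dataSet.
    static double cosineAfterLoading(List<List<String>> dataSet, List<String> x, List<String> y) {
        List<Map<String, Integer>> mappings = buildMappings(dataSet);
        return cosine(toIds(x, mappings), toIds(y, mappings));
    }

    public static void main(String[] args) {
        List<String> x = List.of("blue", "large");
        List<String> y = List.of("green", "small");
        List<List<String>> first = List.of(List.of("red", "medium"), x, y);
        List<List<String>> second = List.of(List.of("red", "small"), List.of("green", "large"), x, y);
        System.out.println(cosineAfterLoading(first, x, y));   // ids (1,1) vs (2,2)
        System.out.println(cosineAfterLoading(second, x, y));  // ids (2,1) vs (1,0)
    }
}
```

With the first data set the two rows look identical in direction (similarity 1.0), with the second the similarity drops to roughly 0.894, although the rows themselves did not change.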

I would recommend a fast fix from the RapidMiner side, as this creates WRONG results, which is even worse than an exception. @sgenzer This would even be worth a hotfix 8.1.001, what do you think?

 

It simply requires that the new method:

public DistanceMeasureConfig init(Attributes firstSetAttributes, Attributes secondSetAttributes)

calls the old method, or does what the old method does, which performs the checks correctly:

public void init(ExampleSet exampleSet) throws OperatorException
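As a rough sketch of what such a check could look like, here is a self-contained Java example. It deliberately does not use the RapidMiner classes: AttributeInfo, NominalAttributeException and both method names are hypothetical stand-ins, and the real fix may look quite different. The only point is that the two-argument init runs the nominal check on both attribute sets before anything else happens.

```java
import java.util.List;

// Illustration only: hypothetical stand-ins, not the RapidMiner API.
final class NumericalMeasureInitSketch {

    // Stand-in for an attribute description (name plus value type).
    static final class AttributeInfo {
        final String name;
        final boolean nominal;

        AttributeInfo(String name, boolean nominal) {
            this.name = name;
            this.nominal = nominal;
        }
    }

    // Stand-in for the UserError that aborts the process.
    static final class NominalAttributeException extends RuntimeException {
        NominalAttributeException(String attributeName) {
            super("Numerical measure cannot handle nominal attribute: " + attributeName);
        }
    }

    // The check the old init performed: no nominal attributes allowed.
    static void checkAllNumerical(List<AttributeInfo> attributes) {
        for (AttributeInfo attribute : attributes) {
            if (attribute.nominal) {
                throw new NominalAttributeException(attribute.name);
            }
        }
    }

    // Stand-in for the new two-argument init: validate both sets first.
    static void init(List<AttributeInfo> firstSetAttributes, List<AttributeInfo> secondSetAttributes) {
        checkAllNumerical(firstSetAttributes);
        checkAllNumerical(secondSetAttributes);
        // ... the measure-specific initialization would follow here ...
    }
}
```

Failing loudly here is the desired behavior: an aborted process is easier to notice than a silently wrong distance.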

A simple process showing that nominal values are still treated as numerical ones:

 

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_nominal_data" compatibility="8.0.001" expanded="true" height="68" name="Generate Nominal Data" width="90" x="112" y="34"/>
<operator activated="true" class="generate_nominal_data" compatibility="8.0.001" expanded="true" height="68" name="Generate Nominal Data (2)" width="90" x="112" y="136"/>
<operator activated="true" class="cross_distances" compatibility="8.0.001" expanded="true" height="103" name="Cross Distances" width="90" x="514" y="85">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="only_top_k" value="true"/>
<parameter key="k" value="1"/>
<parameter key="compute_similarities" value="true"/>
</operator>
<connect from_op="Generate Nominal Data" from_port="output" to_op="Cross Distances" to_port="request set"/>
<connect from_op="Generate Nominal Data (2)" from_port="output" to_op="Cross Distances" to_port="reference set"/>
<connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

How you can circumvent the problem for now: remove any nominal attributes before calculating a numerical distance measure. If you need to incorporate them, first transform them into dummy encoding using the Nominal to Numerical operator on the large (reference) data set. Then apply the resulting preprocessing model (third purple port) to the request data set using Apply Model.
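The idea behind this workaround, sketched in plain Java (again not RapidMiner code; the class and method names are invented, and mapping unseen values to an all-zero vector is my assumption for the sketch): the list of categories is learned once from the reference set, and exactly the same list is then used to encode the request set, so both sets end up with identical columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustration only: a tiny dummy (one-hot) encoder for a single nominal column.
final class DummyEncodingSketch {

    // "Fit" step on the reference set: the distinct values, in a stable sorted order.
    static List<String> fitCategories(List<String> referenceValues) {
        return new ArrayList<>(new TreeSet<>(referenceValues));
    }

    // "Apply" step: one column per known category; unseen values become all zeros (assumption).
    static double[] encode(String value, List<String> categories) {
        double[] encoded = new double[categories.size()];
        int index = categories.indexOf(value);
        if (index >= 0) {
            encoded[index] = 1.0;
        }
        return encoded;
    }

    public static void main(String[] args) {
        List<String> categories = fitCategories(List.of("b", "a", "b", "c"));
        // The request set is encoded with the categories learned from the reference set.
        System.out.println(Arrays.toString(encode("b", categories)));
        System.out.println(Arrays.toString(encode("z", categories)));
    }
}
```

Because the encoding no longer depends on internal ids, the distance between two rows stays the same no matter which data set was loaded first.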

 

 

Greetings,

Sebastian

 

 


Fixed and Released · Last Updated

RM-3522

Comments

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    thanks, @land. Much appreciated. Pushing to Product Feedback.


    Scott

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

     Pushed to Dev Team.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    dev team Jira ticket RM-3522 created. Will update when available.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    @land this has been lingering for a while. Is this still an issue from your end?
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Well, I don't know, I haven't used the operator for quite some time. But I guess if your devs fixed that, it's all right. It seems there have been changes in the file in 9.2.
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    @land AFAIK the dev team has not closed this ticket from their end. But if it's not affecting community members (like yourself), I'd like to push this thread to "resolved" just to tidy things up.
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    @sgenzer,

    After a little test, it seems that the bug is fixed (here in RM 9.4).

    The process is in the attached file.

    Thanks to @land and the dev team for solving this issue.

    Regards,

    Lionel