Cross Distances operator : Weird results

lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
edited December 2018 in Help

Hi, 

 

I am opening a dedicated topic for a question that went unanswered in a previous topic.

In that topic, the goal was to calculate the similarity between "employee characteristics" and "a position".

I decided to use the Cross Distances operator, but I got weird results:

The calculated similarity is always the same, regardless of the "position" and the "employee characteristics".

I ran some tests without success, and this question keeps running through my mind.

 

NB: I used the Read Excel operator to import my example sets.

 

You can find my process here:

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Employees" width="90" x="45" y="85">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\HR_Sourcing\Employees.xlsx"/>
<parameter key="imported_cell_range" value="A1:F5"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Id_employee.true.integer.id"/>
<parameter key="1" value="name.true.polynominal.attribute"/>
<parameter key="2" value="skills.true.polynominal.attribute"/>
<parameter key="3" value="department.true.polynominal.attribute"/>
<parameter key="4" value="language.true.polynominal.attribute"/>
<parameter key="5" value="experience.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply" width="90" x="179" y="85"/>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Id_employee|department|experience|language|skills"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes (3)" width="90" x="313" y="136">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="name|Id_employee"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Position" width="90" x="45" y="238">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\HR_Sourcing\Employees.xlsx"/>
<parameter key="sheet_number" value="2"/>
<parameter key="imported_cell_range" value="A1:E2"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Id_position.true.integer.id"/>
<parameter key="1" value="skills.true.polynominal.attribute"/>
<parameter key="2" value="department.true.polynominal.attribute"/>
<parameter key="3" value="language.true.polynominal.attribute"/>
<parameter key="4" value="experience.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="179" y="238">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="department|experience|language|skills|Id_position"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" breakpoints="before" class="cross_distances" compatibility="8.0.001" expanded="true" height="103" name="Cross Distances" width="90" x="447" y="85">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="compute_similarities" value="true"/>
</operator>
<operator activated="true" class="rename" compatibility="8.0.001" expanded="true" height="82" name="Rename" width="90" x="581" y="85">
<parameter key="old_name" value="document"/>
<parameter key="new_name" value="Employee"/>
<list key="rename_additional_attributes">
<parameter key="request" value="position"/>
<parameter key="distance" value="similarity"/>
</list>
</operator>
<operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role (3)" width="90" x="715" y="85">
<parameter key="attribute_name" value="Employee"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="join" compatibility="8.0.001" expanded="true" height="82" name="Join" width="90" x="849" y="136">
<list key="key_attributes"/>
</operator>
<connect from_op="Employees" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Select Attributes (3)" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Cross Distances" to_port="reference set"/>
<connect from_op="Select Attributes (3)" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Position" from_port="output" to_op="Select Attributes (2)" to_port="example set input"/>
<connect from_op="Select Attributes (2)" from_port="example set output" to_op="Cross Distances" to_port="request set"/>
<connect from_op="Cross Distances" from_port="result set" to_op="Rename" to_port="example set input"/>
<connect from_op="Cross Distances" from_port="request set" to_port="result 3"/>
<connect from_op="Cross Distances" from_port="reference set" to_port="result 1"/>
<connect from_op="Rename" from_port="example set output" to_op="Set Role (3)" to_port="example set input"/>
<connect from_op="Set Role (3)" from_port="example set output" to_op="Join" to_port="right"/>
<connect from_op="Join" from_port="join" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>

My (fictitious) example sets can be downloaded from this link:

https://drive.google.com/open?id=18JFovsp_pk7l-1SNx-oeywdwzVSeG-r0

 

Is it a bug? If not, can you tell me what I missed or forgot?

 

Thank you for your responses,

 

Regards, 

 

Lionel

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I wasn't able to retrieve your dataset to check this, but if your attributes are both nominal and numerical, then the distance metric will be "Mixed Euclidean" which sets differences in nominal categories to equal 1 if they are not the same and 0 if they are the same.  That can often lead to identical differences regardless of the specific values that are contained.
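To make the 0/1 rule for nominal attributes concrete, here is a minimal Python sketch of a mixed Euclidean distance. It is an illustration of the rule described above, not RapidMiner's actual implementation; the function name `mixed_euclidean` and all record values are invented for the example.

```python
import math

def mixed_euclidean(a, b):
    """Distance between two records given as dicts of attribute -> value."""
    total = 0.0
    for key in a:
        x, y = a[key], b[key]
        if isinstance(x, str) or isinstance(y, str):
            # Nominal attribute: difference is 0 if equal, 1 otherwise.
            diff = 0.0 if x == y else 1.0
        else:
            # Numerical attribute: ordinary difference.
            diff = float(x) - float(y)
        total += diff * diff
    return math.sqrt(total)

# Hypothetical records (not the data from this thread).
position = {"skills": "java", "department": "it", "language": "english", "experience": 3}
emp_same = dict(position)
emp_a = {"skills": "python", "department": "sales", "language": "french", "experience": 3}
emp_b = {"skills": "cobol", "department": "hr", "language": "german", "experience": 3}

d_same = mixed_euclidean(position, emp_same)
d_a = mixed_euclidean(position, emp_a)
d_b = mixed_euclidean(position, emp_b)
print(d_same, d_a, d_b)
```

Under this rule, `emp_a` and `emp_b` get the same distance to the position even though their values differ, because every mismatch counts as exactly 1. A record identical to the position gets a distance of 0.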

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi,

     

    Thank you for your feedback @Telcontar120. Indeed, my four attributes are a mix of nominal (3) and numerical (1):

     - skills, department and language: nominal

     - experience: numerical

     

    As proposed, I used "MixedEuclideanDistance". However, when a position and an employee's characteristics are strictly the same (here, Id_employee = 3), the distance is different from 0. It seems that RapidMiner doesn't detect that the nominal attributes are equal in the position and in the employee characteristics.

     

    Here are the employee characteristics of my example set:

    HR_Sourcing_1.png

    Here is the position:

    HR_Sourcing_2.png

    And here are the results:

    HR_Sourcing_3.png

    NB: My nominal attributes are imported as "Nominal" via the Read Excel operator.

     

    What have I missed or forgotten?

     

    Thank you,

     

    Regards, 

     

    Lionel

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    OK, agreed, that looks unusual!  Did you make sure to "Trim" your nominals?  It could be errant (and invisible) leading or trailing spaces are causing a mismatch when it looks like they should match.  One other point is to make sure the spelling is exactly the same on the nominal attributes (I noticed for example that "engineering" is misspelled in the examples you have shown, but maybe it is not misspelled everywhere?) Other than that, I have no idea why you would get the results you are seeing.  Maybe @mschmitz has an idea?

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi,

     

    Thank you for your feedback @Telcontar120.

    1. I didn't know the Trim operator, but unfortunately it does not change the results of the process.

    2. The (mis)spelling (my English is not fluent...) is strictly the same in the nominal attributes of the position and of the employee characteristics: I copied and pasted between the two example sets.

     

    Best regards, 

     

    Lionel

     

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn

    Hi,

    I took a look at your process from above, where the chosen similarity measure is "Cosine Similarity", which is a purely numerical measure.

    So RapidMiner would be right to compute the similarity using only the single numerical attribute. However, that doesn't match what we see there, and the cosine would also not really differ if we have just one axis.

    To do it correctly, you will need to convert the nominal attributes into numerical ones. Use dummy encoding if you want to use Cosine Similarity.
    You can try Mixed Euclidean as well, but then the experience attribute might dominate the distance, as its possible distances are 0 to 4 while all the others are 0 to 1.
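The remark about a single axis can be checked with a short Python sketch. It assumes only one numerical attribute (experience) enters the computation, and the experience values are invented for illustration. It shows plain cosine similarity, not RapidMiner's internal code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two numeric vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical one-dimensional vectors: only "experience" is numerical.
position_experience = [3.0]
employee_experiences = [[1.0], [2.0], [3.0], [5.0]]

similarities = [cosine_similarity(position_experience, e) for e in employee_experiences]
print(similarities)
```

With one positive attribute, every vector points in the same direction, so the similarity equals 1 (up to floating-point rounding) whatever the experience values are. Cosine similarity only becomes informative once several numerical axes are available, for example after dummy encoding.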

     

    Greetings,

    Sebastian

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi,

     

    Thank you for your feedback @land.

    I tried the process with dummy-encoded nominal attributes.

    But RapidMiner doesn't compute the distances/similarities.

     HR_Sourcing_4.png

    I think it's because the number of attributes differs between the employee characteristics example set and the position example set:

     1. Here is the "dummy encoded" Position example set:

    HR_Sourcing_5.png

     2. And here is the "dummy encoded" employee characteristics example set:

    HR_Sourcing_6.png

    What do you think?
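If the mismatch in attribute count is indeed the cause, one possible remedy is to build the dummy columns from the categories of both example sets together, so that each set ends up with the same columns. The Python sketch below illustrates that idea only; `dummy_encode`, the column naming and all values are assumptions made for the example, not RapidMiner operators or the thread's data.

```python
def dummy_encode(records, nominal_attributes, categories):
    """One-hot encode nominal attributes using a fixed category list per attribute."""
    encoded = []
    for record in records:
        row = {}
        for attribute, value in record.items():
            if attribute in nominal_attributes:
                # One 0/1 column per known category of this attribute.
                for category in categories[attribute]:
                    row[f"{attribute} = {category}"] = 1 if value == category else 0
            else:
                # Numerical attributes are carried over unchanged.
                row[attribute] = value
        encoded.append(row)
    return encoded

# Hypothetical data.
employees = [
    {"skills": "java", "department": "it", "language": "english", "experience": 2},
    {"skills": "python", "department": "sales", "language": "french", "experience": 4},
]
positions = [
    {"skills": "java", "department": "it", "language": "english", "experience": 2},
]
nominal = ["skills", "department", "language"]

# Shared category lists built from both sets.
categories = {
    attribute: sorted({record[attribute] for record in employees + positions})
    for attribute in nominal
}
encoded_employees = dummy_encode(employees, nominal, categories)
encoded_positions = dummy_encode(positions, nominal, categories)

# For comparison: encoding the position on its own yields fewer columns.
position_only = {a: sorted({r[a] for r in positions}) for a in nominal}
position_alone = dummy_encode(positions, nominal, position_only)

print(len(encoded_positions[0]), len(position_alone[0]))
```

Encoded separately, the single position only produces columns for the categories it contains. With shared category lists, both sets have identical columns and can be compared attribute by attribute.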

     

    Concerning Mixed Euclidean, I tried it and, as I said in my previous post, I don't understand why the distance is different from 0 for a position and employee characteristics that are strictly the same.

     

    Best regards, 

     

    Lionel

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    @land I agree with @lionelderkrikor here.  After looking at his examples, regardless of the distance metric used, I cannot understand why the cross-distance would be greater than 0 if all the attributes have the same values.  Can you clarify?  Or perhaps @sgenzer can ask one of the developers to take a look at this in more detail?  

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hi all -

     

    so I played with this a bit and there is something wonky with the two excel docs coming in. If you simply multiply one sheet and then filter out one row as reference, it works just fine:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="false" breakpoints="after" class="subprocess" compatibility="7.6.001" expanded="true" height="103" name="Subprocess" width="90" x="45" y="850">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="179" y="30">
    <list key="attribute_values">
    <parameter key="attribute1" value="1"/>
    <parameter key="attribute2" value="2"/>
    <parameter key="attribute3" value="3"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="179" y="165">
    <list key="attribute_values">
    <parameter key="attribute1" value="1"/>
    <parameter key="attribute2" value="2"/>
    <parameter key="attribute3" value="3"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="179" y="300">
    <list key="attribute_values">
    <parameter key="attribute1" value="4"/>
    <parameter key="attribute2" value="5"/>
    <parameter key="attribute3" value="6"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="append" compatibility="7.6.001" expanded="true" height="103" name="Append" width="90" x="313" y="210"/>
    <operator activated="true" class="generate_id" compatibility="7.6.001" expanded="true" height="82" name="Generate ID" width="90" x="514" y="30">
    <parameter key="create_nominal_ids" value="true"/>
    </operator>
    <operator activated="true" class="generate_id" compatibility="7.6.001" expanded="true" height="82" name="Generate ID (2)" width="90" x="514" y="210">
    <parameter key="create_nominal_ids" value="true"/>
    </operator>
    <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Generate ID" to_port="example set input"/>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
    <connect from_op="Append" from_port="merged set" to_op="Generate ID (2)" to_port="example set input"/>
    <connect from_op="Generate ID" from_port="example set output" to_port="out 1"/>
    <connect from_op="Generate ID (2)" from_port="example set output" to_port="out 2"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="162"/>
    <portSpacing port="sink_out 3" spacing="0"/>
    </process>
    </operator>
    <operator activated="false" class="cross_distances" compatibility="7.6.001" expanded="true" height="103" name="Cross Distances" width="90" x="246" y="850">
    <parameter key="numerical_measure" value="KernelEuclideanDistance"/>
    </operator>
    <operator activated="true" class="read_excel" compatibility="7.6.001" expanded="true" height="68" name="Employees" width="90" x="45" y="187">
    <parameter key="excel_file" value="/Users/genzerconsulting/Desktop/Employees.xlsx"/>
    <parameter key="imported_cell_range" value="A1:F5"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Id_employee.true.integer.id"/>
    <parameter key="1" value="name.true.polynominal.attribute"/>
    <parameter key="2" value="skills.true.polynominal.attribute"/>
    <parameter key="3" value="department.true.polynominal.attribute"/>
    <parameter key="4" value="language.true.polynominal.attribute"/>
    <parameter key="5" value="experience.true.integer.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="187">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value="name"/>
    <parameter key="attributes" value="Id_employee|department|experience|language|skills"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply" width="90" x="313" y="187"/>
    <operator activated="true" class="filter_example_range" compatibility="7.6.001" expanded="true" height="82" name="Filter Example Range" width="90" x="447" y="289">
    <parameter key="first_example" value="3"/>
    <parameter key="last_example" value="3"/>
    </operator>
    <operator activated="true" breakpoints="before" class="cross_distances" compatibility="7.6.001" expanded="true" height="103" name="Cross Distances (2)" width="90" x="581" y="187">
    <parameter key="numerical_measure" value="CosineSimilarity"/>
    </operator>
    <connect from_op="Subprocess" from_port="out 1" to_op="Cross Distances" to_port="request set"/>
    <connect from_op="Subprocess" from_port="out 2" to_op="Cross Distances" to_port="reference set"/>
    <connect from_op="Employees" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Cross Distances (2)" to_port="request set"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Filter Example Range" to_port="example set input"/>
    <connect from_op="Filter Example Range" from_port="example set output" to_op="Cross Distances (2)" to_port="reference set"/>
    <connect from_op="Cross Distances (2)" from_port="result set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="90"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Scott

     

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @sgenzer,

     

    Thank you for your feedback.

    Unfortunately, the problem does not come from the Excel files.

    Indeed, with CSV files (see attached files), the results of the process are strictly the same as with the Excel files.

    But thanks to your test, we can tentatively conclude that the problem comes from the Ids in the files.

    Indeed, that is the only difference between your test process and my process (and the only difference between the Employee example set and the Position example set in my process).

    So is there any possibility that the Ids are taken into account in the calculation of the similarities/distances?
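To show what this hypothesis would predict, here is a small Python sketch. It does not claim that RapidMiner includes Ids in the computation; it only compares a Euclidean distance computed with and without an Id column, on invented values, so the prediction can be checked against the actual output.

```python
import math

def euclidean(u, v):
    """Plain Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Hypothetical rows: same experience, different Ids.
employee = {"id": 3, "experience": 4}
position = {"id": 1, "experience": 4}

# Distance if the Id were (wrongly) treated as a regular attribute.
with_id = euclidean(
    [employee["id"], employee["experience"]],
    [position["id"], position["experience"]],
)
# Distance if only the regular attribute is used.
without_id = euclidean([employee["experience"]], [position["experience"]])

print(with_id, without_id)
```

If the Ids did enter the calculation, two otherwise identical rows would get a non-zero distance equal to the difference of their Ids. If the observed distance does not match that value, the cause lies elsewhere.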

     

    Thank you for your response

     

    Best regards, 

     

    Lionel

     

     

     

     
