Joining datasets on a non-distinct attribute

kypexinkypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn
edited November 2018 in Help

Hi rapidminers, 

 

Let's say I need to JOIN 2 datasets, A and B, on an attribute 'key'. 

It is distinct for dataset A (each distinct value corresponds to only one example), but not for B (there are few examples with the same value of 'key' in B)

 

What logic of joining will RapidMiner follow in this case? For each join 'A <--> key <--> B' what exactly example will be taken from B? First, last, random, or it totally depends on how dataset was stored in repo so it's not possible to know at all? 

 

Thank you. 

Best Answer

  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Solution Accepted

    So the short version is that it will REPEAT all of the id examples with the different B values:

     

    Data Set 1             Data Set 2

    ID    Last               ID     First  

    1      JONES          2      Lynn

    2      HOLMES       2      Lucile

    3      SILVA            2       Leonardo

     

    After inner join:

    ID    Last                First

    2      HOLMES       Lynn

    2      HOLMES       Lucile

    2      HOLMES       Leonardo

     

    After outer join:

    ID    Last                First

    2      HOLMES       Lynn

    2      HOLMES       Lucile

    2      HOLMES       Leonardo

    1      JONES          ?

    3      SILVA             ?

     

     

    left join is same as inner join in this case

     

    After right join:

    ID    Last                First

    1      JONES          ?

    2      HOLMES       Lynn

    2      HOLMES       Lucile

    2      HOLMES       Leonardo

    3      SILVA             ?

     

    Clear as mud?  :)

     

    Scott

     

Answers

  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    So first it depends on which join you’re doing: inner, outer, left or right. You will most likely get different results depending on the data. Which one are you doing?
  • kypexinkypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @sgenzer

     

    This one is actually an inner join (the most used one, though others might be occasionally present as well).

  • kypexinkypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Clear, Captain.

    Thanks :) 

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Captain Morgan!

Sign In or Register to comment.