keep both join attributes problem

johannesjungblujohannesjungblu Member Posts: 7 Contributor I
edited November 30 in Help

Hello guys, i hope you can help me, i am really struggling with this problem

 

Follwing two simple csv files

File a: 

id;label;cont1
0;       ;   a
1;   1b;   b
2;   2c;   c

 

File b: 

id;label;cont2
0;   0x;   x
1;   1b;   y
2;       ;   z

 

the desired result:

id;label;cont1;cont2
0;   0x;   a;   x
1;   1b;   b;   y
2;   2c;   c;   z

 

Thought way to do: outer join with "keep both join attributes" and no "remove double attributes"

 

But unfortunately i am getting always the same result, no matter the option "remove double attributes" is on or off:

id;label;cont1;cont2
0;     ?;   a;   x
1;   1b;   b;   y
2;   2c;   c;   z

 

Anyone has an idea what i am doing wrong? Tanks!

Tagged:

Answers

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 869   Unicorn

    There may be a way to do this with a single join operator, but I'm not sure.  In the meantime, it can be definitely done by treating the two labels as separate variables (you'll need to name them differently) and then creating an attribute that combines them after the join.  See the attached process for an example (the two csv files I created from the simple examples you gave).

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
    johannesjungblu
  • johannesjungblujohannesjungblu Member Posts: 7 Contributor I

    Thanks for the working quick solution and example! 

     

    I am still hoping for a better solution since this only a small example. In my case I have 15 database tables, which i want to join with each 5 common attributes - which means a quite big subprocess for only doing an outer join with keeping both attributes - still thinks its an bug, cause in my opinion this is excact the case for which the setting was made...

     

    My current solution is an outer join and creation of a new table in the database - but this can only work temporary.

  • mschmitzmschmitz Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 1,824  RM Data Scientist

     

    Hi Johannes,

     

    are you sure that all of the attributes you would like to "duplicate" have role regular? Since special roles need to be unique join has the feature to remove the 2nd attribute with the same role.

     

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • johannesjungblujohannesjungblu Member Posts: 7 Contributor I

    Yes, i double checked it.

  • mschmitzmschmitz Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 1,824  RM Data Scientist

    Dear Johannes,

     

    i've built the process with the data you have written down above, and its working for me? Have a look at the attached process. Maybe there is something different in yours?

     

    ~Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.2.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="subprocess" compatibility="7.2.003" expanded="true" height="82" name="Subprocess" width="90" x="45" y="34">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.2.003" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="id" value="0"/>
    <parameter key="label" value="str(0/0)"/>
    <parameter key="cont1" value="&quot;a&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="7.2.003" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="45" y="136">
    <list key="attribute_values">
    <parameter key="id" value="1"/>
    <parameter key="label" value="&quot;1b&quot;"/>
    <parameter key="cont1" value="&quot;b&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="7.2.003" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="45" y="238">
    <list key="attribute_values">
    <parameter key="id" value="2"/>
    <parameter key="label" value="&quot;2c&quot;"/>
    <parameter key="cont1" value="&quot;c&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="append" compatibility="7.2.003" expanded="true" height="124" name="Append" width="90" x="179" y="85"/>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
    <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append" to_port="example set 3"/>
    <connect from_op="Append" from_port="merged set" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">Gen Table1</description>
    </operator>
    <operator activated="true" class="subprocess" compatibility="7.2.003" expanded="true" height="82" name="Subprocess (2)" width="90" x="45" y="187">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.2.003" expanded="true" height="68" name="Generate Data by User Specification (4)" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="id" value="0"/>
    <parameter key="label" value="&quot;0x&quot;"/>
    <parameter key="cont2" value="&quot;x&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="7.2.003" expanded="true" height="68" name="Generate Data by User Specification (5)" width="90" x="45" y="136">
    <list key="attribute_values">
    <parameter key="id" value="1"/>
    <parameter key="label" value="&quot;1b&quot;"/>
    <parameter key="cont2" value="&quot;y&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="7.2.003" expanded="true" height="68" name="Generate Data by User Specification (6)" width="90" x="45" y="238">
    <list key="attribute_values">
    <parameter key="id" value="2"/>
    <parameter key="cont2" value="&quot;z&quot;"/>
    <parameter key="label" value="str(0/0)"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="append" compatibility="7.2.003" expanded="true" height="124" name="Append (2)" width="90" x="179" y="85"/>
    <connect from_op="Generate Data by User Specification (4)" from_port="output" to_op="Append (2)" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (5)" from_port="output" to_op="Append (2)" to_port="example set 2"/>
    <connect from_op="Generate Data by User Specification (6)" from_port="output" to_op="Append (2)" to_port="example set 3"/>
    <connect from_op="Append (2)" from_port="merged set" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">Gen Table2</description>
    </operator>
    <operator activated="true" class="join" compatibility="7.2.003" expanded="true" height="82" name="Join" width="90" x="246" y="85">
    <parameter key="remove_double_attributes" value="false"/>
    <parameter key="join_type" value="outer"/>
    <parameter key="use_id_attribute_as_key" value="false"/>
    <list key="key_attributes">
    <parameter key="id" value="id"/>
    </list>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.2.003" expanded="true" height="82" name="Generate Attributes" width="90" x="447" y="85">
    <list key="function_descriptions">
    <parameter key="RealLabel" value="if(missing(label),label_from_ES2,label)"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">Built Real Label</description>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.2.003" expanded="true" height="82" name="Select Attributes" width="90" x="648" y="85">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="cont1_from_ES2|label_from_ES2|label"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <connect from_op="Subprocess" from_port="out 1" to_op="Join" to_port="left"/>
    <connect from_op="Subprocess (2)" from_port="out 1" to_op="Join" to_port="right"/>
    <connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 869   Unicorn

    Hey @mschmitz this is basically the same thing as the process I posted--it keeps the two "label" variables separate through the join and then created a combined label using "generate attributes" after that.  But I think @johannesjungblu wants this to be done natively by the outer join operator without additional manipulation afterwards.  

     

    I don't think it is a bug per se, but I tend to agree with him that it is potentially confusing, because the operator help specifies that only the left attribute values are kept if there are double attributes, even if the join type is other than a left join.  So perhaps this is an idea for a product enhancement that could be passed along to the development team to have it take the value of a double attribute based on the join type (left=left value, right=right value, outer=union value, etc.).

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
    johannesjungblu
  • mschmitzmschmitz Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 1,824  RM Data Scientist

    Hi Brian,

     

    to be honest, i do not think its a bug. I think its doing what it is supposed to do.

    I think the desired behaviour for Johannes can be archived with a two operators. Loop Attributes + Generate Attributes. Then you can check for everything if it is missing and if so, take the right hand side.

     

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • johannesjungblujohannesjungblu Member Posts: 7 Contributor I

    You are right!

     

    I don´t get the option "keep both join attributes".  The description implies my desired result:

     

    keep both join attributes

    Description: If checked, both columns of a join pair will be kept. Usually this is unneccessary since both attributes are identical. It may be useful to keep such a column if there are missing values on one side.

     

    For what else is this option good?

  • mschmitzmschmitz Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 1,824  RM Data Scientist

    Johannes,

     

    it does what it supposed to do? If you check it, you get both attributes. ID and ID_from_left_ES. That's what its supposed to do, right?

     

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
    sgenzer
  • johannesjungblujohannesjungblu Member Posts: 7 Contributor I

    ahhhh sorry... I obviously missed that.

     

    Is there a chance a "keep and combine attributes option" will be implemented in the near future?

  • bhupendra_patilbhupendra_patil Employee, Member Posts: 167  RM Data Scientist

    can you elaborate on what you mean by keep and combine,

     

    Not sure if you have seen the advanced options, click on show advanced options and it will allow you to select additonal column handling conditions

     

  • johannesjungblujohannesjungblu Member Posts: 7 Contributor I

    keep and combine should keep every pair of duplicate attributes in one atribute with a combined policy like: "take the not null entry". 

     

    If you mean the advanced options with remove and keep - yes.

  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager Posts: 1,867  Community Manager

    hi...I would just rename the labels as label1 and label2 before the join.  You're not joining on label anyway.  Then you avoid this whole issue.  Occam's Razor?

     

    Scott

  • johannesjungblujohannesjungblu Member Posts: 7 Contributor I

    yes, i work with this solution, but it´s annoying rename and combine attriutes if you join 15 tables, if there is a possibility an option can make this...

Sign In or Register to comment.