Options

How to "unjoin" a dataset?

mdale9mdale9 Member Posts: 5 Contributor I
edited November 2018 in Help

Hi,

I have a dataset that is the result of a join and I want to break up, back into the 2 source tables with a foreign key id in one of the tables.

For example the table I have looks a bit like this:

 

Invoice_id | Product_weight | Product_Color | Product_size

1                   | 25                      | blue                 | Large

2                   | 5                        | red                   | Small

3                   | 17                      | green               | Large

4                   | 15                      | blue                 | Medium

5                   | 25                      | blue                 | Large

6                   | 25                      | blue                 | Large

 

I want to break it back in to 2 tables like this:

 

Invoice_id | Product_id

1               |1

2               |2

3               |3

4               |4

5               |1

6               |1

 

and 

Product_id | Product_weight | Product_Color | Product_size

1                   | 25                      | blue                 | Large

2                   | 5                        | red                   | Small

3                   | 17                      | green               | Large

4                   | 15                      | blue                 | Medium

 

How do I do this?

Answers

  • Options
    sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hi @mdale9 - so I assume you created this via a left-hand-join like this, joining on "Product_id" as your key:

     

    Screen Shot 2018-10-22 at 11.34.11 AM.png

     

    Hence the result is what you say:

     

    Screen Shot 2018-10-22 at 11.35.14 AM.pngScreen Shot 2018-10-22 at 11.35.18 AM.pngScreen Shot 2018-10-22 at 11.35.22 AM.png

     

    So if this is the case, to "unjoin" is to simply select the first two attributes (Invoice_id and Product_id) from the result as this is exactly what join did to make them in the first place (that is left join):

     

    Screen Shot 2018-10-22 at 11.37.29 AM.png

     

    As for the other side, when you join you create duplicates. So removing duplicates should do the trick:

     

    Screen Shot 2018-10-22 at 11.40.26 AM.png

     

    Here's the process from soup to nuts:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" breakpoints="after" class="retrieve" compatibility="9.0.003" expanded="true" height="68" name="Retrieve table1" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Local Repository/table1"/>
    </operator>
    <operator activated="true" breakpoints="after" class="retrieve" compatibility="9.0.003" expanded="true" height="68" name="Retrieve table2" width="90" x="45" y="136">
    <parameter key="repository_entry" value="//Local Repository/table2"/>
    </operator>
    <operator activated="true" class="concurrency:join" compatibility="9.0.003" expanded="true" height="82" name="Join" width="90" x="179" y="34">
    <parameter key="join_type" value="left"/>
    <list key="key_attributes">
    <parameter key="Product_id" value="Product_id"/>
    </list>
    </operator>
    <operator activated="true" class="multiply" compatibility="9.0.003" expanded="true" height="103" name="Multiply" width="90" x="313" y="85"/>
    <operator activated="true" class="select_attributes" compatibility="9.0.003" expanded="true" height="82" name="Select Attributes (2)" width="90" x="447" y="136">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Product_id"/>
    <parameter key="attributes" value="Invoice_id|Product_id"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" class="remove_duplicates" compatibility="9.0.003" expanded="true" height="103" name="Remove Duplicates" width="90" x="581" y="136"/>
    <operator activated="true" class="select_attributes" compatibility="9.0.003" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Invoice_id|Product_id"/>
    </operator>
    <connect from_op="Retrieve table1" from_port="output" to_op="Join" to_port="left"/>
    <connect from_op="Retrieve table2" from_port="output" to_op="Join" to_port="right"/>
    <connect from_op="Join" from_port="join" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Select Attributes (2)" to_port="example set input"/>
    <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
    <connect from_op="Remove Duplicates" from_port="example set output" to_port="result 2"/>
    <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    Example sets are attached.


    Scott

     

     

  • Options
    mdale9mdale9 Member Posts: 5 Contributor I

    Hi Scott,

     

    Not quite.  If you look at my original dataset, I don't have a product_id in the joined dataset and hence can't extract it, unlike in your Exampleset(Join).  I need to remove duplicates, generate a new product_id and then figure out a way of getting that original link between invoice and product.

     

    thanks for taking the time to look at this though.

     

    Martin

Sign In or Register to comment.