How to break apart a table into 2 tables to be joined

mdale9mdale9 Member Posts: 5 Contributor I
edited December 2018 in Help

Hi,

I have a dataset that is the result of a join and I want to break up, back into the 2 source tables with a foreign key id in one of the tables.

For example the table I have looks a bit like this:

 

Invoice_id | Product_weight | Product_Color | Product_size

1                   | 25                      | blue                 | Large

2                   | 5                        | red                   | Small

3                   | 17                      | green               | Large

4                   | 15                      | blue                 | Medium

5                   | 25                      | blue                 | Large

6                   | 25                      | blue                 | Large

 

I want to break it back in to 2 table like this

 

Invoice_id | Product_id

1               |1

2               |2

3               |3

4               |4

5               |1

6               |1

 

and 

Product_id | Product_weight | Product_Color | Product_size

1                   | 25                      | blue                 | Large

2                   | 5                        | red                   | Small

3                   | 17                      | green               | Large

4                   | 15                      | blue                 | Medium

 

I've tried splitting my source dataset into 2 copies and removing duplicates, adding a generated id to each table and re-joining them to get the link between them but it doesn't quite work how I would expect.

Can anyone help me?

 

Thanks

Martin

Tagged:

Answers

  • IngoRMIngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    Hi Martin,

     

    This was a nice problem to solve - you got my attention right away :-)

     

    I think you have been thinking about the right building blocks already yourself (like duplicate removal etc.).  The XML below shows a process which is doing the whole thing.  I have applied it to the attached Excel file which is pretty much the same as the example data you provided.  For other data sets you will need to make small changes.

     

    Hope this helps,
    Ingo

     

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_excel" compatibility="9.0.002" expanded="true" height="68" name="Read Excel" width="90" x="45" y="136">
    <parameter key="excel_file" value="C:\Users\IngoMierswa\Desktop\products.xlsx"/>
    <list key="annotations"/>
    <parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Invoice ID.true.integer.attribute"/>
    <parameter key="1" value="Weight.true.integer.attribute"/>
    <parameter key="2" value="Color.true.polynominal.attribute"/>
    <parameter key="3" value="Size.true.polynominal.attribute"/>
    </list>
    <parameter key="read_not_matching_values_as_missings" value="false"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="9.0.002" expanded="true" height="103" name="Multiply" width="90" x="179" y="136"/>
    <operator activated="true" class="generate_aggregation" compatibility="9.0.002" expanded="true" height="82" name="Generate Aggregation (2)" width="90" x="849" y="187">
    <parameter key="attribute_name" value="product_id"/>
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Invoice ID"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="aggregation_function" value="concatenation"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="9.0.002" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Invoice ID"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" class="remove_duplicates" compatibility="9.0.002" expanded="true" height="103" name="Remove Duplicates" width="90" x="447" y="34">
    <parameter key="attribute" value="Invoice ID"/>
    <parameter key="attributes" value="Color|Size|Weight"/>
    </operator>
    <operator activated="true" class="generate_aggregation" compatibility="9.0.002" expanded="true" height="82" name="Generate Aggregation" width="90" x="581" y="34">
    <parameter key="attribute_name" value="product_id"/>
    <parameter key="aggregation_function" value="concatenation"/>
    </operator>
    <operator activated="true" class="generate_id" compatibility="9.0.002" expanded="true" height="82" name="Generate ID" width="90" x="715" y="34"/>
    <operator activated="true" class="multiply" compatibility="9.0.002" expanded="true" height="103" name="Multiply (2)" width="90" x="849" y="34"/>
    <operator activated="true" class="concurrency:join" compatibility="9.0.002" expanded="true" height="82" name="Join" width="90" x="983" y="187">
    <parameter key="join_type" value="right"/>
    <list key="key_attributes">
    <parameter key="product_id" value="product_id"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="9.0.002" expanded="true" height="82" name="Select Attributes (2)" width="90" x="1117" y="187">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value="product_id"/>
    <parameter key="attributes" value="Invoice ID|id"/>
    </operator>
    <operator activated="true" class="rename" compatibility="9.0.002" expanded="true" height="82" name="Rename" width="90" x="1251" y="187">
    <parameter key="old_name" value="id"/>
    <parameter key="new_name" value="product_id"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="9.0.002" expanded="true" height="82" name="Select Attributes (3)" width="90" x="983" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="product_id"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" class="rename" compatibility="9.0.002" expanded="true" height="82" name="Rename (2)" width="90" x="1117" y="34">
    <parameter key="old_name" value="id"/>
    <parameter key="new_name" value="product_id"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <connect from_op="Read Excel" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Generate Aggregation (2)" to_port="example set input"/>
    <connect from_op="Generate Aggregation (2)" from_port="example set output" to_op="Join" to_port="right"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
    <connect from_op="Remove Duplicates" from_port="example set output" to_op="Generate Aggregation" to_port="example set input"/>
    <connect from_op="Generate Aggregation" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
    <connect from_op="Generate ID" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_op="Select Attributes (3)" to_port="example set input"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_op="Join" to_port="left"/>
    <connect from_op="Join" from_port="join" to_op="Select Attributes (2)" to_port="example set input"/>
    <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Rename" to_port="example set input"/>
    <connect from_op="Rename" from_port="example set output" to_port="result 2"/>
    <connect from_op="Select Attributes (3)" from_port="example set output" to_op="Rename (2)" to_port="example set input"/>
    <connect from_op="Rename (2)" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="147"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>  
Sign In or Register to comment.