Problem combining all ExampleSets from an IOObjectCollection

binsetyawan Member Posts: 46 Guru
edited November 2018 in Help

Hello everyone,

I'm running a loop that creates one ExampleSet per iteration, so I end up with an IOObjectCollection on the output. My problem is joining all the ExampleSets that I get from looping over the attributes into a single ExampleSet. I've tried every join operator, but I'm stuck. I set the attribute "No" as the ID, and its values are the same in every ExampleSet. For example, my data looks like this:

example set 1:

No  att1
1
2

example set 2:

No  att2
1
2

example set 3:

No  att3
1
2

The result that I want looks like this:

example set:

No  att1  att2  att3
1
2

 

I looked for a reference and found a similar post, but I'm still stuck. Here is the similar post: http://community.rapidminer.com/t5/Original-Rapid-I-Forum/Combining-Example-Set-Attributes/m-p/12879

Best Answers

  • Edin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist
    Solution Accepted

    Hi,

     

    I have attached an example process and the XML which should solve your problem.

    Some key takeaways:

    1. The solution uses the Join operator together with Remember / Recall inside a Loop Collection.
    2. Joining needs an ID attribute: either create one or use an existing one. Then make sure you select the desired join type.
    3. The IDs need to have the same value type (e.g. numerical) in every ExampleSet. The operators under Blending -> Attributes -> Types can help here.
    4. A Join always needs two ExampleSets, so in the first iteration I only Remember the first ExampleSet.
    5. In each subsequent iteration of the loop, the Remembered dataset is Recalled, Joined with the current ExampleSet, and Remembered again.
    6. In the end, the final dataset can be Recalled outside of the Loop Collection.

    Please keep in mind that Remember / Recall are great operators, but I do not recommend using them for huge datasets.

     

    Best,

    Edin

     

    Here is the XML:

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data" compatibility="7.5.001" expanded="true" height="68" name="Generate Data (2)" width="90" x="45" y="34">
    <parameter key="number_of_attributes" value="1"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.5.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="att1"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.5.001" expanded="true" height="124" name="Multiply" width="90" x="313" y="34"/>
    <operator activated="true" class="rename" compatibility="7.5.001" expanded="true" height="82" name="Rename (4)" width="90" x="447" y="238">
    <parameter key="old_name" value="att1"/>
    <parameter key="new_name" value="att3"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <operator activated="true" class="rename" compatibility="7.5.001" expanded="true" height="82" name="Rename (3)" width="90" x="447" y="136">
    <parameter key="old_name" value="att1"/>
    <parameter key="new_name" value="att2"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <operator activated="true" breakpoints="after" class="collect" compatibility="7.5.001" expanded="true" height="124" name="Collect" width="90" x="581" y="34"/>
    <operator activated="true" class="loop_collection" compatibility="7.5.001" expanded="true" height="68" name="Loop Collection (2)" width="90" x="715" y="34">
    <parameter key="set_iteration_macro" value="true"/>
    <process expanded="true">
    <operator activated="true" class="generate_id" compatibility="7.5.001" expanded="true" height="82" name="Generate ID (2)" width="90" x="179" y="238"/>
    <operator activated="true" class="branch" compatibility="7.5.001" expanded="true" height="82" name="Branch (2)" width="90" x="514" y="238">
    <parameter key="condition_type" value="expression"/>
    <parameter key="expression" value="%{iteration}==1"/>
    <process expanded="true">
    <operator activated="true" class="remember" compatibility="7.5.001" expanded="true" height="68" name="Remember (3)" width="90" x="45" y="34">
    <parameter key="name" value="dataset"/>
    </operator>
    <connect from_port="condition" to_op="Remember (3)" to_port="store"/>
    <portSpacing port="source_condition" spacing="0"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_input 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="recall" compatibility="7.5.001" expanded="true" height="68" name="Recall (3)" width="90" x="45" y="34">
    <parameter key="name" value="dataset"/>
    </operator>
    <operator activated="true" class="join" compatibility="7.5.001" expanded="true" height="82" name="Join (2)" width="90" x="179" y="85">
    <parameter key="join_type" value="left"/>
    <list key="key_attributes"/>
    </operator>
    <operator activated="true" class="remember" compatibility="7.5.001" expanded="true" height="68" name="Remember (4)" width="90" x="313" y="85">
    <parameter key="name" value="dataset"/>
    </operator>
    <connect from_port="condition" to_op="Join (2)" to_port="right"/>
    <connect from_op="Recall (3)" from_port="result" to_op="Join (2)" to_port="left"/>
    <connect from_op="Join (2)" from_port="join" to_op="Remember (4)" to_port="store"/>
    <portSpacing port="source_condition" spacing="105"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_input 1" spacing="0"/>
    </process>
    </operator>
    <connect from_port="single" to_op="Generate ID (2)" to_port="example set input"/>
    <connect from_op="Generate ID (2)" from_port="example set output" to_op="Branch (2)" to_port="condition"/>
    <portSpacing port="source_single" spacing="189"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="315" resized="true" width="378" x="63" y="30">Either &lt;br/&gt;- Generate an ID&lt;br/&gt;- Set the Role for an attribute to ID&lt;br/&gt;&lt;br/&gt;Important is that the attribute names in the final exampleset must be unique&lt;br/&gt;&lt;br/&gt;In addition the value type (Numerical vs. Polynominal) of the ID attribute has to be the same for each ExampleSet</description>
    </process>
    </operator>
    <operator activated="true" class="recall" compatibility="7.5.001" expanded="true" height="68" name="Recall (2)" width="90" x="849" y="34">
    <parameter key="name" value="dataset"/>
    </operator>
    <connect from_op="Generate Data (2)" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Collect" to_port="input 1"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Rename (3)" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 3" to_op="Rename (4)" to_port="example set input"/>
    <connect from_op="Rename (4)" from_port="example set output" to_op="Collect" to_port="input 3"/>
    <connect from_op="Rename (3)" from_port="example set output" to_op="Collect" to_port="input 2"/>
    <connect from_op="Collect" from_port="collection" to_op="Loop Collection (2)" to_port="collection"/>
    <connect from_op="Recall (2)" from_port="result" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
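
For readers who think in code, the Remember / Recall loop described in the takeaways amounts to a running join over the collection. The sketch below is plain Python, not RapidMiner. The helper names `left_join` and `join_all`, the use of lists of dicts as stand-ins for ExampleSets, and the attribute values are all assumptions made for illustration. Only the ID name "No", the attribute names, and the left join type come from the thread.

```python
from functools import reduce


def left_join(left, right, key="No"):
    """Left join two tables (lists of dict rows) on the ID column `key`."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key], {})
        # Keep the left row and add the non-ID columns of the matching right row.
        extra = {name: value for name, value in match.items() if name != key}
        joined.append({**row, **extra})
    return joined


def join_all(collection, key="No"):
    """Fold a collection into one table.

    The first table plays the role of the initial Remember; every later
    table is joined onto the remembered result, which is then stored again.
    """
    return reduce(lambda remembered, current: left_join(remembered, current, key),
                  collection)


# Hypothetical values; the original post does not show any.
set1 = [{"No": 1, "att1": 0.1}, {"No": 2, "att1": 0.2}]
set2 = [{"No": 1, "att2": 10}, {"No": 2, "att2": 20}]
set3 = [{"No": 1, "att3": "a"}, {"No": 2, "att3": "b"}]

combined = join_all([set1, set2, set3])
for row in combined:
    print(row)
```

Each row of `combined` carries "No" plus att1, att2 and att3, which is the shape asked for in the question. The ID values must compare equal across tables, which mirrors the requirement that the ID attribute has the same value type in every ExampleSet.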

     

  • Edin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist
    Solution Accepted

    I included a Breakpoint in my solution right after the Collect operator. It is shown as a red square symbol.

    A Breakpoint pauses the Process and shows the intermediate result.

    You have three options:

    1. Before starting the Process:
      1. Remove the Breakpoint by clicking on the operator it is assigned to and pressing the shortcut key F7
      2. Remove the Breakpoint by right-clicking on the operator it is assigned to and unchecking "Breakpoint After"
    2. After starting the Process: resume the Process by clicking Run Process again (shortcut key F11)

    Best regards,

    Edin

     

  • Edin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist
    Solution Accepted

    The Process itself is correct.

    The reason for your problem is that each role (as well as each attribute name) can occur only once in an ExampleSet. Therefore the prediction is always overwritten.

    Thus you need to change the role for each attribute. If all attributes have different names, you can use a solution similar to the one shown in the screenshot below.

    image.png

     

     

    Best regards,

    Edin
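
The overwriting effect described in this answer can be shown with a small analogy in plain Python. This is only a sketch: a dict stands in for one example, the helper `merge_row` and the prediction values are hypothetical, and RapidMiner's role handling is not implemented this way. It shows why a name that can occur only once keeps just the last value, and how a distinct name per iteration avoids that.

```python
def merge_row(base, new_columns, suffix=None):
    """Merge new_columns into a copy of base.

    Without a suffix, a column that already exists is overwritten.
    With a suffix, the new column gets a unique name and nothing is lost.
    """
    merged = dict(base)
    for name, value in new_columns.items():
        target = f"{name}_{suffix}" if suffix else name
        merged[target] = value
    return merged


predictions = [0.9, 0.4, 0.7]  # hypothetical predictions from three iterations

# Same name in every iteration: only the last prediction survives.
overwritten = {"No": 1}
for value in predictions:
    overwritten = merge_row(overwritten, {"prediction": value})

# A distinct name per iteration: every prediction is kept.
kept = {"No": 1}
for index, value in enumerate(predictions, start=1):
    kept = merge_row(kept, {"prediction": value}, suffix=f"att{index}")

print(overwritten)
print(kept)
```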

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    You can append these all together, but first the attributes need to be renamed so that the datasets have the same structure (attribute names and data types). Try Rename by Generic Names followed by an Append, and you should get a resulting dataset that you can then transpose.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • binsetyawan Member Posts: 46 Guru

    I tried your recommendation, but an error appears: "duplicate attribute name". I put Rename by Generic Names inside the Loop Attributes operator, and Append and Transpose outside the loop.

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I wouldn't put Rename by Generic Names inside the loop; I'd do it outside the loop.

  • binsetyawan Member Posts: 46 Guru

    That gives an error too: "your connection is producing wrong type data". Maybe this is because, after the loop, the data is an IOObjectCollection, while Rename by Generic Names expects a single ExampleSet.

  • binsetyawan Member Posts: 46 Guru

    Thank you for the operator reference, the tips, and the example. I'll try it with the model that I built.

    P.S.: When I run your example, the result is still an object collection containing several example sets.

     

    Regards,

    Bintang

  • binsetyawan Member Posts: 46 Guru

    I looked for another example and found a model that is similar to yours, and its result is what I'm looking for. But when I tried it with my model, an error appeared on the Recall operator inside the Branch operator: "no object with name X was found during retrieval from the object store", even though I adjusted my model to match it.

    Here is the XML code of the model that I adapted mine to:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.1.008">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="6.1.000-SNAPSHOT" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="subprocess" compatibility="6.1.000-SNAPSHOT" expanded="true" height="76" name="Subprocess" width="90" x="112" y="30">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="6.1.000-SNAPSHOT" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="45" y="30">
    <list key="attribute_values">
    <parameter key="id" value="1"/>
    <parameter key="col1" value="48"/>
    </list>
    <list key="set_additional_roles">
    <parameter key="id" value="id"/>
    </list>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.1.000-SNAPSHOT" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="45" y="120">
    <list key="attribute_values">
    <parameter key="id" value="2"/>
    <parameter key="col1" value="4"/>
    </list>
    <list key="set_additional_roles">
    <parameter key="id" value="id"/>
    </list>
    </operator>
    <operator activated="true" class="append" compatibility="6.1.000-SNAPSHOT" expanded="true" height="94" name="Append" width="90" x="179" y="30"/>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.1.000-SNAPSHOT" expanded="true" height="60" name="Generate Data by User Specification (3)" width="90" x="45" y="210">
    <list key="attribute_values">
    <parameter key="id" value="1"/>
    <parameter key="col2" value="9"/>
    </list>
    <list key="set_additional_roles">
    <parameter key="id" value="id"/>
    </list>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.1.000-SNAPSHOT" expanded="true" height="60" name="Generate Data by User Specification (4)" width="90" x="45" y="300">
    <list key="attribute_values">
    <parameter key="id" value="2"/>
    <parameter key="col2" value="7"/>
    </list>
    <list key="set_additional_roles">
    <parameter key="id" value="id"/>
    </list>
    </operator>
    <operator activated="true" class="append" compatibility="6.1.000-SNAPSHOT" expanded="true" height="94" name="Append (2)" width="90" x="179" y="210"/>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.1.000-SNAPSHOT" expanded="true" height="60" name="Generate Data by User Specification (5)" width="90" x="45" y="390">
    <list key="attribute_values">
    <parameter key="id" value="1"/>
    <parameter key="col3" value="88"/>
    </list>
    <list key="set_additional_roles">
    <parameter key="id" value="id"/>
    </list>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.1.000-SNAPSHOT" expanded="true" height="60" name="Generate Data by User Specification (6)" width="90" x="45" y="480">
    <list key="attribute_values">
    <parameter key="id" value="2"/>
    <parameter key="col3" value="78"/>
    </list>
    <list key="set_additional_roles">
    <parameter key="id" value="id"/>
    </list>
    </operator>
    <operator activated="true" class="append" compatibility="6.1.000-SNAPSHOT" expanded="true" height="94" name="Append (3)" width="90" x="179" y="390"/>
    <operator activated="true" class="collect" compatibility="6.1.000-SNAPSHOT" expanded="true" height="112" name="Collect" width="90" x="380" y="210"/>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
    <connect from_op="Append" from_port="merged set" to_op="Collect" to_port="input 1"/>
    <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append (2)" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (4)" from_port="output" to_op="Append (2)" to_port="example set 2"/>
    <connect from_op="Append (2)" from_port="merged set" to_op="Collect" to_port="input 2"/>
    <connect from_op="Generate Data by User Specification (5)" from_port="output" to_op="Append (3)" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (6)" from_port="output" to_op="Append (3)" to_port="example set 2"/>
    <connect from_op="Append (3)" from_port="merged set" to_op="Collect" to_port="input 3"/>
    <connect from_op="Collect" from_port="collection" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="multiply" compatibility="6.1.000-SNAPSHOT" expanded="true" height="94" name="Multiply (2)" width="90" x="246" y="30"/>
    <operator activated="true" class="select" compatibility="6.1.000-SNAPSHOT" expanded="true" height="60" name="Select (2)" width="90" x="447" y="30"/>
    <operator activated="true" class="remember" compatibility="6.1.000-SNAPSHOT" expanded="true" height="60" name="Remember" width="90" x="581" y="30">
    <parameter key="name" value="1"/>
    </operator>
    <operator activated="true" class="loop_collection" compatibility="6.1.000-SNAPSHOT" expanded="true" height="76" name="Loop Collection" width="90" x="447" y="165">
    <parameter key="set_iteration_macro" value="true"/>
    <process expanded="true">
    <operator activated="true" class="branch" compatibility="6.1.000-SNAPSHOT" expanded="true" height="76" name="Branch" width="90" x="112" y="120">
    <parameter key="condition_type" value="expression"/>
    <parameter key="condition_value" value="%{iteration}==1"/>
    <process expanded="true">
    <connect from_port="condition" to_port="input 1"/>
    <portSpacing port="source_condition" spacing="0"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_input 1" spacing="0"/>
    <portSpacing port="sink_input 2" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="recall" compatibility="6.1.000-SNAPSHOT" expanded="true" height="60" name="Recall" width="90" x="112" y="75">
    <parameter key="name" value="1"/>
    </operator>
    <operator activated="true" class="join" compatibility="6.1.000-SNAPSHOT" expanded="true" height="76" name="Join" width="90" x="246" y="30">
    <list key="key_attributes"/>
    </operator>
    <operator activated="true" class="remember" compatibility="6.1.000-SNAPSHOT" expanded="true" height="60" name="Remember (2)" width="90" x="380" y="30">
    <parameter key="name" value="1"/>
    </operator>
    <connect from_port="condition" to_op="Join" to_port="left"/>
    <connect from_op="Recall" from_port="result" to_op="Join" to_port="right"/>
    <connect from_op="Join" from_port="join" to_op="Remember (2)" to_port="store"/>
    <connect from_op="Remember (2)" from_port="stored" to_port="input 1"/>
    <portSpacing port="source_condition" spacing="0"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_input 1" spacing="0"/>
    <portSpacing port="sink_input 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="single" to_op="Branch" to_port="condition"/>
    <connect from_op="Branch" from_port="input 1" to_port="output 1"/>
    <portSpacing port="source_single" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="recall" compatibility="6.1.000-SNAPSHOT" expanded="true" height="60" name="Recall (2)" width="90" x="581" y="165">
    <parameter key="name" value="1"/>
    </operator>
    <connect from_op="Subprocess" from_port="out 1" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_op="Select (2)" to_port="collection"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_op="Loop Collection" to_port="collection"/>
    <connect from_op="Select (2)" from_port="selected" to_op="Remember" to_port="store"/>
    <connect from_op="Recall (2)" from_port="result" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
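
The error message quoted in this post is the kind that appears when a Recall asks for a name under which nothing has been stored yet, for example because the matching Remember has not run or uses a different name. The sketch below is a plain-Python analogy, not RapidMiner code. The class `ObjectStore` and its methods are hypothetical stand-ins for Remember and Recall, and it makes no claim about which of these causes applied to the model in question.

```python
class ObjectStore:
    """Minimal stand-in for a name-to-object store (hypothetical, for illustration)."""

    def __init__(self):
        self._objects = {}

    def remember(self, name, obj):
        # Store obj under name, replacing any earlier object with that name.
        self._objects[name] = obj

    def recall(self, name):
        # Fail loudly if nothing was stored under this exact name.
        if name not in self._objects:
            raise KeyError(f"no object with name {name!r} was found in the object store")
        return self._objects[name]


store = ObjectStore()

# Recalling before anything was remembered fails.
try:
    store.recall("dataset")
    recalled_too_early = False
except KeyError as error:
    recalled_too_early = True
    print(error)

# After a Remember under the same name, the Recall succeeds.
store.remember("dataset", [{"No": 1}])
print(store.recall("dataset"))

# A different name still fails, even though something is stored.
try:
    store.recall("1")
    name_mismatch = False
except KeyError:
    name_mismatch = True
```

In process terms, the two things worth checking are that the Remember runs before the first Recall and that both use exactly the same name.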
  • Edin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist

    I used the XML you posted in my RapidMiner (v 7.5.001) and it worked perfectly.

    Did I miss something?

     

    Best,

    Edin

  • binsetyawan Member Posts: 46 Guru

    When I run your XML code, the result is an IOObjectCollection with 3 example sets that are not yet joined into one example set. That is why I looked for another reference and found the other XML code (in my previous reply).

  • binsetyawan Member Posts: 46 Guru

    Ah, thank you so much. I didn't realize there was a breakpoint (I'm still new to RapidMiner). I'll try it with my model.

  • binsetyawan Member Posts: 46 Guru

    After the join, only the first example set appears. Here is my model, which I combined with your XML code. Is there a mistake in my configuration?

  • binsetyawan Member Posts: 46 Guru

    Ah, so that's the problem. Thank you for helping me, sir! Expect a question about another topic from me soon :smileyvery-happy:

  • binsetyawan Member Posts: 46 Guru

    One thing makes me curious: in the model that I built, all example sets run the same neural network model. But every example set should have its own neural network model, right? Can I run the neural network with different neuron sizes, training cycles, learning rates, and momentum for each example set? How would I do that?

  • Edin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist

    You may want to have a look at the Optimize Parameters (Grid) operator.

    The operator's help links to a tutorial process that should point you in the right direction.

     

    Best regards,

    Edin

  • binsetyawan Member Posts: 46 Guru

    Yes, I tried it and found the best ANN model for each example set, but how do I apply it to each example set? If I put the parameters into the Neural Network operator, that is only one neural network model, which means this one model is applied to all example sets, right?

  • mskinner Member Posts: 10 Contributor I

    I tried the posted solution.

    I found that I prefer it with a Union operator instead of the Join. With the Join, it would either repeat the column header, modified by the source it came from, or keep only one instance of the attribute value.

  • Edin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist

    @binsetyawan Since you are doing everything within Loop Attributes, each Attribute has its own model. Does that answer your question?

     

    @mskinner I suppose that depends on your use case. Did you just replace the Join with the Union Operator? Since Union simply appends your ExampleSets, the number of examples in the final ExampleSet can drastically increase.

     

    Best,

    Edin
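
The difference between the two operators discussed here can be sketched in plain Python. The functions `union` and `inner_join` below are hypothetical stand-ins written for this illustration, operating on lists of dicts with made-up values; they are not RapidMiner's implementations. The assumption is that a union stacks the examples and fills attributes missing from one side with empty values, while a join matches examples on the ID and widens them.

```python
def union(left, right):
    """Stack the rows of both tables; columns missing from a row become None."""
    columns = []
    for row in left + right:
        for name in row:
            if name not in columns:
                columns.append(name)
    return [{name: row.get(name) for name in columns} for row in left + right]


def inner_join(left, right, key="No"):
    """Match rows on the ID column and combine their columns into one row."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


# Hypothetical values for two of the example sets from the question.
set1 = [{"No": 1, "att1": 0.1}, {"No": 2, "att1": 0.2}]
set2 = [{"No": 1, "att2": 10}, {"No": 2, "att2": 20}]

stacked = union(set1, set2)       # more rows, gaps in the new columns
widened = inner_join(set1, set2)  # same rows, one extra column

print(len(stacked), "rows after union")
print(len(widened), "rows after join")
```

With attribute names that differ per example set, as in the original question, the union doubles the number of examples and leaves gaps, whereas the join keeps one example per ID. When the example sets share the same attribute names, stacking is the more natural result, which may explain why the two posters report opposite experiences.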

     

  • binsetyawan Member Posts: 46 Guru

    Yes, I tried Optimize Parameters (Grid) and each example set got its own model, but how do I apply each model to its own example set when I use Loop Attributes?

    Is that possible with RapidMiner? @Edin_Klapic @Thomas_Ott

  • mskinner Member Posts: 10 Contributor I

    I observed the exact opposite behavior. With the Join, any attribute that was already there was renamed and added as a new attribute, so the file size was huge.

    When I used Union, it added the new examples under the appropriate attribute if it existed, and only created a new attribute when it did not already exist in the set it was being joined with.

  • binsetyawan Member Posts: 46 Guru

    Yes, it depends on your case. In my case I need each example set to create a new attribute.
