Append multiple 1 row datasets together

mob Member Posts: 37 Contributor II
edited November 2018 in Help
If I have a folder of 1-row datasets with the same attribute count and types, how do I append them into one larger dataset? The Append operator obviously looks for two inputs, but is it smart enough to know that the second dataset might be coming from a loop operator?

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    You can connect as many inputs to the Append operator as you like; it just expands. Just make sure that every input has exactly the same number of attributes and that they are of the same type.

    Scott
  • mob Member Posts: 37 Contributor II
    I'm really trying to connect a loop operator's example set output to one of the Append inputs, with no other inputs on the Append operator, and end up with one large example set (by row count) instead of multiple datasets each with 1 row.
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Ah, I understand. Some loop operators (like Loop Examples) do NOT actually filter: they pass the entire example set through and just create an "example" macro to use in the loop. Other loop operators (like Loop Values) DO filter and pass only the filtered example set through. You then need to append the results back together again.

    Loop Examples with Iris sample data set
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="6.4.000" expanded="true" height="60" name="Retrieve Iris" width="90" x="112" y="120">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="loop_examples" compatibility="6.4.000" expanded="true" height="76" name="Loop Examples" width="90" x="246" y="120">
            <process expanded="true">
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Iris" from_port="output" to_op="Loop Examples" to_port="example set"/>
          <connect from_op="Loop Examples" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Loop Values with Iris sample data set
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="6.4.000" expanded="true" height="60" name="Retrieve Iris" width="90" x="112" y="120">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="loop_values" compatibility="6.4.000" expanded="true" height="76" name="Loop Values" width="90" x="246" y="120">
            <parameter key="attribute" value="label"/>
            <process expanded="true">
              <connect from_port="example set" to_port="out 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="6.4.000" expanded="true" height="76" name="Append" width="90" x="380" y="120"/>
          <connect from_op="Retrieve Iris" from_port="output" to_op="Loop Values" to_port="example set"/>
          <connect from_op="Loop Values" from_port="out 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Make sense?

    Scott
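
As a rough analogy in plain Python (not RapidMiner code), the Loop Values plus Append pattern described above can be sketched as follows. The `loop_values` and `append` functions and the toy rows are hypothetical stand-ins for the operators and the Iris data; the sketch only models the behaviour as described in this post: one filtered subset per distinct value, then a concatenation back into a single set.

```python
from collections import defaultdict

# Toy stand-in for an example set: a list of dict rows (hypothetical data).
rows = [
    {"a1": 5.1, "label": "setosa"},
    {"a1": 7.0, "label": "versicolor"},
    {"a1": 4.9, "label": "setosa"},
    {"a1": 6.3, "label": "virginica"},
]


def loop_values(example_set, attribute):
    """Return one filtered subset per distinct value of `attribute`."""
    subsets = defaultdict(list)
    for row in example_set:
        subsets[row[attribute]].append(row)
    return list(subsets.values())


def append(example_sets):
    """Concatenate a collection of example sets into one."""
    merged = []
    for example_set in example_sets:
        merged.extend(example_set)
    return merged


subsets = loop_values(rows, "label")
merged = append(subsets)
print(len(subsets), len(merged))
```

The point of the sketch is only that the loop produces a collection of smaller sets, and a single append step turns that collection back into one set with all the rows.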
  • mob Member Posts: 37 Contributor II
    It's more a pre-processing step to create a new dataset from multiple smaller ones which all have the same attributes.
    If you look at this simple setup, the Append operator is fine when I have a small number of datasets (x = 5) that I can wire up individually, but if x = 100 or 1000 it becomes a lot of work and very messy. Is there a way in RapidMiner to accomplish this? Collections would just group them together, but I need to end up with one dataset and forget about the individual datasets from then on.

    Is it a case of needing to start a different way so I don't end up with thousands of small datasets in the first place?

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="5.3.015" expanded="true" height="60" name="Data x = 1" width="90" x="112" y="120"/>
          <operator activated="true" class="generate_data" compatibility="5.3.015" expanded="true" height="60" name="Data X = n" width="90" x="112" y="255"/>
          <operator activated="true" class="append" compatibility="5.3.015" expanded="true" height="94" name="Append" width="90" x="313" y="165"/>
          <operator activated="true" class="store" compatibility="5.3.015" expanded="true" height="60" name="Store" width="90" x="447" y="165"/>
          <connect from_op="Data x = 1" from_port="output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Data X = n" from_port="output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Append" from_port="merged set" to_op="Store" to_port="input"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Actually, what Scott might not have noticed is that the Loop Examples operator has two output ports.
    One is exa ("example set" in the XML), which delivers the last iteration of your inner process if it is connected (or the last iteration of the loop, depending on how you connect it).
    The other port, out ("output 1" in the XML), is the one you want to connect to, as it outputs a collection of the results of all loop iterations.

    Here's Scott's Iris and Loop Examples example again, reworked to show two ways of getting output from the loop operator.

    The third way is to use Remember and Recall if your operator doesn't have an output port.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="6.4.000" expanded="true" height="60" name="Retrieve Iris" width="90" x="112" y="120">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" breakpoints="after" class="loop_examples" compatibility="6.4.000" expanded="true" height="94" name="Loop Examples" width="90" x="246" y="120">
            <process expanded="true">
              <connect from_port="example set" to_port="output 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="6.4.000" expanded="true" height="76" name="Append" width="90" x="447" y="165"/>
          <connect from_op="Retrieve Iris" from_port="output" to_op="Loop Examples" to_port="example set"/>
          <connect from_op="Loop Examples" from_port="example set" to_port="result 1"/>
          <connect from_op="Loop Examples" from_port="output 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    If you already have all your datasets in a repository folder, you can just loop over the folder and append the resulting collection.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.5.000">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="6.5.000" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="loop_repository" compatibility="6.5.000" expanded="true" height="76" name="Loop Repository" width="90" x="45" y="30">
           <parameter key="repository_folder" value="//Local Repository/LoopTest/"/>
           <parameter key="entry_type" value="IOObject"/>
           <process expanded="true">
             <connect from_port="repository object" to_port="out 1"/>
             <portSpacing port="source_repository object" spacing="0"/>
             <portSpacing port="source_in 1" spacing="0"/>
             <portSpacing port="sink_out 1" spacing="0"/>
             <portSpacing port="sink_out 2" spacing="0"/>
           </process>
           <description align="center" color="transparent" colored="false" width="126">Select the desired repository folder (and optional filters)</description>
         </operator>
         <operator activated="true" class="append" compatibility="6.5.000" expanded="true" height="76" name="Append" width="90" x="179" y="30">
           <description align="center" color="transparent" colored="false" width="126">Appends all ExampleSets of the input collection</description>
         </operator>
         <connect from_op="Loop Repository" from_port="out 1" to_op="Append" to_port="example set 1"/>
         <connect from_op="Append" from_port="merged set" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
    Regards,
    Marco
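
For readers who want the idea without opening the process, here is a minimal plain-Python sketch of "loop over a folder of 1-row datasets and append the collection". It is an analogy, not RapidMiner code: `append_all`, `schema_of` and the generated data are hypothetical, and the schema check only mirrors the requirement stated earlier in the thread that all inputs have the same attributes and types.

```python
def schema_of(row):
    """Describe a row as a sorted tuple of (attribute name, type name)."""
    return tuple(sorted((name, type(value).__name__) for name, value in row.items()))


def append_all(datasets):
    """Append datasets row by row, rejecting any row whose schema differs."""
    merged = []
    expected = None
    for dataset in datasets:
        for row in dataset:
            if expected is None:
                expected = schema_of(row)  # first row defines the schema
            elif schema_of(row) != expected:
                raise ValueError(f"schema mismatch: {schema_of(row)} != {expected}")
            merged.append(row)
    return merged


# Hypothetical stand-in for a folder of 1000 one-row datasets.
one_row_sets = [[{"id": i, "value": i * 0.5}] for i in range(1000)]
big = append_all(one_row_sets)
print(len(big))
```

The number of inputs never appears in the wiring: the loop yields a collection of any size, and one append step consumes all of it.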
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    And here's a more complicated example in case you find any datasets that are missing attributes (or have additional ones).
    By using the Remember and Recall operators you can combine the example sets (in this case using Union rather than Append).
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="6.4.000" expanded="true" height="60" name="Data x = 1" width="90" x="45" y="75"/>
          <operator activated="true" class="remember" compatibility="6.4.000" expanded="true" height="60" name="Remember" width="90" x="179" y="165">
            <parameter key="name" value="mycollection"/>
            <description align="center" color="transparent" colored="false" width="126">This happens first to initialise the remember</description>
          </operator>
          <operator activated="true" class="generate_data" compatibility="6.4.000" expanded="true" height="60" name="Data X = n" width="90" x="45" y="345"/>
          <operator activated="true" class="generate_data" compatibility="6.4.000" expanded="true" height="60" name="Data X = n (2)" width="90" x="45" y="435">
            <parameter key="number_of_attributes" value="3"/>
          </operator>
          <operator activated="true" class="generate_data" compatibility="6.4.000" expanded="true" height="60" name="Data X = n (3)" width="90" x="45" y="525">
            <parameter key="number_of_attributes" value="8"/>
          </operator>
          <operator activated="true" class="generate_data" compatibility="6.4.000" expanded="true" height="60" name="Data X = n (4)" width="90" x="45" y="615">
            <parameter key="number_of_attributes" value="10"/>
          </operator>
          <operator activated="true" class="collect" compatibility="6.4.000" expanded="true" height="130" name="Collect" width="90" x="179" y="480"/>
          <operator activated="true" class="loop_collection" compatibility="6.4.000" expanded="true" height="76" name="Loop Collection" width="90" x="380" y="435">
            <process expanded="true">
              <operator activated="true" class="recall" compatibility="6.4.000" expanded="true" height="60" name="Recall" width="90" x="112" y="30">
                <parameter key="name" value="mycollection"/>
              </operator>
              <operator activated="true" class="multiply" compatibility="6.4.000" expanded="true" height="94" name="Multiply" width="90" x="112" y="390"/>
              <operator activated="true" class="union" compatibility="6.4.000" expanded="true" height="76" name="Union" width="90" x="179" y="165"/>
              <operator activated="true" class="remember" compatibility="6.4.000" expanded="true" height="60" name="Remember (2)" width="90" x="313" y="75">
                <parameter key="name" value="mycollection"/>
              </operator>
              <connect from_port="single" to_op="Multiply" to_port="input"/>
              <connect from_op="Recall" from_port="result" to_op="Union" to_port="example set 1"/>
              <connect from_op="Multiply" from_port="output 1" to_op="Union" to_port="example set 2"/>
              <connect from_op="Multiply" from_port="output 2" to_port="output 1"/>
              <connect from_op="Union" from_port="union" to_op="Remember (2)" to_port="store"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">Then this</description>
          </operator>
          <operator activated="true" class="recall" compatibility="6.4.000" expanded="true" height="60" name="Recall (2)" width="90" x="514" y="435">
            <parameter key="name" value="mycollection"/>
            <description align="center" color="transparent" colored="false" width="126">And lastly</description>
          </operator>
          <connect from_op="Data x = 1" from_port="output" to_op="Remember" to_port="store"/>
          <connect from_op="Data X = n" from_port="output" to_op="Collect" to_port="input 1"/>
          <connect from_op="Data X = n (2)" from_port="output" to_op="Collect" to_port="input 2"/>
          <connect from_op="Data X = n (3)" from_port="output" to_op="Collect" to_port="input 3"/>
          <connect from_op="Data X = n (4)" from_port="output" to_op="Collect" to_port="input 4"/>
          <connect from_op="Collect" from_port="collection" to_op="Loop Collection" to_port="collection"/>
          <connect from_op="Loop Collection" from_port="output 1" to_port="result 2"/>
          <connect from_op="Recall (2)" from_port="result" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
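
To illustrate why a union-style combination is used here instead of a strict append, the following plain-Python sketch (an analogy only, not RapidMiner code) combines datasets whose attributes differ. `union_all` and the sample rows are hypothetical, and missing values are modelled as `None`, which is an assumption about how to represent them in Python rather than a statement about the Union operator's internals.

```python
def union_all(datasets):
    """Combine datasets with differing attributes; absent values become None."""
    # Collect every attribute name, keeping first-seen order.
    columns = []
    for dataset in datasets:
        for row in dataset:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Rebuild each row over the full attribute list.
    return [
        {name: row.get(name) for name in columns}
        for dataset in datasets
        for row in dataset
    ]


sets = [
    [{"att1": 1.0, "att2": 2.0, "label": "a"}],
    [{"att1": 3.0, "label": "b"}],                            # missing att2
    [{"att1": 4.0, "att2": 5.0, "att3": 6.0, "label": "c"}],  # extra att3
]
combined = union_all(sets)
for row in combined:
    print(row)
```

A strict append would have to reject the second and third datasets, whereas the union keeps every row and widens the attribute list as needed.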
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Nice examples, and nice use of the new annotation feature!  ;)

    Why did I never notice the second output of the Loop Examples operator?  :o

    Scott