"Loop over datasets in repository?"

wessel Member Posts: 537 Maven
edited June 2019 in Help
Dear All,

How can I execute the same process for each dataset in a repository folder?
I can't figure out how to use the "Loop Repository" operator.

Best regards,

Wessel

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    The following process shows the basic usage: it loops over the example sets in the samples directory and delivers them as a collection (of course, you could do anything else inside the loop rather than just deliver the data). In addition, it records the size of each data set with a logging operator, which demonstrates how to use the predefined macros inside the loop.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.017">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
        <process expanded="true" height="145" width="212">
          <operator activated="true" class="loop_repository" compatibility="5.1.017" expanded="true" height="76" name="Loop Repository" width="90" x="45" y="30">
            <parameter key="repository_folder" value="//Samples/data/"/>
            <parameter key="entry_type" value="IOObject"/>
            <process expanded="true" height="574" width="840">
              <operator activated="true" class="extract_macro" compatibility="5.1.017" expanded="true" height="60" name="Extract Macro" width="90" x="45" y="30">
                <parameter key="macro" value="size"/>
              </operator>
              <operator activated="true" class="provide_macro_as_log_value" compatibility="5.1.017" expanded="true" height="76" name="Provide Macro as Log Value (2)" width="90" x="179" y="30">
                <parameter key="macro_name" value="repository_path"/>
              </operator>
              <operator activated="true" class="log" compatibility="5.1.017" expanded="true" height="76" name="Log" width="90" x="313" y="30">
                <list key="log">
                  <parameter key="Dataset" value="operator.Provide Macro as Log Value (2).value.macro_value"/>
                  <parameter key="Size" value="operator.Extract Macro.value.macro_value"/>
                </list>
              </operator>
              <connect from_port="repository object" to_op="Extract Macro" to_port="example set"/>
              <connect from_op="Extract Macro" from_port="example set" to_op="Provide Macro as Log Value (2)" to_port="through 1"/>
              <connect from_op="Provide Macro as Log Value (2)" from_port="through 1" to_op="Log" to_port="through 1"/>
              <connect from_op="Log" from_port="through 1" to_port="out 1"/>
              <portSpacing port="source_repository object" spacing="0"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Loop Repository" from_port="out 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Cheers,
    Ingo
  • wessel Member Posts: 537 Maven
    Dear Ingo,

    Thanks a lot.
    This works like a charm.

    Unfortunately it gives both warnings and errors:
    - Expected ExampleSet but received IOObject.
    - Meta data is underspecified. Cannot check precondition.
    I use this inside a larger process and get these messages more than 20 times.
    This is a bit of a bummer, because now I can't spot the other, actually important errors.

    Best regards,

    Wessel


    edit: I think the trick is to pass the meta data from the first dataset in the folder.
  • wessel Member Posts: 537 Maven
    This process has no errors, although it is a bit weird that you retrieve each dataset twice:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.017">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
        <process expanded="true" height="432" width="1043">
          <operator activated="true" class="loop_repository" compatibility="5.1.017" expanded="true" height="76" name="Loop Repository" width="90" x="179" y="184">
            <parameter key="repository_folder" value="//Samples/data/"/>
            <parameter key="entry_type" value="IOObject"/>
            <parameter key="entry_name_macro" value="Golf"/>
            <process expanded="true" height="432" width="705">
              <operator activated="true" class="retrieve" compatibility="5.1.017" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
                <parameter key="repository_entry" value="//Samples/data/%{Golf}"/>
              </operator>
              <operator activated="true" class="remove_useless_attributes" compatibility="5.1.017" expanded="true" height="76" name="Remove Useless Attributes" width="90" x="180" y="30"/>
              <operator activated="true" class="extract_macro" compatibility="5.1.017" expanded="true" height="60" name="Extract Macro" width="90" x="315" y="30">
                <parameter key="macro" value="size"/>
              </operator>
              <operator activated="true" class="provide_macro_as_log_value" compatibility="5.1.017" expanded="true" height="76" name="Provide Macro as Log Value (2)" width="90" x="450" y="30">
                <parameter key="macro_name" value="repository_path"/>
              </operator>
              <operator activated="true" class="log" compatibility="5.1.017" expanded="true" height="76" name="Log" width="90" x="585" y="30">
                <list key="log">
                  <parameter key="Dataset" value="operator.Provide Macro as Log Value (2).value.macro_value"/>
                  <parameter key="Size" value="operator.Extract Macro.value.macro_value"/>
                </list>
              </operator>
              <connect from_op="Retrieve" from_port="output" to_op="Remove Useless Attributes" to_port="example set input"/>
              <connect from_op="Remove Useless Attributes" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
              <connect from_op="Extract Macro" from_port="example set" to_op="Provide Macro as Log Value (2)" to_port="through 1"/>
              <connect from_op="Provide Macro as Log Value (2)" from_port="through 1" to_op="Log" to_port="through 1"/>
              <connect from_op="Log" from_port="through 1" to_port="out 1"/>
              <portSpacing port="source_repository object" spacing="0"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Loop Repository" from_port="out 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
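
    The macro name "Golf" in the process above is only a label chosen in the entry_name_macro parameter; on each iteration it holds the name of the current entry, not necessarily the Golf data set. A hedged sketch of the same idea with a more descriptive macro name follows. The name entry_name is an illustrative choice, and this is a fragment rather than a complete process: layout attributes are left out, and the remaining operators and connections are assumed to stay as they are above.

    ```xml
    <operator activated="true" class="loop_repository" compatibility="5.1.017" expanded="true" name="Loop Repository">
      <parameter key="repository_folder" value="//Samples/data/"/>
      <parameter key="entry_type" value="IOObject"/>
      <!-- Illustrative macro name; replaces "Golf" from the process above -->
      <parameter key="entry_name_macro" value="entry_name"/>
      <process expanded="true">
        <operator activated="true" class="retrieve" compatibility="5.1.017" expanded="true" name="Retrieve">
          <!-- The macro is replaced by the name of the current entry on each iteration -->
          <parameter key="repository_entry" value="//Samples/data/%{entry_name}"/>
        </operator>
        <!-- Remaining operators and connections as in the process above -->
      </process>
    </operator>
    ```

    The folder path still appears twice (in repository_folder and in repository_entry), so both values have to be changed together when pointing the loop at another folder.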
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    wessel wrote: "This process has no errors, although it is a bit weird that you retrieve each dataset twice:"

    Nope, it works like a charm. Don't get confused by the first two data sets: they differ only in that one has a label and the other has none. So there are indeed two copies of Golf in the sample repository.

    By the way: those are not "errors" but "potential problems", as stated at the top of the "Problem" view. And indeed the meta data is underspecified, so it cannot be guaranteed that the process will run without actually executing it :P

    Cheers,
    Ingo