[SOLVED] Creating data sets controlled by the content of a file

Kirth Member Posts: 2 Contributor I
edited November 2018 in Help

I am trying to accomplish the following task in RapidMiner: starting from a data table stored in the repository,
I want to create new data tables depending on the entries in a file.
In the simplest case, the file contains several lines, each of which holds
a list of attribute names from the data table. I want to read the file line by line
and, for each line, create a new data table that consists only of those columns of the original table
that are named in the current line.

Any hints on how to do that?


  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Kirth,

    Below I have pasted a process that does what you need. Replace the subprocess "Generate Fake File Data" with a Read CSV operator that reads the file containing the attributes you want to select. Be sure to choose a column delimiter that does NOT occur in the file, so that Read CSV creates an example set with exactly one attribute, holding one attribute list per example.

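    I assumed that the attribute names are separated by commas in your file. The Replace operator converts each comma into the alternation ("or") character (|) of regular expressions, so a line such as att1,att2,att5 becomes att1|att2|att5. Inside the Loop Examples operator, Extract Macro stores that expression in the macro selection_regex, and Select Attributes uses it to pick the matching attributes from the recalled original data.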
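
    For readers who want to see the logic outside of RapidMiner, here is a minimal Python sketch of the same idea. It is not part of the process itself: the names original_data, file_lines, line_to_regex and select_columns are made up for illustration, the table is modelled as a plain dict of columns, and it assumes that an attribute is selected only when its whole name matches the expression.

```python
import re

def line_to_regex(line):
    # Mirror the Replace step: every comma becomes the regex alternation character.
    return line.strip().replace(",", "|")

def select_columns(table, regex):
    # Mirror Select Attributes: keep columns whose full name matches the expression.
    pattern = re.compile(regex)
    return {name: values for name, values in table.items() if pattern.fullmatch(name)}

# Hypothetical stand-ins for the stored data table and the lines of the file.
original_data = {f"att{i}": [i, i * 10] for i in range(1, 6)}
file_lines = ["att1,att2,att5", "att2,att4"]

# Mirror Loop Examples: one new table per line of the file.
subsets = [select_columns(original_data, line_to_regex(line)) for line in file_lines]

for line, subset in zip(file_lines, subsets):
    print(line, "->", list(subset))
```

    Each entry of subsets corresponds to one output of the loop, in the order of the lines in the file.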

    This process assumes that your attribute names don't contain any characters that have a special meaning in regular expressions (for example . ( ) [ ] + * ? or |).
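
    To illustrate why that assumption matters, the following Python sketch compares the plain comma-to-| conversion with a variant that escapes each name first. It only demonstrates the regular expression issue and is not a RapidMiner feature; the column names and the helper line_to_safe_regex are hypothetical.

```python
import re

def line_to_safe_regex(line):
    # Split the line into names and escape each one, so that characters
    # like "(" or "." are matched literally instead of acting as regex syntax.
    names = [name.strip() for name in line.split(",") if name.strip()]
    return "|".join(re.escape(name) for name in names)

# Hypothetical column names, two of which contain regex metacharacters.
columns = ["price(usd)", "priceusd", "a.b", "axb"]
line = "price(usd),a.b"

naive_regex = line.replace(",", "|")
safe_regex = line_to_safe_regex(line)

naive_selected = [c for c in columns if re.fullmatch(naive_regex, c)]
safe_selected = [c for c in columns if re.fullmatch(safe_regex, c)]

print("naive:", naive_selected)
print("safe: ", safe_selected)
```

    In the naive variant the parentheses act as a group and the dot matches any character, so the wrong columns are selected; the escaped variant selects exactly the two names listed in the line.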

    Best regards,
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.014">
      <operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
        <process expanded="true" height="477" width="706">
          <operator activated="true" class="generate_nominal_data" compatibility="5.1.014" expanded="true" height="60" name="Generate Nominal Data" width="90" x="45" y="30"/>
          <operator activated="true" class="remember" compatibility="5.1.014" expanded="true" height="60" name="Remember" width="90" x="179" y="30">
            <parameter key="name" value="original_data"/>
            <parameter key="io_object" value="ExampleSet"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="5.1.014" expanded="true" height="76" name="Generate Fake File Data" width="90" x="45" y="120">
            <process expanded="true" height="477" width="706">
              <operator activated="true" class="generate_data_user_specification" compatibility="5.1.014" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="112" y="30">
                <list key="attribute_values">
                  <parameter key="selected_attributes" value="&quot;att1,att2,att5&quot;"/>
                </list>
                <list key="set_additional_roles"/>
              </operator>
              <operator activated="true" class="generate_data_user_specification" compatibility="5.1.014" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="112" y="120">
                <list key="attribute_values">
                  <parameter key="selected_attributes" value="&quot;att2,att4&quot;"/>
                </list>
                <list key="set_additional_roles"/>
              </operator>
              <operator activated="true" class="append" compatibility="5.1.014" expanded="true" height="94" name="Append" width="90" x="315" y="30"/>
              <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
              <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
              <connect from_op="Append" from_port="merged set" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="replace" compatibility="5.1.014" expanded="true" height="76" name="Replace" width="90" x="179" y="120">
            <parameter key="replace_what" value=","/>
            <parameter key="replace_by" value="\|"/>
          </operator>
          <operator activated="true" class="loop_examples" compatibility="5.1.014" expanded="true" height="94" name="Loop Examples" width="90" x="313" y="120">
            <process expanded="true" height="477" width="706">
              <operator activated="true" class="extract_macro" compatibility="5.1.014" expanded="true" height="60" name="Extract Macro" width="90" x="45" y="30">
                <parameter key="macro" value="selection_regex"/>
                <parameter key="macro_type" value="data_value"/>
                <parameter key="attribute_name" value="selected_attributes"/>
                <parameter key="example_index" value="%{example}"/>
              </operator>
              <operator activated="true" class="recall" compatibility="5.1.014" expanded="true" height="60" name="Recall" width="90" x="45" y="120">
                <parameter key="name" value="original_data"/>
                <parameter key="io_object" value="ExampleSet"/>
                <parameter key="remove_from_store" value="false"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="5.1.014" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="120">
                <parameter key="attribute_filter_type" value="regular_expression"/>
                <parameter key="regular_expression" value="%{selection_regex}"/>
              </operator>
              <connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
              <connect from_op="Extract Macro" from_port="example set" to_port="example set"/>
              <connect from_op="Recall" from_port="result" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_port="output 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Generate Nominal Data" from_port="output" to_op="Remember" to_port="store"/>
          <connect from_op="Generate Fake File Data" from_port="out 1" to_op="Replace" to_port="example set input"/>
          <connect from_op="Replace" from_port="example set output" to_op="Loop Examples" to_port="example set"/>
          <connect from_op="Loop Examples" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • Kirth Member Posts: 2 Contributor I
    Thanks Marius!

    The code works well and as far as I see can also be extended
    to the other types of the data generation process that I have to
