
"[SOLVED] Loop over files (extracting id from first line)"

earmijo Member Posts: 271 Unicorn
edited June 2019 in Help
Dear experts:

I have about 2,000 text files with the following structure:
First line: customer id followed by a colon
Next k lines: transaction data for that customer (x1,x2,x3,x4)

-----file01.txt------------------
01:
x11,x12,x13,x14
x21,x22,x23,x24
.....
xk1,xk2,xk3,xk4
-----file02.txt-------------------
02:
x11,x12,x13,x14
x21,x22,x23,x24
.....
xk1,xk2,xk3,xk4
-----file03.txt-------------------
03:
x11,x12,x13,x14
x21,x22,x23,x24
.....
xk1,xk2,xk3,xk4
-----file04.txt--------------------
04:
x11,x12,x13,x14
x21,x22,x23,x24
.....
xk1,xk2,xk3,xk4


What I would like is to merge them into a single file with the following columns

id,x1,x2,x3,x4

Is there an easy way to do it inside RapidMiner?

Thanks in advance,

\Ernesto

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Of course you can do it with RapidMiner :D Please have a look at the attached process. It uses a Loop Files operator to iterate over all files. For each file, it reads the content line by line, stores the id from the first line in a macro using Extract Macro, removes that first line, splits the remaining lines at the commas, and adds the id as a new attribute. An Append operator then merges the per-file results into a single example set.

    Best,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
        <process expanded="true" height="161" width="614">
          <operator activated="true" class="loop_files" compatibility="5.2.003" expanded="true" height="76" name="Loop Files" width="90" x="313" y="30">
            <parameter key="directory" value="C:\Users\mhelf\tmp\files"/>
            <parameter key="filter" value=".*\.txt"/>
            <process expanded="true" height="619" width="1128">
              <operator activated="true" class="read_csv" compatibility="5.2.003" expanded="true" height="60" name="Read CSV" width="90" x="112" y="30">
                <parameter key="csv_file" value="C:\Users\mhelf\tmp\files\file1.txt"/>
                <parameter key="column_separators" value=":"/>
                <parameter key="first_row_as_names" value="false"/>
                <list key="annotations"/>
                <parameter key="encoding" value="windows-1252"/>
                <list key="data_set_meta_data_information">
                  <parameter key="0" value="att1.true.polynominal.attribute"/>
                </list>
              </operator>
              <operator activated="true" class="extract_macro" compatibility="5.2.003" expanded="true" height="60" name="Extract Macro" width="90" x="246" y="30">
                <parameter key="macro" value="fileId"/>
                <parameter key="macro_type" value="data_value"/>
                <parameter key="attribute_name" value="att1"/>
                <parameter key="example_index" value="1"/>
              </operator>
              <operator activated="true" class="extract_macro" compatibility="5.2.003" expanded="true" height="60" name="Extract Macro (2)" width="90" x="380" y="30">
                <parameter key="macro" value="numRows"/>
              </operator>
              <operator activated="true" class="filter_example_range" compatibility="5.2.003" expanded="true" height="76" name="Filter Example Range" width="90" x="514" y="30">
                <parameter key="first_example" value="2"/>
                <parameter key="last_example" value="%{numRows}"/>
              </operator>
              <operator activated="true" class="split" compatibility="5.2.003" expanded="true" height="76" name="Split" width="90" x="648" y="30"/>
              <operator activated="true" class="generate_attributes" compatibility="5.2.003" expanded="true" height="76" name="Generate Attributes" width="90" x="782" y="30">
                <list key="function_descriptions">
                  <parameter key="fileId" value="%{fileId}"/>
                </list>
              </operator>
              <connect from_port="file object" to_op="Read CSV" to_port="file"/>
              <connect from_op="Read CSV" from_port="output" to_op="Extract Macro" to_port="example set"/>
              <connect from_op="Extract Macro" from_port="example set" to_op="Extract Macro (2)" to_port="example set"/>
              <connect from_op="Extract Macro (2)" from_port="example set" to_op="Filter Example Range" to_port="example set input"/>
              <connect from_op="Filter Example Range" from_port="example set output" to_op="Split" to_port="example set input"/>
              <connect from_op="Split" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="5.2.003" expanded="true" height="76" name="Append" width="90" x="447" y="30"/>
          <connect from_op="Loop Files" from_port="out 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
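For readers who want to check the logic outside RapidMiner, the same steps (loop over the files, take the id from the first line, drop that line, split the rest at the commas, append everything) can be sketched in Python. This is only an illustrative sketch, not part of the attached process: the function names `merge_customer_files` and `write_merged`, the `*.txt` glob pattern, and the output header are assumptions, and the `windows-1252` encoding is copied from the Read CSV settings above.

```python
import csv
from pathlib import Path


def merge_customer_files(directory, pattern="*.txt"):
    """Collect rows of the form [id, x1, x2, ...] from every matching file.

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    rows = []
    for path in sorted(Path(directory).glob(pattern)):
        lines = path.read_text(encoding="windows-1252").splitlines()
        if not lines:
            continue  # skip empty files
        # The first line holds the customer id followed by a colon, e.g. "01:".
        file_id = lines[0].strip().rstrip(":")
        # Every remaining non-blank line is one comma-separated transaction.
        for line in lines[1:]:
            if line.strip():
                values = [value.strip() for value in line.split(",")]
                rows.append([file_id] + values)
    return rows


def write_merged(rows, out_path, header=("id", "x1", "x2", "x3", "x4")):
    """Write the merged rows to a single CSV file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
```

A call such as `write_merged(merge_customer_files("some/folder"), "merged.csv")` would produce one file with the columns id,x1,x2,x3,x4. Note that this sketch keeps the id as text (for example "01"), whereas the Generate Attributes step in the process may evaluate it as a number.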
  • earmijo Member Posts: 271 Unicorn
    Thank you, Marius. I think it is going to take me a couple of days to understand the process :-) but it works beautifully.