how to import multiple files

niccay Member Posts: 4 Contributor I
edited November 2018 in Help
Hi,

is there a way to import multiple files at once? I've got about 70 .csv files with the same schema that I want to import into a RapidMiner repository. So far I can't figure out how to do this without any user interaction :-/

My quick-and-dirty workaround is a little Ruby script that
first) reads all filenames in a given directory and
second) creates a RapidMiner project file containing lots of Read CSV and Store operators.

I guess that's not the way you're meant to import multiple files :)

Answers

  • StaryVena Member Posts: 126 Contributor II
    Hi,
    you can use the "Loop Files" operator. Then you will probably need the "Append" operator to merge all example sets from the collection into one.

    Best,
    Vaclav
  • mgriffith Member Posts: 1 Contributor I
    Can you post a simple example that reads 3 CSV files from a single directory, appends them into one file, and stores it?
  • earmijo Member Posts: 270 Unicorn
    This process reads three CSV files from the directory /Users/Ernesto/Desktop/files and appends them.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="400" width="659">
          <operator activated="true" class="loop_files" compatibility="5.2.008" expanded="true" height="76" name="Loop Files" width="90" x="246" y="30">
            <parameter key="directory" value="/Users/Ernesto/Desktop/files"/>
            <process expanded="true" height="418" width="677">
              <operator activated="true" class="read_csv" compatibility="5.2.008" expanded="true" height="60" name="Read CSV" width="90" x="112" y="30">
                <parameter key="csv_file" value="/Users/Ernesto/Desktop/files/file01.csv"/>
                <parameter key="column_separators" value=","/>
                <parameter key="first_row_as_names" value="false"/>
                <list key="annotations">
                  <parameter key="0" value="Name"/>
                </list>
                <parameter key="encoding" value="MacRoman"/>
                <list key="data_set_meta_data_information">
                  <parameter key="0" value="y.true.integer.attribute"/>
                  <parameter key="1" value="x.true.integer.attribute"/>
                </list>
              </operator>
              <connect from_port="file object" to_op="Read CSV" to_port="file"/>
              <connect from_op="Read CSV" from_port="output" to_port="out 1"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="5.2.008" expanded="true" height="76" name="Append" width="90" x="447" y="30"/>
          <connect from_op="Loop Files" from_port="out 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
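
For anyone who wants to sanity-check the merged result outside RapidMiner, the same loop-and-append idea can be sketched in a few lines of Python. This is only an illustration, not part of the process above: the function name `append_csv_files` is made up, and the sketch assumes every file starts with a header row and has the same columns.

```python
import csv
import tempfile
from pathlib import Path


def append_csv_files(directory):
    """Read every .csv file in `directory` and return all rows as one list.

    Assumes each file starts with a header row and all files share it.
    """
    merged = []
    # Sort so the result does not depend on the order the OS lists files in.
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="") as handle:
            merged.extend(csv.DictReader(handle))
    return merged


# Usage: three tiny files with the columns y and x, written to a temp folder.
with tempfile.TemporaryDirectory() as folder:
    for i in range(1, 4):
        Path(folder, f"file0{i}.csv").write_text(f"y,x\n{i},{i * 10}\n")
    rows = append_csv_files(folder)

print(len(rows))  # 3 rows, one from each file
print(rows[0])
```

All values come back as strings here; in the RapidMiner process the column types are set by the Read CSV meta data instead.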
  • underlines Member Posts: 1 Contributor I

    Create the following process:

     

    Loop Files -> Append -> Write CSV

     

    Click on Loop Files and set the parameter directory to the folder that contains your CSV files (or other files readable by RapidMiner).

    Double click on Loop Files to go into this sub-process.

     

    Create the following sub process:

    fil -> Read CSV -> Select Attributes -> out

     


    "fil" and "out" are not Operator objects, they are the connectors on the left and right border of the window that look like knobs.

    Click on Select Attributes and set the parameter attribute filter type to either subset or regular_expression.

    For subset, click on the Select Attributes... button, and add the attributes (columns) of your CSVs that you want to have in your merged output. Add them in the right list of the window by typing the name and clicking the plus icon.

    For regular_expression you can define a list of attributes (columns) like this: .*attribute1.*|.*attribute2.*|.*attribute3.*

    example:

     

    .*mail.*|.*date_submitted.*|.*page_url.*

    Then you are done.
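
To see which columns a filter like this would keep before running the whole process, the pattern can be tried on a list of column names. The following is a rough Python sketch, not RapidMiner code: the helper `select_attributes` and the extra column names are hypothetical, and it assumes the filter has to match the whole attribute name, which is what the `.*` on both sides of each name suggests.

```python
import re


def select_attributes(column_names, pattern):
    """Return the column names that the pattern matches in full."""
    regex = re.compile(pattern)
    return [name for name in column_names if regex.fullmatch(name)]


# Hypothetical header of one of the CSV files.
columns = ["id", "mail", "date_submitted", "page_url", "referrer", "user_agent"]
pattern = ".*mail.*|.*date_submitted.*|.*page_url.*"

selected = select_attributes(columns, pattern)
print(selected)  # ['mail', 'date_submitted', 'page_url']
```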

     

     

    Merge columns with different names into the same column (attribute):

    In case you have columns in your CSV with different naming, like: e-mail, eMail, e_mail you can do the following:

    In your existing Select Attributes operator choose regular_expression. Define a regular expression that covers all the columns you want, including the variants. If I have the following columns:

    • E_mail
    • e-Mail
    • Mail
    • date_submitted
    • name

    I would create the following regular_expression:

     

     

    .*[Mm]ail.*|.*date_submitted.*|.*name.*

    This will still create an output with separate columns (attributes) for each variant. To merge the 3 email columns into one, you have to rename them so that they are identical. Add a Rename by Replacing operator after the already existing Select Attributes.

     

    On the Rename by Replacing Object select regular_expression as the attribute filter type. Then fill out the fields below like:

    regular expression: .*[Mm]ail.*

    replace what: .*

    replace by: mail
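
Regular expressions are case-sensitive by default, so it is worth checking which of the column names a pattern really catches. The sketch below is plain Python rather than RapidMiner, with made-up helper names (`matching`, `rename_by_replacing`); it compares `.*mail.*` with `.*[Mm]ail.*` on the example columns and then renames the matches, assuming the filter has to match the whole attribute name.

```python
import re

columns = ["E_mail", "e-Mail", "Mail", "date_submitted", "name"]


def matching(pattern, names):
    """Names that the pattern matches in full (case-sensitive)."""
    return [n for n in names if re.fullmatch(pattern, n)]


def rename_by_replacing(names, attribute_filter, replace_what, replace_by):
    """Apply the replacement only to names selected by attribute_filter."""
    return [
        re.sub(replace_what, replace_by, n) if re.fullmatch(attribute_filter, n) else n
        for n in names
    ]


lowercase_only = matching(".*mail.*", columns)
either_case = matching(".*[Mm]ail.*", columns)
renamed = rename_by_replacing(columns, ".*[Mm]ail.*", ".+", "mail")

print(lowercase_only)  # ['E_mail']
print(either_case)     # ['E_mail', 'e-Mail', 'Mail']
print(renamed)         # ['mail', 'mail', 'mail', 'date_submitted', 'name']
```

The sketch replaces `.+` rather than `.*`: in recent Python versions `.*` also matches the empty string left at the end of the name, so the replacement text would be inserted twice. Whether RapidMiner behaves the same way is not confirmed here, so it is worth checking the attribute names in the result view.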

     

    I attached a gif of my process, to clarify.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Very nice explanation, @underlines!

     

    Scott

     

     

  • penmen Member Posts: 2 Contributor I
    Great answer @underlines!

    Could someone please suggest what to do if the columns being imported don't have column headers?
    In that case, what should be entered as the filter for Select Attributes in the sub-process described above?
  • rjones13 Member Posts: 145 Unicorn
    Hi @penmen,

    When you import a file without headers, RapidMiner assigns the default names att1, att2, att3 and so on. If the order of the attributes is always the same, this doesn't cause a problem: you can just select the attributes you need (e.g. att1 and att3) and then rename them after the Append step described above. If there isn't a consistent order, you will probably have to implement some more complicated logic to assign the correct attribute names prior to the Select Attributes step.
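
The select-then-rename idea for headerless files can be sketched as follows. This is a Python illustration under stated assumptions, not RapidMiner code: the helper names, the sample rows, and the target names `id` and `date_submitted` are all made up, and the att1, att2, ... naming simply mimics the default names mentioned above.

```python
import csv
import io


def read_headerless(text):
    """Parse CSV text without a header row, naming columns att1, att2, ..."""
    return [
        {f"att{i}": value for i, value in enumerate(record, start=1)}
        for record in csv.reader(io.StringIO(text))
    ]


def select(rows, keep):
    """Keep only the listed columns in every row."""
    return [{name: row[name] for name in keep} for row in rows]


def rename(rows, new_names):
    """Rename columns according to new_names; other columns stay as they are."""
    return [{new_names.get(k, k): v for k, v in row.items()} for row in rows]


# Two headerless "files" with the same column order (hypothetical content).
file_a = "1,alice,2018-11-01\n2,bob,2018-11-02\n"
file_b = "3,carol,2018-11-03\n"

# Select inside the loop, append the results, then rename once at the end.
selected = [select(read_headerless(text), ["att1", "att3"]) for text in (file_a, file_b)]
appended = [row for rows in selected for row in rows]
result = rename(appended, {"att1": "id", "att3": "date_submitted"})

print(result[0])  # {'id': '1', 'date_submitted': '2018-11-01'}
```

This only works because both inputs have the same column order; with a varying order, att1 would mean something different from file to file.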

    Hope this helps,

    Best,

    Roland
  • penmen Member Posts: 2 Contributor I
    Thanks @rjones13!
    I was able to resolve my issue above by first setting 'all attributes' in Select Attributes.
    Then, after Rename by Generic Names, I was able to select any required column with 'Select Attributes' by providing the concatenated string of my generic name + column number.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,
    another option is to use the "data set meta data information" parameter of the Read CSV operator. It lets you assign a name to the column at a given index.
    If you run the wizard, it is configured automatically. It's good practice to first run the wizard on one file and then copy and paste the configured operator into your loop.
    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • nataliarelish Member Posts: 4 Learner I
    Hello there,
    In my experience, it works best to first put all the CSV files in a single folder. Then open RapidMiner, go to the Repository, choose Import Data from the toolbar, and select Import Data from Files. Organize the schema, make sure the option is selected, and then confirm the settings.