Tab separated value import issue with header in file

GregMGregM Member Posts: 18 Maven
edited November 2018 in Help
I have years worth of data saved as tab separated values, unfortunately, each file also has the following header, that I can not figure out how to work with.  No matter what I have tried I can not get the following rows of data to import, it always gives an error of:  "an attribute is specified for column 1 but column 1 does not exist" and fails to recognize the dozens of columns of data below the header.  Manually removing the header enables the import wizard to work properly, but removing that manually is not really an option for thousands of datafiles.

Anyone have a good idea how to get past this issue?  I was planning on looping through the files in the folder structure, extracting the unit SN,PN, Date,Time from either the header or the file name, and combining it all together into one set of data to look at the population over time, but I am stuck at ostensibly the simplest part of the whole thing, how to automate the import of the text file.
Filename:	2014-02-06_0757_1-111-11_113541-1234.txt
Date: 2/6/2014
Time: 7:57 AM
UUT P/N: 1-111-11
UUT S/N: 113541-1234
Operator: XXXX
FTP Revision -
ATP Revision A
Medium Flowmeter Part Number: XXXXXXXXXXXXXXXXX
Medium Flowmeter Serial Number: 4546544546
Large Flowmeter Part Number: XXXXXXXXXXXXXXXXX
Large Flowmeter Serial Number: 1324564321
ATE Serial Number: 00

Answers

  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    there is no possibility to deal with headers out of the box, but there is a workaround consisting of several operators. Please have a look at the process below for an example, and be sure to set the last_example parameter of Filter Example Range to the number of header lines in your files.

    Best regards,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="subprocess" compatibility="6.0.002" expanded="true" height="76" name="Create Data" width="90" x="45" y="30">
            <process expanded="true">
              <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
                <parameter key="text" value="this is a header&#10;this is also a header line&#10;A&#9;B&#9;C&#10;1&#9;2&#9;3&#10;4&#9;5&#9;6&#10;"/>
              </operator>
              <operator activated="true" class="text:write_document" compatibility="5.3.002" expanded="true" height="76" name="Write Document" width="90" x="180" y="30"/>
              <connect from_op="Create Document" from_port="output" to_op="Write Document" to_port="document"/>
              <connect from_op="Write Document" from_port="file" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="read_csv" compatibility="6.0.002" expanded="true" height="60" name="Read CSV" width="90" x="179" y="30">
            <parameter key="column_separators" value="~"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="filter_example_range" compatibility="6.0.002" expanded="true" height="76" name="Filter Example Range" width="90" x="313" y="30">
            <parameter key="first_example" value="1"/>
            <parameter key="last_example" value="2"/>
            <parameter key="invert_filter" value="true"/>
          </operator>
          <operator activated="true" class="write_csv" compatibility="6.0.002" expanded="true" height="76" name="Write CSV" width="90" x="447" y="30">
            <parameter key="write_attribute_names" value="false"/>
            <parameter key="quote_nominal_values" value="false"/>
          </operator>
          <operator activated="true" class="read_csv" compatibility="6.0.002" expanded="true" height="60" name="Read CSV (2)" width="90" x="581" y="30">
            <parameter key="csv_file" value="C:\Users\mhelf\AppData\Local\Temp\rm_file_952429460995993752.dump"/>
            <parameter key="column_separators" value="&#9;"/>
            <parameter key="use_quotes" value="false"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
            </list>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <connect from_op="Create Data" from_port="out 1" to_op="Read CSV" to_port="file"/>
          <connect from_op="Read CSV" from_port="output" to_op="Filter Example Range" to_port="example set input"/>
          <connect from_op="Filter Example Range" from_port="example set output" to_op="Write CSV" to_port="input"/>
          <connect from_op="Write CSV" from_port="file" to_op="Read CSV (2)" to_port="file"/>
          <connect from_op="Read CSV (2)" from_port="output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • GregMGregM Member Posts: 18 Maven
    Interesting, I will have to try that, however now I think there is something even more odd going on.

    If I open the file in excel and specify tab delimited, it pulls in fine.  if I now save it back as a tab delimited file, rapidminer recognizes the tabs in the data section of the document and properly imports the data in the same manner as I've had it do on other header filled files many times before.  Somehow the tab delimited text file that is automatically written by labview is different than the tab delimited text file as saved by excel.

    I checked the tabs, both files specify the same ascii character code for the tab, so I am a bit stumped....
Sign In or Register to comment.