Getting rid of the first three lines of a csv file (wizard)

frangonve · September 2013

Hello,

The csv file to be imported has three useless lines (rows) at the top.

In the fourth line attribute names are found.
From the fifth line onwards numeric values are found.

While I can select only the useful data in the first window of the wizard with the mouse, this selection is not used by the wizzard, as next wizard windows still contain the useless data at the top and due to that I can't get the right attributes.

This behaviour contrasts with the excel wizard where I can effectively select and use the rows that I want.

Is there another way to get rid of these three first lines?

Cheers

Francisco

Marco_Boeck · September 2013

Hi,

you could manually add a comment symbol as the first character in the first 3 lines, e.g. '#'. That way these lines would be skipped.
However you are correct that the functionality differs from the excel import wizard, and I have created an internal ticket for the issue.

Regards,
Marco

frangonve · September 2013

Thanks for your answer,
Do you think that a new versión can change this behaviour?
Cheers

Francisco

Marco_Boeck · September 2013

Hi,

somewhere down the road, yes. However I fear it is not near the top of our priority list.

Regards,
Marco

MariusHelf · September 2013

Francisco,

in the 3rd step of the wizard you can specify the Comment annotation for the first 3 rows. That should get the work done. However I agree that using the selection as in the Excel operator is way more intuitive.

Best regards,
Marius

frangonve · September 2013

As the CSV files can't be edited manually, due to the amount of them, I can write an external macro to process them.

Thanks

Francisco

MariusHelf · September 2013

Francisco, what about the annotation approach outlined in my previous post? Does it not work?

JEdward · September 2013

Hi Marius,

I thought of the annotation method when reading this post the other day, but when trying it out found that it wasn't working on a simple CSV example with 3 lines at the beginning to ignore followed by a heading row. Do you get problems on a CSV containing the following data?

Some text here
Some text here
Some text here
Att1,Att2,Att3,Att4
1,Apple,6,8
2,Orange,4.4,3
3,Banana,2.2,11

MariusHelf · September 2013

Try the process below. The regular expression in replace attributes performs some magic to remove the first three rows of the data set.
Replace the Create Document operator with a Read Document operator to read a csv file from disk.

Concerning the problems with your example file I have created an internal ticket.

Best regards,
Marius

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="5.3.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
        <parameter key="text" value="Some text here&#10;Some text here&#10;Some text here&#10;Att1,Att2,Att3,Att4&#10;1,Apple,6,8&#10;2,Orange,4.4,3&#10;3,Banana,2.2,11&#10;"/>
      </operator>
      <operator activated="true" class="text:replace_tokens" compatibility="5.3.001" expanded="true" height="60" name="Replace Tokens" width="90" x="179" y="30">
        <list key="replace_dictionary">
          <parameter key="(?ms)([^\n]+)\n([^\n]+)\n([^\n]+)\n(.*)" value="$4"/>
        </list>
      </operator>
      <operator activated="true" class="text:combine_documents" compatibility="5.3.001" expanded="true" height="76" name="Combine Documents" width="90" x="313" y="30"/>
      <operator activated="true" class="text:write_document" compatibility="5.3.001" expanded="true" height="76" name="Write Document" width="90" x="447" y="30">
        <parameter key="file" value="C:\Users\mhelf\Desktop\test2.csv"/>
      </operator>
      <operator activated="false" class="text:process_documents" compatibility="5.3.001" expanded="true" height="76" name="Process Documents" width="90" x="179" y="165">
        <process expanded="true">
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="read_csv" compatibility="5.3.013" expanded="true" height="60" name="Read CSV" width="90" x="581" y="30">
        <parameter key="column_separators" value=","/>
        <parameter key="skip_comments" value="true"/>
        <list key="annotations"/>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Replace Tokens" to_port="document"/>
      <connect from_op="Replace Tokens" from_port="document" to_op="Combine Documents" to_port="documents 1"/>
      <connect from_op="Combine Documents" from_port="document" to_op="Write Document" to_port="document"/>
      <connect from_op="Write Document" from_port="file" to_op="Read CSV" to_port="file"/>
      <connect from_op="Read CSV" from_port="output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Getting rid of the first three lines of a csv file (wizard)

Answers