Options

Getting rid of the first three lines of a csv file (wizard)

frangonvefrangonve Member Posts: 5 Contributor II
edited June 2019 in Help
Hello,

The csv file to be imported  has three useless lines (rows) at the top.

In the fourth line attribute names are found.
From the fifth line onwards numeric values are found.

While I can select only  the useful data in the first window of the wizard with the mouse, this selection is not used by the wizzard, as next wizard windows still contain the useless data at the top and due to that I can't get the right attributes.

This behaviour contrasts with the excel wizard where I can effectively select and use the rows that I want.

Is there another way to get rid of these three first lines?

Cheers

Francisco
Tagged:

Answers

  • Options
    Marco_BoeckMarco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,995 RM Engineering
    Hi,

    you could manually add a comment symbol as the first character in the first 3 lines, e.g. '#'. That way these lines would be skipped.
    However you are correct that the functionality differs from the excel import wizard, and I have created an internal ticket for the issue.

    Regards,
    Marco
  • Options
    frangonvefrangonve Member Posts: 5 Contributor II
    Thanks for your answer,
    Do you think that a new versión can change this behaviour?
    Cheers

    Francisco
  • Options
    Marco_BoeckMarco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,995 RM Engineering
    Hi,

    somewhere down the road, yes. However I fear it is not near the top of our priority list.

    Regards,
    Marco
  • Options
    MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Francisco,

    in the 3rd step of the wizard you can specify the Comment annotation for the first 3 rows. That should get the work done. However I agree that using the selection as in the Excel operator is way more intuitive.

    Best regards,
    Marius
  • Options
    frangonvefrangonve Member Posts: 5 Contributor II
    As the CSV files can't be edited manually, due to the amount of them, I can write an external macro to process them.

    Thanks

    Francisco


  • Options
    MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Francisco, what about the annotation approach outlined in my previous post? Does it not work?
  • Options
    JEdwardJEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Hi Marius,

    I thought of the annotation method when reading this post the other day, but when trying it out found that it wasn't working on a simple CSV example with 3 lines at the beginning to ignore followed by a heading row.  Do you get problems on a CSV containing the following data?

    Some text here
    Some text here
    Some text here
    Att1,Att2,Att3,Att4
    1,Apple,6,8
    2,Orange,4.4,3
    3,Banana,2.2,11
  • Options
    MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Try the process below. The regular expression in replace attributes performs some magic to remove the first three rows of the data set.
    Replace the Create Document operator with a Read Document operator to read a csv file from disk.

    Concerning the problems with your example file I have created an internal ticket.

    Best regards,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.013">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="5.3.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
            <parameter key="text" value="Some text here&#10;Some text here&#10;Some text here&#10;Att1,Att2,Att3,Att4&#10;1,Apple,6,8&#10;2,Orange,4.4,3&#10;3,Banana,2.2,11&#10;"/>
          </operator>
          <operator activated="true" class="text:replace_tokens" compatibility="5.3.001" expanded="true" height="60" name="Replace Tokens" width="90" x="179" y="30">
            <list key="replace_dictionary">
              <parameter key="(?ms)([^\n]+)\n([^\n]+)\n([^\n]+)\n(.*)" value="$4"/>
            </list>
          </operator>
          <operator activated="true" class="text:combine_documents" compatibility="5.3.001" expanded="true" height="76" name="Combine Documents" width="90" x="313" y="30"/>
          <operator activated="true" class="text:write_document" compatibility="5.3.001" expanded="true" height="76" name="Write Document" width="90" x="447" y="30">
            <parameter key="file" value="C:\Users\mhelf\Desktop\test2.csv"/>
          </operator>
          <operator activated="false" class="text:process_documents" compatibility="5.3.001" expanded="true" height="76" name="Process Documents" width="90" x="179" y="165">
            <process expanded="true">
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="read_csv" compatibility="5.3.013" expanded="true" height="60" name="Read CSV" width="90" x="581" y="30">
            <parameter key="column_separators" value=","/>
            <parameter key="skip_comments" value="true"/>
            <list key="annotations"/>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Replace Tokens" to_port="document"/>
          <connect from_op="Replace Tokens" from_port="document" to_op="Combine Documents" to_port="documents 1"/>
          <connect from_op="Combine Documents" from_port="document" to_op="Write Document" to_port="document"/>
          <connect from_op="Write Document" from_port="file" to_op="Read CSV" to_port="file"/>
          <connect from_op="Read CSV" from_port="output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
Sign In or Register to comment.