"[SOLVED] Questions about reading CSV files"

huaiyanggongzi Member Posts: 39 Contributor II
edited June 2019 in Help
When using the “Read CSV” operator to import a CSV file, I have the following problem.

If a given cell contains a “,”, the words following it are not read. I think this is because “,” is used as the column separator. But in this case, “,” is just a character that appears in a string. How can I make RapidMiner skip this “,” inside the string?

The following is the test CSV file, which contains just one row with two columns. The main content is just a text string. In the generated word list, we can see that the word "what" was not read because of the "," appearing before it.
ID Text Field
1 wow <Content>, what charm!

The following is the process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
 <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="read_csv" compatibility="5.3.008" expanded="true" height="60" name="Read CSV" width="90" x="45" y="165">
       <parameter key="csv_file" value="C:\Users\LocalRepository\Source_Data\test3.csv"/>
       <parameter key="column_separators" value=","/>
       <parameter key="first_row_as_names" value="false"/>
       <list key="annotations">
         <parameter key="0" value="Name"/>
       </list>
       <parameter key="encoding" value="GBK"/>
       <list key="data_set_meta_data_information">
         <parameter key="0" value="ID.true.integer.id"/>
         <parameter key="1" value="Text Field.true.text.attribute"/>
       </list>
     </operator>
     <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="120">
       <parameter key="vector_creation" value="Binary Term Occurrences"/>
       <list key="specify_weights"/>
       <process expanded="true">
         <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (3)" width="90" x="179" y="120"/>
         <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (3)" width="90" x="313" y="120"/>
         <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (3)" width="90" x="447" y="120">
           <parameter key="min_chars" value="1"/>
           <parameter key="max_chars" value="200"/>
         </operator>
         <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="255"/>
         <connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
         <connect from_op="Tokenize (3)" from_port="document" to_op="Transform Cases (3)" to_port="document"/>
         <connect from_op="Transform Cases (3)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
         <connect from_op="Filter Tokens (3)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
         <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <connect from_op="Read CSV" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
     <connect from_op="Process Documents from Data" from_port="word list" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>


    Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,996 RM Engineering

    When ',' is not used to separate columns, you can simply change the column separator in the operator parameters to the character that actually separates the columns in your data. If you only have one column, you can set it to whatever you like (but make sure it never appears in your text). If you have something like "my, text" , 123
    then you can keep ',' as the column separator character, but you have to set " as the quote character. Separator characters that appear between quote characters are not treated as separators and are kept as part of the text.
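To illustrate the quoting rule outside RapidMiner, here is a minimal sketch using Python's standard `csv` module as a stand-in for the Read CSV parser (an assumption for illustration; RapidMiner's own parser may differ in details such as whitespace handling). The helper name `parse_rows` is hypothetical. It shows that a ',' inside quote characters stays part of the field, while the same text without quotes is split into an extra column.

```python
import csv
import io


def parse_rows(text, quotechar='"'):
    """Parse CSV text using ',' as the separator and the given quote character."""
    return list(csv.reader(io.StringIO(text), delimiter=",", quotechar=quotechar))


# The same record, once with the text field quoted and once without quotes.
quoted = '"my, text",123\n'
unquoted = 'my, text,123\n'

quoted_rows = parse_rows(quoted)      # the comma inside the quotes is kept in the field
unquoted_rows = parse_rows(unquoted)  # the comma splits the text into two fields

print(quoted_rows)
print(unquoted_rows)
```

With quotes the row has two fields; without them it has three, which is the behaviour described in the question.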

    huaiyanggongzi Member Posts: 39 Contributor II
    Marco, thanks.

    Suppose I have several columns that still use “,” as the column separator (because the file was generated as a CSV file). However, some cell entries contain strings like ABC,DEF
    How do I handle this kind of scenario? Do I have to modify the CSV file and wrap every such entry in quotes, e.g. change ABC,DEF to “ABC,DEF”?
    Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,996 RM Engineering

    Yes. A CSV file that uses ',' both inside strings and as the separator character, without quoting the strings, is syntactically invalid. Such a file cannot be read correctly unless the strings are wrapped in quote characters, so that the parser knows what is a separator and what is part of a literal.
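If the file is produced by your own script, the quotes can be added at export time rather than by hand. The following is a minimal sketch, assuming the data is available as rows in Python; `write_quoted` and the sample rows are hypothetical names and values chosen for illustration. With `csv.QUOTE_MINIMAL`, the standard `csv` writer wraps only those fields that contain the separator (or a quote character) in quotes.

```python
import csv
import io


def write_quoted(rows):
    """Serialize rows to CSV text, quoting only the fields that need it."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote a field only if it contains ',' or '"'
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data with commas inside some cells.
rows = [["1", "ABC,DEF", "xyz"], ["2", "plain", "GHI,JKL"]]
csv_text = write_quoted(rows)
print(csv_text)

# Reading the text back with the same quote character restores the original cells.
round_trip = list(csv.reader(io.StringIO(csv_text), delimiter=",", quotechar='"'))
```

A file written this way should then be readable with ',' as the column separator and " as the quote character.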

    JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Actually, if the file has only two columns, e.g.

    1,wow <Content>, what charm!

    I think you might be able to read in the data using RegEx (you could certainly also use RegEx and Notepad++ to clean it up).
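One way the RegEx clean-up could look, sketched in Python rather than Notepad++ (an assumption; the same pattern idea should carry over to other regex tools). It relies on the file having exactly two columns and on the first column never containing a comma, so everything after the first ',' is treated as the text field. The names `LINE_PATTERN` and `requote_two_column_lines` are hypothetical.

```python
import csv
import io
import re

# Group 1: everything before the first comma (the ID).
# Group 2: the rest of the line, including any further commas (the text).
LINE_PATTERN = re.compile(r"^([^,]*),(.*)$")


def requote_two_column_lines(lines):
    """Rewrite two-column lines so that the text column is quoted where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for line in lines:
        match = LINE_PATTERN.match(line.rstrip("\r\n"))
        if match is None:
            raise ValueError(f"line has no column separator: {line!r}")
        writer.writerow([match.group(1), match.group(2)])
    return buffer.getvalue()


fixed = requote_two_column_lines(["1,wow <Content>, what charm!"])
print(fixed)
```

The rewritten file can then be read with ',' as the separator and " as the quote character. This trick does not generalize to files where more than one column may contain commas.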