Read CSV - How to use quotes as escape character for quotes

LearnAW Member Posts: 4 Newbie
edited May 2020 in Help
I'm reading a series of CSV data files that are comma-separated and use quotes around fields.  Within a data line, a quote character that appears inside a field is written as two consecutive quotes, i.e., the quote character serves as the escape character for itself.  An example line could be:
  "News Alert","Mon, 13 May 2019 08:29:58","""NEWS OFFICE"" <newsoffice@spamdude.com>"
which it SHOULD interpret as 3 fields as follows: 
  (1) News Alert  (2) Mon, 13 May 2019 08:29:58 (3) "NEWS OFFICE" <newsoffice@spamdude.com>

I'm using the Read CSV operator, with "use quotes" checked and using quotes as both the quotes character and escape character.  The result is that it not only doesn't read the line correctly, it completely skips reading any line that has the double-quotes in it.  My operator XML is as follows:

          <operator activated="true" class="read_csv" compatibility="9.0.003" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
            <parameter key="column_separators" value=","/>
            <parameter key="escape_character" value="&quot;"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
            <parameter key="read_not_matching_values_as_missings" value="false"/>

Is there a way to do this so it reads and interprets my example line properly, or do I have to preprocess all my data files with a Python script or something similar to replace the doubled quotes with some other escape character (like the default backslash) before ingesting them into RapidMiner?  Thanks for the help!
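
For reference, the preprocessing route mentioned above could look roughly like the sketch below. It relies on Python's standard csv module, which reads doubled quotes by default (doublequote=True), and rewrites each row so that embedded quotes are backslash-escaped instead. The function name convert_doubled_quotes is made up for illustration, and it is an assumption (not verified here) that Read CSV will then parse the output with its default backslash escape character.

```python
import csv
import io

def convert_doubled_quotes(src, dst, escape_char="\\"):
    """Rewrite CSV text so embedded quotes are escaped with escape_char
    instead of being doubled. src and dst are text file-like objects."""
    # Reader: treat "" inside a quoted field as one literal quote.
    reader = csv.reader(src, delimiter=",", quotechar='"', doublequote=True)
    # Writer: quote every field and escape embedded quotes with escape_char.
    writer = csv.writer(
        dst,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_ALL,
        doublequote=False,
        escapechar=escape_char,
        lineterminator="\n",
    )
    for row in reader:
        writer.writerow(row)

# The example line from the question.
line = '"News Alert","Mon, 13 May 2019 08:29:58","""NEWS OFFICE"" <newsoffice@spamdude.com>"\n'
out = io.StringIO()
convert_doubled_quotes(io.StringIO(line), out)
print(out.getvalue(), end="")
```

For real files, the same function could be called with two handles opened with newline="" (as the csv module documentation recommends), one for reading and one for writing.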

Best Answers

    LearnAW Member Posts: 4 Newbie
    Solution Accepted
    Thanks Sebastian and Marco, I appreciate your perspectives.  And yes, I think I was assuming (or hoping for) a level of sophistication in parsing that Read CSV doesn't seem to provide, i.e., a precedence would have to exist: first interpret a quote as the start of a string, and then interpret any subsequent doubled quote inside that string as an escaped quote.  If it's treating the escape character the same way across the entire input, then I see Marco's point about why this can't work.  It looks like I have some preprocessing to do.  If anyone else knows a way to avoid this and read it correctly with just RapidMiner, please enlighten me!  Thanks to all.
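
    To make the precedence idea above concrete, here is a minimal sketch of a state-based splitter in Python. It only illustrates the rule described in this post and says nothing about how Read CSV is actually implemented; the function name split_quoted_line is hypothetical, and the sketch handles a single line only (no multi-line fields).

```python
def split_quoted_line(line, sep=",", quote='"'):
    """Split one CSV line where a quote inside a quoted field is written
    as two consecutive quotes."""
    fields, buf, in_quotes = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < len(line) and line[i + 1] == quote:
                    # Doubled quote inside a string: one literal quote.
                    buf.append(quote)
                    i += 1
                else:
                    # Single quote inside a string: the string ends here.
                    in_quotes = False
            else:
                buf.append(ch)
        elif ch == quote:
            # Quote outside a string: a quoted field starts.
            in_quotes = True
        elif ch == sep:
            # Separator outside a string: the field is complete.
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    fields.append("".join(buf))
    return fields

example = '"News Alert","Mon, 13 May 2019 08:29:58","""NEWS OFFICE"" <newsoffice@spamdude.com>"'
for number, field in enumerate(split_quoted_line(example), start=1):
    print(number, field)
```

    The point is that the meaning of a quote depends on whether the parser is currently inside a quoted string, which is the precedence described above.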