
Encoding problem

Regular Contributor

Encoding problem

This is a follow-up to this conversation, addressing a slightly different problem that had become quite messy.

 

A few problems were pinpointed, but the one causing the most trouble is the encoding problem. So here it is:

I loop through a number of text files (a sample is attached below) and convert them to a dataset, which looks fine. However, due to the encoding problem, everything gets messed up when I write and then read a CSV file. I have done some trial and error but cannot find a fix, and this is as far upstream as I can possibly go. Here is the XML:

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="concurrency:loop_files" compatibility="7.5.003" expanded="true" height="82" name="Loop Files" width="90" x="112" y="187">
        <parameter key="directory" value="C:\Users\alan.jeffares\Desktop\data2"/>
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="7.5.000" expanded="true" height="68" name="Read Document" width="90" x="313" y="34">
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <connect from_port="file object" to_op="Read Document" to_port="file"/>
          <connect from_op="Read Document" from_port="output" to_port="output 1"/>
          <portSpacing port="source_file object" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data" width="90" x="246" y="187">
        <parameter key="text_attribute" value="textspecial"/>
      </operator>
      <operator activated="true" class="write_csv" compatibility="7.5.003" expanded="true" height="82" name="Write CSV" width="90" x="447" y="187">
        <parameter key="csv_file" value="C:\Users\alan.jeffares\Documents\test3.csv"/>
      </operator>
      <operator activated="true" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="581" y="187">
        <parameter key="csv_file" value="C:\Users\alan.jeffares\Documents\test3.csv"/>
        <list key="annotations"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <connect from_op="Loop Files" from_port="output 1" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_op="Write CSV" to_port="input"/>
      <connect from_op="Read CSV" from_port="output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>


6 REPLIES
FBT
Super Contributor

Re: Encoding problem

I have tried your process, and what seems to throw everything off is the column separator in the "Write CSV" operator. Due to the structure of your input data, I don't believe this can be solved with the "Write CSV" operator. However, I have replaced the "Write CSV" and "Read CSV" operators with their Excel counterparts, and things now look as they are supposed to.

 

Try this and see if it behaves as you want:

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="concurrency:loop_files" compatibility="7.5.003" expanded="true" height="82" name="Loop Files" width="90" x="112" y="187">
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="7.3.000" expanded="true" height="68" name="Read Document" width="90" x="313" y="34">
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <connect from_port="file object" to_op="Read Document" to_port="file"/>
          <connect from_op="Read Document" from_port="output" to_port="output 1"/>
          <portSpacing port="source_file object" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="7.3.000" expanded="true" height="82" name="Documents to Data" width="90" x="246" y="187">
        <parameter key="text_attribute" value="textspecial"/>
      </operator>
      <operator activated="true" class="write_excel" compatibility="7.5.003" expanded="true" height="82" name="Write Excel" width="90" x="447" y="289"/>
      <operator activated="true" class="read_excel" compatibility="7.5.003" expanded="true" height="68" name="Read Excel" width="90" x="581" y="289">
        <list key="annotations"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <connect from_op="Loop Files" from_port="output 1" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_op="Write Excel" to_port="input"/>
      <connect from_op="Read Excel" from_port="output" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

You will need to add the file paths, etc., for it to work.

Regular Contributor

Re: Encoding problem

Hi @FBT ,

 

That solution does indeed work for this case, but in order to use certain operators (e.g. Execute Python) I need to figure out the encoding issue.

 

 

FBT
Super Contributor

Re: Encoding problem

Ok, I see. And, after reading through the old thread in detail, I totally understand your frustration. Also, apologies for suggesting a solution that was already proposed earlier by @sgenzer.

 

I have tried a couple of things with the CSV operators, but they just don't do what they are supposed to do. More precisely, the "Write CSV" operator is behaving strangely. I am no longer sure this is caused by the CSV column separator, as I have converted the data to types for which it shouldn't matter anymore and played around with different separators. Encoding does not seem to be the issue either: I have tried the encodings that correspond to your input files, and a few more, without any change in the result of "Write CSV".

 

Maybe @SGolbert can take a look at your process and input files at the top of this post and confirm that this is not a bug.

 

Not sure if it works for you, but if it does, the easiest and cleanest way may be to deal with it using the "Write Database" and "Read Database" operators. It's a bit more cumbersome than the CSV operators, but it works as expected and places the data in the correct columns without breaking it up.

 

 

Elite III

Re: Encoding problem

Maybe I'm making this too simple, but isn't the problem that you have newline characters in your text, so when the CSV is read back, each embedded newline is treated as the start of a new record?
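If that diagnosis is right, the failure can be reproduced entirely outside RapidMiner. Here is a minimal Python sketch of the idea (the `myData` column name is just illustrative): writing the value naively splits one document into several records, while proper CSV quoting, or stripping the line breaks before writing, keeps it whole.

```python
import csv
import io

# A document whose text contains embedded newlines, like the sample files.
text = "first line\nsecond line\nthird line"

# Naive write: dump the value straight into the file. On read-back,
# every embedded newline is taken as a record separator, so one
# document turns into three records.
naive_file = "myData\n" + text + "\n"
naive_records = naive_file.splitlines()[1:]  # skip the header row
print(len(naive_records))  # 3 records instead of 1

# Proper CSV quoting wraps the field in quotes, so the embedded
# newlines survive the round trip.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["myData"])
writer.writerow([text])
buf.seek(0)
rows = list(csv.reader(buf))
print(len(rows) - 1)       # 1 record
print(rows[1][0] == text)  # True: the text is unchanged

# Stripping \r and \n before writing (what the Remove Document Parts
# operators below do) also keeps one record, at the cost of losing
# the line breaks inside the text.
flattened = text.replace("\r", "").replace("\n", "")
```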

 

Try adding Remove Document Parts to get rid of those naughty newline characters.

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="concurrency:loop_files" compatibility="7.5.003" expanded="true" height="82" name="Loop Files" width="90" x="112" y="187">
        <parameter key="directory" value="C:\Users\Administrator\Downloads\data2"/>
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="7.5.000" expanded="true" height="68" name="Read Document" width="90" x="313" y="34"/>
          <operator activated="true" class="text:remove_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Remove Document Parts" width="90" x="447" y="34">
            <parameter key="deletion_regex" value="\r"/>
          </operator>
          <operator activated="true" class="text:remove_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Remove Document Parts (2)" width="90" x="648" y="34">
            <parameter key="deletion_regex" value="\n"/>
          </operator>
          <connect from_port="file object" to_op="Read Document" to_port="file"/>
          <connect from_op="Read Document" from_port="output" to_op="Remove Document Parts" to_port="document"/>
          <connect from_op="Remove Document Parts" from_port="document" to_op="Remove Document Parts (2)" to_port="document"/>
          <connect from_op="Remove Document Parts (2)" from_port="document" to_port="output 1"/>
          <portSpacing port="source_file object" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data" width="90" x="246" y="187">
        <parameter key="text_attribute" value="textspecial"/>
      </operator>
      <operator activated="true" class="write_csv" compatibility="7.5.003" expanded="true" height="82" name="Write CSV" width="90" x="447" y="187">
        <parameter key="csv_file" value="C:\Users\Administrator\Downloads\ifixthis\test3.csv"/>
      </operator>
      <operator activated="true" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="715" y="34">
        <parameter key="csv_file" value="C:\Users\Administrator\Downloads\ifixthis\test3.csv"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="myData.true.nominal.regular"/>
        </list>
      </operator>
      <connect from_op="Loop Files" from_port="output 1" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_op="Write CSV" to_port="input"/>
      <connect from_op="Read CSV" from_port="output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

 

If your data is already saved as CSV, you should still be able to read the files with the Read Document operator and split them into records with the Cut Documents operator.

-- Training, Consulting, Sales in China, Hong Kong & Taiwan --
www.RapidMinerChina.com
RMStaff

Re: Encoding problem

I will try to find a solution, but before that let me point out that it is not a good idea to force this data into tabular form.

 

I think that the right approach would be to bring the data into XML/JSON or a NoSQL database. I am a total beginner in that area, so input from an experienced user would be welcome. :)

RMStaff

Re: Encoding problem

Hi,

you should be able to use an embedded H2 database. It is a simple-to-use SQL database that can store any example set from RapidMiner. I have used it with huge example sets (huge in both rows and columns), yet it is still only a directory with a few files on your hard disk, instead of the headache of maintaining a full relational database management system.
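H2 itself is a Java library, so as an analogous illustration of the embedded-database idea, here is the same round trip using Python's built-in sqlite3 module (the table and column names are hypothetical). A database stores values rather than parsing delimiters, so embedded newlines and separators cannot break the columns:

```python
import sqlite3

# Text with embedded line breaks, which would confuse a naive CSV writer.
text = "first line\nsecond line\r\nthird line"

con = sqlite3.connect(":memory:")  # use a file path for a persistent store
con.execute("CREATE TABLE docs (myData TEXT)")
con.execute("INSERT INTO docs VALUES (?)", (text,))
(round_tripped,) = con.execute("SELECT myData FROM docs").fetchone()
print(round_tripped == text)  # True: no separator or newline mangling
con.close()
```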

There are situations which call for a NoSQL database, but this is not one of them ;-)

 

Regards,

Balázs

--
Balázs Bárány
Data Scientist, Vienna
https://datascientist.at