RapidMiner

Data exporting problem

Community Manager

Re: Data exporting problem

OK, I have a better handle on it. The issue is, as I suspected before, that the column separator is not being caught correctly. RM is reading the intended column separator as "missing" rather than as a recognizable Unicode character. I played around with regular expressions for a while to try to catch it but could not. However, I was able to create a hack that does basically the same thing:

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
        <parameter key="csv_file" value="/Users/genzerconsulting/Desktop/test1.csv"/>
        <parameter key="column_separators" value="\n"/>
        <list key="annotations"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" class="replace_missing_values" compatibility="7.5.003" expanded="true" height="103" name="Replace Missing Values" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="text"/>
        <parameter key="default" value="value"/>
        <list key="columns"/>
        <parameter key="replenishment_value" value="&amp;&amp;&amp;&amp;&amp;"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents (2)" width="90" x="313" y="34">
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights">
          <parameter key="text" value="1.0"/>
        </list>
      </operator>
      <operator activated="true" class="text:combine_documents" compatibility="7.5.000" expanded="true" height="82" name="Combine Documents (2)" width="90" x="447" y="34"/>
      <operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data (2)" width="90" x="581" y="34">
        <parameter key="text_attribute" value="text"/>
        <parameter key="add_meta_information" value="false"/>
      </operator>
      <operator activated="true" class="split" compatibility="7.5.003" expanded="true" height="82" name="Split" width="90" x="715" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="text"/>
        <parameter key="split_pattern" value="&amp;&amp;&amp;&amp;&amp;"/>
      </operator>
      <operator activated="true" class="transpose" compatibility="7.5.003" expanded="true" height="82" name="Transpose" width="90" x="849" y="34"/>
      <connect from_op="Read CSV" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Replace Missing Values" from_port="example set output" to_op="Data to Documents (2)" to_port="example set"/>
      <connect from_op="Data to Documents (2)" from_port="documents" to_op="Combine Documents (2)" to_port="documents 1"/>
      <connect from_op="Combine Documents (2)" from_port="document" to_op="Documents to Data (2)" to_port="documents 1"/>
      <connect from_op="Documents to Data (2)" from_port="example set" to_op="Split" to_port="example set input"/>
      <connect from_op="Split" from_port="example set output" to_op="Transpose" to_port="example set input"/>
      <connect from_op="Transpose" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

It's not pretty, but it does work. :) I'll keep poking around to see if I can get to the bottom of it. As mentioned earlier, fixing the encoding upstream so that the data comes into RM clean would probably be a smoother way to go.
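
For anyone who wants to try the upstream route, here is a minimal Python sketch of what "fixing the encoding" could look like before the file ever reaches RM. It is only an illustration under assumptions: the function name `normalize_text`, the default source encoding, and the list of "unusual" break characters are all hypothetical, since the actual separator in test1.csv has not been identified in this thread.

```python
# Hypothetical list of characters that some tools emit instead of "\n".
# Which one (if any) appears in the real file is an assumption.
UNUSUAL_BREAKS = ("\u2028", "\u2029", "\x0b", "\x0c", "\x85")


def normalize_text(raw: bytes, source_encoding: str = "utf-8") -> str:
    """Decode raw bytes and map unusual line breaks to a plain newline."""
    # Undecodable bytes become U+FFFD instead of raising an error.
    text = raw.decode(source_encoding, errors="replace")
    # Unify Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Then map the less common break characters to "\n" as well.
    for ch in UNUSUAL_BREAKS:
        text = text.replace(ch, "\n")
    return text


if __name__ == "__main__":
    sample = "first\u2028second\r\nthird".encode("utf-8")
    print(normalize_text(sample).split("\n"))
```

The cleaned text would then be written back out as UTF-8 and read with Read CSV as usual.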

 

Scott

Regular Contributor

Re: Data exporting problem

Hi @sgenzer,

That is a pretty impressive solution that you put together there. I'm still trying to figure out how it works!

I think my original problem can be explained as follows (correct me if I'm wrong):

1) A CSV file can only keep 32000 characters in a single cell (see my R code in the 9th post on this thread for a demonstration). This is a CSV problem, not a RapidMiner issue.

2) This is the main issue. The Execute Python operator is not able to handle blank lines (see the Python example in my previous post). I think this is a RapidMiner issue.

3) I have some encoding issues in my data, which have certainly not helped with the mess and may also be causing problems within RapidMiner operators.

Is this a fair assessment of the issue so far? I would strongly urge you to check out the example that I highlighted in point 2).

Community Manager

Re: Data exporting problem

Thanks, @alan_jeffares. Below is the same process with some annotations so you can see what I was doing with this "hack":

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" breakpoints="after" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="136">
        <parameter key="csv_file" value="/Users/genzerconsulting/Desktop/test1.csv"/>
        <parameter key="column_separators" value="\n"/>
        <list key="annotations"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" breakpoints="after" class="replace_missing_values" compatibility="7.5.003" expanded="true" height="103" name="Replace Missing Values" width="90" x="179" y="136">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="text"/>
        <parameter key="default" value="value"/>
        <list key="columns"/>
        <parameter key="replenishment_value" value="&amp;&amp;&amp;&amp;&amp;"/>
        <description align="center" color="transparent" colored="false" width="126">take those rows that are marked missing (?) and replace them with &amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp; as a placeholder</description>
      </operator>
      <operator activated="true" breakpoints="after" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents (2)" width="90" x="313" y="136">
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights">
          <parameter key="text" value="1.0"/>
        </list>
        <description align="center" color="transparent" colored="false" width="126">convert the dataset to a series of documents (one per row)</description>
      </operator>
      <operator activated="true" breakpoints="after" class="text:combine_documents" compatibility="7.5.000" expanded="true" height="82" name="Combine Documents (2)" width="90" x="447" y="136">
        <description align="center" color="transparent" colored="false" width="126">combine all the documents into one</description>
      </operator>
      <operator activated="true" breakpoints="after" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data (2)" width="90" x="581" y="136">
        <parameter key="text_attribute" value="text"/>
        <parameter key="add_meta_information" value="false"/>
        <description align="center" color="transparent" colored="false" width="126">put this one merged document back into a dataset with only one example and one attribute</description>
      </operator>
      <operator activated="true" breakpoints="after" class="split" compatibility="7.5.003" expanded="true" height="82" name="Split" width="90" x="715" y="136">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="text"/>
        <parameter key="split_pattern" value="&amp;&amp;&amp;&amp;&amp;"/>
        <description align="center" color="transparent" colored="false" width="126">split this one &amp;quot;cell&amp;quot; into 22 attributes using the &amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp; placeholder as the marker</description>
      </operator>
      <operator activated="true" breakpoints="after" class="transpose" compatibility="7.5.003" expanded="true" height="82" name="Transpose" width="90" x="849" y="136">
        <description align="center" color="transparent" colored="false" width="126">flip the dataset around so that it has one attribute and 22 rows</description>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Replace Missing Values" from_port="example set output" to_op="Data to Documents (2)" to_port="example set"/>
      <connect from_op="Data to Documents (2)" from_port="documents" to_op="Combine Documents (2)" to_port="documents 1"/>
      <connect from_op="Combine Documents (2)" from_port="document" to_op="Documents to Data (2)" to_port="documents 1"/>
      <connect from_op="Documents to Data (2)" from_port="example set" to_op="Split" to_port="example set input"/>
      <connect from_op="Split" from_port="example set output" to_op="Transpose" to_port="example set input"/>
      <connect from_op="Transpose" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
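
If it helps to see the same idea outside RM, here is a rough Python analogue of the annotated steps. It is a sketch, not a translation of the operators: the function name `regroup` is hypothetical, missing rows are modelled as `None`, and joining rows with a newline and dropping empty pieces are assumptions rather than confirmed behaviour of Combine Documents or Split.

```python
PLACEHOLDER = "&&&&&"  # same marker as in the process above


def regroup(rows, placeholder=PLACEHOLDER):
    """Merge rows into one string, then cut it at the missing-row markers."""
    # Replace Missing Values: mark every missing row with the placeholder.
    marked = [placeholder if row is None else row for row in rows]
    # Data to Documents + Combine Documents: one big text
    # (joining with "\n" is an assumption).
    merged = "\n".join(marked)
    # Split + Transpose: one record per piece between placeholders.
    pieces = merged.split(placeholder)
    # Dropping empty pieces is a choice made for this sketch.
    return [piece.strip() for piece in pieces if piece.strip()]


if __name__ == "__main__":
    rows = ["line 1", "line 2", None, "line 3", None, "line 4", "line 5"]
    for record in regroup(rows):
        print(repr(record))
```

Each returned string corresponds to one of the rows that the Transpose step produces at the end.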

As for the Python vs. Execute Python issue, again, I'm not a Python guy, so I can't really go much further myself. I will try to pass this along to folks higher up the food chain than me and see if we can get more info.

Scott

RMStaff

Re: Data exporting problem

Hi guys,

I still cannot identify the issue, sorry. I tried the text file with blank lines and saw no problem using Read CSV; it treats several consecutive blank lines as one blank line.

Regarding the length of a CSV cell, there is no limit. I have provided an example with over 150k characters, and it works even when the type is set to polynominal.
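
The same point can be checked independently of RapidMiner. The sketch below uses Python's standard `csv` module (not Read CSV) to write and read back a single cell of 150,000 characters; the function name `roundtrip_long_cell` is hypothetical. Note that Python's reader has its own per-field limit (131072 characters by default), which the sketch raises first. That is a limit of the tool, not of the CSV format.

```python
import csv
import io


def roundtrip_long_cell(n_chars: int = 150_000) -> int:
    """Write one very long cell to CSV, read it back, return its length."""
    # Python's csv reader rejects fields above its own limit
    # (default 131072), so raise the limit for this demonstration.
    csv.field_size_limit(max(n_chars + 1, 131_072))

    buffer = io.StringIO()
    csv.writer(buffer).writerow(["id-1", "x" * n_chars])

    buffer.seek(0)
    row = next(csv.reader(buffer))
    return len(row[1])


if __name__ == "__main__":
    print(roundtrip_long_cell())
```

If the length that comes back matches the length that was written, the truncation seen earlier is more likely to come from a specific program in the chain than from CSV itself.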

 

I've looked at the test1.csv file. I opened it in my text editor and cannot see where the separation should be, so I don't expect the operator to be able to find it.
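
Since a text editor may not display every character, one way to look for a hidden separator is to count the non-printable and non-ASCII characters in the file. The Python sketch below does this for a string; the function name `unusual_characters` is hypothetical, and it assumes the file has already been decoded to text with a suitable encoding.

```python
from collections import Counter
import unicodedata


def unusual_characters(text: str) -> dict:
    """Count characters that are non-printable or outside plain ASCII."""
    counts = Counter(
        ch for ch in text if not ch.isprintable() or ord(ch) > 126
    )
    # Label each character with its code point and Unicode name;
    # control characters such as "\n" have no name, hence the default.
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}": n
        for ch, n in counts.items()
    }


if __name__ == "__main__":
    sample = "row one\u2028row two\nrow three"
    for label, n in unusual_characters(sample).items():
        print(label, n)
```

A character that shows up roughly once per expected record would be a candidate for the separator.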

 

So once again my conclusion is that the problem lies in the data preparation.

 

Kind regards,

Sebastian