RapidMiner

Data exporting problem

Regular Contributor

Re: Data exporting problem

Ah, I was looking for the insert code symbol!

 

I think I've narrowed the problem down since then, so I'll explain here:

 

I've attached a CSV file. If I open Python and run this, I have no problem:

import pandas
data = pandas.read_csv("pathtofile\\test1.csv") #,  encoding="utf-8")
print(data)

However, if I use the Execute Python operator with this script, the problems occur in the output:

import pandas

def rm_main():
    data = pandas.read_csv("pathtofile\\test1.csv",  encoding="utf-8")
   
    return data

I cannot figure it out! Also, apologies for spamming the issue.

 

Community Manager

No worries about the queries, @alan_jeffares. That's what the forum is for.

 

I ran your new Python script and did not get any problems reading the CSV file. Here's my code:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="45" y="34">
        <parameter key="script" value="import pandas&#10;&#10;def rm_main():&#10;    data = pandas.read_csv(&quot;/Users/GenzerConsulting/Desktop/test1.csv&quot;,  encoding=&quot;utf-8&quot;)&#10;   &#10;    return data"/>
      </operator>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

and here's my output:

 

Screen Shot 2017-08-08 at 4.58.34 PM.png

 

Question - have you ensured that RapidMiner is running Python properly?  Go to RapidMiner -> Preferences -> Python Scripting and use the "Test" button.  It looks like this:

 

Screen Shot 2017-08-08 at 4.59.54 PM.png

 

Scott

Regular Contributor

Hi @sgenzer , 

 

What you've posted there is exactly my problem! There should only be 22 examples, not 276! If you run the code I sent in plain Python, it works fine and outputs 22 examples.

 

Python configuration is set up fine 

RMStaff

I can confirm that the length of the string is not an issue. If I generate an attribute with the formula:

 

length([Att 1])

 

I get the full values (159000 in my case, I used another dataset with longer words), even after executing python.

 


There is certainly a data import step somewhere along the pipeline that doesn't use the right encoding, which is why you see

 

cinturÃ³n instead of cinturón

 

Ã³rbita instead of órbita

 

etc.

 

Priority should be to get the encoding right.
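
To illustrate how this kind of garbling arises, here is a minimal sketch in plain Python. It assumes the file is stored as UTF-8 and is being decoded somewhere with a single-byte encoding (Latin-1 is used here as a stand-in; which encoding is actually applied in your pipeline is not known from this thread). The helper names `garble` and `repair` are made up for the example.

```python
def garble(text, wrong_encoding="latin-1"):
    """Simulate reading UTF-8 bytes with the wrong single-byte decoder."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled, wrong_encoding="latin-1"):
    """Undo the mistake; only possible while no bytes have been dropped."""
    return garbled.encode(wrong_encoding).decode("utf-8")


for word in ["cinturón", "órbita"]:
    broken = garble(word)
    print(f"{word!r} -> {broken!r} -> {repair(broken)!r}")
```

Each accented letter becomes two characters, because its UTF-8 form is two bytes and the single-byte decoder maps each byte to its own character.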

Regular Contributor

Hi @SGolbert ,

 

I'm a little bit confused by the solution. Just to go back to the problem in a simplified way:

 

I have the text data file that I attached 4 posts back in the thread. If I run the Python script on it, the output is usable: I get a dataframe of size (22,1) which I can pass on for further operations. However, if I run it in RapidMiner, the output is no longer usable: it creates 276 examples that I cannot work with.

 

Are you saying that this is an issue on my end? Am I using the Python operator incorrectly? Or is this a limitation of RapidMiner?

RMStaff

Hi Alan,

 

I think it is neither. The dataset "language2" has some problems of its own. It doesn't seem to have been loaded correctly (probably due to the wrong encoding). Once you load a file with the wrong encoding, the non-ASCII character information is lost (i.e. you don't know whether it is á, é, í, ó or ú).

 

So I think that the error is somewhere "upstream" in your workflow. You can specify the encoding both in Python and in RapidMiner. You would have to isolate the part of the workflow that loads or fetches the data and check whether the encoding is right.
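
As a rough way to check the encoding at the point where the file is loaded, something like the following sketch could help. The function name `first_working_encoding` and the candidate list are assumptions made for illustration, and the demo file is generated on the fly rather than taken from this thread. Note the limitation: a single-byte encoding such as cp1252 accepts almost any byte sequence, so "decodes without error" is a hint, not proof.

```python
import os
import tempfile


def first_working_encoding(path, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes the whole file, else None."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate cannot be right, try the next one
        return encoding
    return None


# Demo with a throwaway file that is deliberately saved as cp1252.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.csv")
    with open(sample_path, "w", encoding="cp1252", newline="") as handle:
        handle.write("text\ncinturón\nórbita\n")
    detected = first_working_encoding(sample_path)

print(detected)
```

UTF-8 is tried first because it is strict: a file that was saved in a single-byte encoding and contains accented letters will usually fail to decode as UTF-8, which is what happens in the demo.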

Regular Contributor

Hi @SGolbert ,

 

I'll have a look and see if I can sort out the encoding issues, but I'm still a bit confused as to why they cause so much trouble in my RapidMiner workflow. Even if RapidMiner were to interpret 'á' as 'a' or as something else, it should still be possible to tokenize the text and build a classifier on it. As we see in the Python output, some of the letters are interpreted wrongly, but it's still 22 rows of data and 1 column. In RapidMiner, however, on the same dataset, certain letters being interpreted wrongly turns my data set into 262 lines of garbage that are no longer useful. When you're web mining you're likely to run into all sorts of issues like this, but surely they shouldn't have such a terminal effect on the pipeline.

 

Alan 

RMStaff

On the contrary, the result should be terminal. In my opinion it would have been better if your process failed earlier, prompting you to specify the encoding.

 

If you get a more or less acceptable result, it is only by chance. Imagine that your source data was in Chinese: if you don't specify the correct encoding, there is no way that the process would work. The same applies to any web scraping. You need a way of determining the encoding (which could be automatic; I haven't done it yet).

 

Also, "á" doesn't translate to "a"; it changes to something like "#Character_missmatch" (in general a system-dependent symbol, though it should always be the same in Java).
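
A small sketch of that effect, written in Python rather than Java, so it shows Python's codec behaviour and not RapidMiner's internals. It assumes the bytes were written as Latin-1 and are then decoded as UTF-8; the sample string is made up.

```python
# Bytes as a Latin-1 writer would have produced them.
latin1_bytes = "más órbita".encode("latin-1")

# A strict UTF-8 decoder refuses the data outright.
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# A lenient decoder substitutes U+FFFD (the replacement character)
# for every byte it cannot interpret; the original letter is gone.
lenient = latin1_bytes.decode("utf-8", errors="replace")

print(strict_failed)
print(lenient)
print(lenient.count("\ufffd"))
```

Both "á" and "ó" end up as the same replacement character, so the original letters can no longer be told apart afterwards.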

 

As an example, try to read the attached file with different encodings: the results are completely different.

Regular Contributor

@SGolbert ,

 

Have a look at this example! I create a text document manually and write three lines into it. I then save it to my machine and load it via RapidMiner (Read Document). I then use Documents to Data to make a dataframe of one cell. Finally I put it through Execute Python (doing nothing) and all of a sudden it's two cells! I think the blank lines mess up the operator?

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:read_document" compatibility="7.5.000" expanded="true" height="68" name="Read Document (2)" width="90" x="112" y="85">
        <parameter key="file" value="C:\Users\alan.jeffares\Desktop\folder\texttest.txt"/>
      </operator>
      <operator activated="true" breakpoints="after" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data (2)" width="90" x="246" y="85">
        <parameter key="text_attribute" value="text"/>
        <parameter key="add_meta_information" value="false"/>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python (2)" width="90" x="380" y="85">
        <parameter key="script" value="import pandas&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;  &#10;    return data"/>
      </operator>
      <connect from_op="Read Document (2)" from_port="output" to_op="Documents to Data (2)" to_port="documents 1"/>
      <connect from_op="Documents to Data (2)" from_port="example set" to_op="Execute Python (2)" to_port="input 1"/>
      <connect from_op="Execute Python (2)" from_port="output 1" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <description align="center" color="yellow" colored="false" height="61" resized="true" width="184" x="212" y="181">We read in a text file and put the string in one cell</description>
      <description align="center" color="purple" colored="true" height="105" resized="false" width="180" x="498" y="131">However after going through the Execute Python operator it is now split over two cells</description>
    </process>
  </operator>
</process>

 Note:

I cannot attach a text file, so just create your own with the following text:

this is text followed by blank lines

this is more text


and even more here
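
To show why blank lines inside one cell could matter, here is a sketch using Python's standard `csv` module. It assumes, without confirmation from this thread, that the data passes through some CSV-like text format on its way in or out of Execute Python; it does not reproduce the operator itself.

```python
import csv
import io

# The sample text from the note above, stored as ONE cell.
cell = (
    "this is text followed by blank lines\n\n"
    "this is more text\n\n\n"
    "and even more here"
)

# Write a header row and the single cell; the writer quotes the cell
# because it contains line breaks.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text"])
writer.writerow([cell])
serialized = buffer.getvalue()

# Counting physical lines versus parsing with a quote-aware reader.
physical_lines = serialized.splitlines()
records = list(csv.reader(io.StringIO(serialized, newline="")))

print(len(physical_lines))  # header plus every line inside the cell
print(len(records))         # header plus one data row
```

A quote-aware reader gets the single cell back unchanged, while anything that treats each physical line as a row sees several rows.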

Community Manager

Good morning @SGolbert,

 

OK, back on this puzzle. Just going back to the test1.csv file and the 22 vs 276 rows issue: I found pretty quickly that if you save test1.csv as an xlsx file (in Excel) and then use the Read Excel operator instead of Read CSV, you import the 22 rows with no problem:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="false" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="Execute Python" width="90" x="45" y="34">
        <parameter key="script" value="import pandas&#10;&#10;def rm_main():&#10;    data = pandas.read_csv(&quot;/Users/GenzerConsulting/Desktop/test1.csv&quot;,  encoding=&quot;utf-8&quot;)&#10;   &#10;    return data"/>
      </operator>
      <operator activated="false" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="136">
        <list key="annotations"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" class="read_excel" compatibility="7.5.003" expanded="true" height="68" name="Read Excel" width="90" x="45" y="238">
        <parameter key="excel_file" value="/Users/genzerconsulting/Desktop/test1.xlsx"/>
        <parameter key="imported_cell_range" value="A1:A23"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="text.true.polynominal.attribute"/>
        </list>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

You still have some funky accented-character problems, which are due to the encoding issues mentioned earlier, but I think it's clear that a key problem is that RapidMiner's default CSV row parsing (escape character) is not splitting rows in the right places for your file. I am not a Python programmer, but I will play a bit more with RapidMiner's built-in features to see if I can find the right RegEx for this file:

 

Screen Shot 2017-08-10 at 10.58.52 AM.png

 

This has happened to me many many times.  Stay tuned...
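
For anyone who wants to see the effect of the parser settings in isolation, here is a rough sketch with Python's standard `csv` module standing in for a CSV reader. It is not RapidMiner's Read CSV, the sample data is made up, and whether quote handling is really the cause for test1.csv is still an open question.

```python
import csv
import io

# Made-up data: a header, one quoted cell with a line break, one plain cell.
raw = 'text\r\n"first line\nsecond line"\r\n"single line"\r\n'

# Quote-aware parsing keeps the line break inside the quoted cell.
quote_aware = list(csv.reader(io.StringIO(raw, newline="")))

# With quote handling switched off, every physical line becomes a row
# and the quote characters stay in the data.
quote_blind = list(
    csv.reader(io.StringIO(raw, newline=""), quoting=csv.QUOTE_NONE)
)

print(len(quote_aware), quote_aware)
print(len(quote_blind), quote_blind)
```

The same bytes give a different row count depending only on how the reader treats quotes, which is the kind of mismatch to look for when 22 rows turn into many more.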

 

Scott