RapidMiner

Data exporting problem

Regular Contributor

 

Hi, 

 

I'm having an issue passing data. It looks fine in RapidMiner, but when I export it, it goes a bit weird.

 

My dataset has three attributes, all with string values, and the first attribute holds quite long strings (10,000+ words). The dataset looks and works fine in RapidMiner, but when I try to export to CSV some issues start to arise, and after the 35th row it all falls apart. I think this is because only a certain number of characters can fit in one cell.

 

This problem in itself is manageable. However, when I use the Python operator on the dataset, similar problems occur, and from what I see in the log the cause seems to be the same (does the Python operator export to CSV internally?).

 

So, in summary, I cannot use the Execute Python operator on my dataset. Am I missing something here? If not, is there a workaround so I can operate on relatively big (not massive) bodies of text in a reasonably small dataset (2,500 examples)?

 

Thanks

Alan

23 REPLIES
RMStaff

Re: Data exporting problem

Hi Alan,

 

I tried to reproduce your error, but it worked for me.

 

I generated an artificial dataset with R (size 240 MB):

 

# Build one long string of 10,000 repeated words
words <- paste(rep("Longword ", 10000), "")
words <- Reduce(paste, words)
cat1 <- c("A", "B", "C")
cat2 <- 1:10

# 2,500 rows: the long text plus two randomly sampled attributes
df <- data.frame(words, sample(cat1, 2500, replace = TRUE),
                 sample(cat2, 2500, replace = TRUE))

colnames(df) <- paste("Att", 1:3)

write.csv(df, "test.csv", row.names = FALSE)

 

I loaded it into RapidMiner and processed it with Python:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
        <parameter key="csv_file" value="C:\Users\SebastianGolbert\Documents\R\test.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="447" y="34">
        <parameter key="script" value="import pandas&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    print('Hello, world!')&#10;    # output can be found in Log View&#10;    print(type(data))&#10;&#10;    data.iloc[:, 2] = 999&#10;&#9;&#10;    # connect 2 output ports to see the results&#10;    return data"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

 

The process finishes and I can see the third attribute modified in RapidMiner:

 

[Image: test.png]

 

Can you share your process to see what's wrong?

RMStaff

Re: Data exporting problem

Hi Alan,

 

Just so that I understand you correctly: the Write CSV operator fails, right?

In that case, perhaps the value type of the column with the many words needs to be converted from Polynominal to Text.

 

Best,

Edin

 

Regular Contributor

Re: Data exporting problem

Hi @Edin_Klapic and @SGolbert ,

 

Thanks for taking the time to respond. I have included the process in this post. 

 

When I run the R-generated data on my machine it works fine, so the issue must be with my dataset or something else on my end.

 

As for the data type, the attribute of interest was defined as text, and I've tried changing it, but to no avail. Interestingly, in the generated dataset (which works perfectly) the attribute is defined as polynominal.

 

One possibility I was considering: maybe some funky characters in my text are causing the problem? Or is it an error carried forward from earlier in the process?

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve language2" width="90" x="179" y="136">
<parameter key="repository_entry" value="//Local Repository/data/language2"/>
</operator>
<operator activated="false" class="text_to_nominal" compatibility="7.5.003" expanded="true" height="82" name="Text to Nominal" width="90" x="313" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="text"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="136">
<parameter key="script" value="import pandas&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; print('Hello, world!')&#10; # output can be found in Log View&#10; print(type(data))&#10;&#10; data.iloc[:, 2] = 999&#10;&#9;&#10; # connect 2 output ports to see the results&#10; return data"/>
</operator>
<connect from_op="Retrieve language2" from_port="output" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
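
The "funky characters" theory can be checked before the data reaches Write CSV or Execute Python. Below is a minimal Python sketch, not something from the thread: the helper name `suspicious_characters` is made up for illustration, and it assumes the text has already been decoded to a Python string. It counts control, format, private-use and unassigned code points, plus the U+FFFD replacement character that usually signals a decoding failure.

```python
import unicodedata
from collections import Counter


def suspicious_characters(text):
    """Count characters that often indicate encoding or import damage.

    Hypothetical helper for illustration. Ordinary whitespace
    (space, tab, newline, carriage return) is ignored.
    """
    allowed_whitespace = {" ", "\t", "\n", "\r"}
    counts = Counter()
    for ch in text:
        if ch in allowed_whitespace:
            continue
        # General categories starting with "C" cover control (Cc), format (Cf),
        # surrogate (Cs), private-use (Co) and unassigned (Cn) code points.
        # U+FFFD is the replacement character inserted on decoding errors.
        if unicodedata.category(ch).startswith("C") or ch == "\ufffd":
            counts[ch] += 1
    return counts
```

Running this over each value of the text attribute and looking at the rows with non-empty counts would show whether the examples that break are also the ones with unusual characters.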

Attachments

Regular Contributor

Re: Data exporting problem

Here's the rest of the folder (I could not fit it in one post).

Attachments

RMStaff

Re: Data exporting problem

Hi Alan,

Looking at your input data, it seems something went wrong during the import. If you used the Read CSV operator, try setting the encoding parameter to "UTF-8". Another option is to load the CSV file within the Python script. To do so, use

data = pandas.read_csv("/path/to/your/file.csv")

and make sure the gibberish in the data is no longer present. By the way, you can set an encoding with `pandas.read_csv` as well: just add the parameter encoding="utf-8".
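
Put together, the suggestion might look like the sketch below. It is only an illustration: the helper name `load_text_csv` is made up, the path is the placeholder from above, and `dtype=str` with `keep_default_na=False` is an added assumption to keep long text columns as plain strings instead of letting pandas guess types or turn values such as "NA" into missing values.

```python
import pandas


def load_text_csv(path, encoding="utf-8"):
    """Read a CSV into a DataFrame, decoding it with an explicit encoding.

    Hypothetical helper for illustration.
    """
    # dtype=str keeps every column as text; keep_default_na=False stops
    # empty strings and words like "NA" from being converted to NaN.
    return pandas.read_csv(path, encoding=encoding, dtype=str, keep_default_na=False)


def rm_main():
    # With no input ports connected, rm_main takes no arguments.
    # Replace the placeholder path with the real file location.
    return load_text_csv("/path/to/your/file.csv")
```

Because the script reads the file itself, nothing has to be connected to the operator's input port.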

 

Regular Contributor

Re: Data exporting problem

Hi @pschlunder,

 

Thanks for the response. 

 

The problem is that this is a small step in a bigger pipeline, so I'd prefer to pass the data through RapidMiner rather than saving and reading documents on my local machine during the process.

 

The process I have attached is an example of what I want to do for this part of the pipeline: loop through 4,000 text files in a directory, convert them to a single dataset, create and remove some attributes, and then use the Python operator. The problem occurs at the end. If I view the output of the Generate Attributes operator, everything is as I would expect; however, as soon as I use Write CSV or Execute Python, everything goes wrong. I have tried UTF-8, which unfortunately does not fix it, and even if it did, I don't think that would solve the problem, as I would still have to save a CSV on my local machine.

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop_files" compatibility="7.5.003" expanded="true" height="82" name="Loop Files" width="90" x="45" y="85">
<parameter key="directory" value="C:\Users\alan.jeffares\Desktop\data\Last Version\Language_Detector"/>
<process expanded="true">
<operator activated="true" class="text:read_document" compatibility="7.5.000" expanded="true" height="68" name="Read Document" width="90" x="313" y="85"/>
<connect from_port="file object" to_op="Read Document" to_port="file"/>
<connect from_op="Read Document" from_port="output" to_port="output 1"/>
<portSpacing port="source_file object" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data (2)" width="90" x="179" y="85">
<parameter key="text_attribute" value="text"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="85">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="text|metadata_file"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.5.003" expanded="true" height="82" name="Generate Attributes" width="90" x="447" y="85">
<list key="function_descriptions">
<parameter key="test" value="if(ends(metadata_file, &quot;en.txt&quot;), &quot;en&quot;, if(ends(metadata_file, &quot;es.txt&quot;), &quot;es&quot;, if(ends(metadata_file, &quot;de.txt&quot;), &quot;de&quot;, if(ends(metadata_file, &quot;fr.txt&quot;), &quot;fr&quot;, &quot;Unknown&quot;)) ))"/>
</list>
</operator>
<operator activated="false" class="write_csv" compatibility="7.5.003" expanded="true" height="82" name="Write CSV" width="90" x="380" y="238">
<parameter key="csv_file" value="C:\Users\alan.jeffares\Documents\test1.csv"/>
<parameter key="encoding" value="UTF-8"/>
</operator>
<operator activated="false" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="246" y="238">
<parameter key="csv_file" value="C:\Users\alan.jeffares\Documents\test1.csv"/>
<list key="annotations"/>
<parameter key="encoding" value="UTF-8"/>
<list key="data_set_meta_data_information"/>
</operator>
<connect from_op="Loop Files" from_port="output 1" to_op="Documents to Data (2)" to_port="documents 1"/>
<connect from_op="Documents to Data (2)" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

 

Thanks

Alan

Regular Contributor

Re: Data exporting problem

I've gone through the entire pipeline again, and I really feel the issue comes down to the length of the string in a particular cell. Could that be it?

Regular Contributor

Re: Data exporting problem

OK, I believe I have confirmed the issue. When dealing with CSV files, the limit on the number of characters in any given cell appears to be 32,767. So when RapidMiner writes a CSV with more characters than this in a cell, the resulting file is messed up. Back to my problem of using Execute Python on this data: I believe the data is saved to a CSV somewhere inside the operator, so it is messed up when it comes out the other end.

 

I've edited the R script from earlier (below) to demonstrate this. If you examine the resulting CSV file and count the characters in any cell, you will find that they all have exactly 32,759 characters, whereas a correctly saved file would have many more.

 

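# One long string built from 10,000 repeated words
reps <- rep("longword ", 10000)
words1 <- paste(reps, " ")
words1 <- Reduce(paste, words1)
df <- data.frame(words1)
df1 <- data.frame(words1)

# Append ten more copies of the row (11 rows in total)
for (i in 1:10) {
  df1 <- rbind(df1, df)
}

write.csv(df1, "test.csv", row.names = FALSE)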
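
One caveat worth checking: 32,767 characters is also the per-cell limit in Microsoft Excel, while the CSV format itself sets no limit on field length. So the truncation may come from the program used to open the file rather than from the file. A minimal Python sketch for measuring the field lengths straight from the file follows; the function name `field_lengths` is made up for illustration, and it assumes a header row and UTF-8 encoding.

```python
import csv

# The csv module rejects fields longer than 131,072 characters by default,
# so raise the limit before reading very long text cells.
csv.field_size_limit(10_000_000)


def field_lengths(path, column=0, encoding="utf-8"):
    """Return the character length of one column for every data row in a CSV file.

    Hypothetical helper for illustration.
    """
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return [len(row[column]) for row in reader if row]
```

If `field_lengths("test.csv")` reports values well above 32,767, the file on disk is intact and the cut-off happens in whatever is used to view it.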

Community Manager

Re: Data exporting problem

Hi @alan_jeffares - I'm trying to view your process, but it does not validate for me for some reason. Could you please re-copy and paste it (you can use the "insert code </>" icon for easier readability)?

 

Thanks.


Scott