Data exporting problem

alan_jeffares Member Posts: 20 Contributor I
edited November 2018 in Help

 

Hi, 

 

I'm having an issue passing data. It looks fine in RapidMiner, but when I export it, things go a bit weird. 

 

My dataset has three attributes, all with string values; the first attribute has quite long values (10,000+ words). The dataset looks and works fine in RapidMiner, but when I try to export to CSV some issues start to arise, and after the 35th row it all falls apart. I think this problem is due to only being able to fit a certain number of characters in one cell. 

 

This problem in itself is manageable. However, when I try to use the Python operator on the dataset, similar problems occur, and from what I see in the log they seem to have the same cause (does the Python operator export to CSV internally?). 

 

So, in summary, I cannot use the Execute Python operator on my dataset. Am I missing something here? If not, is there a workaround so I can operate on relatively large (not massive) bodies of text in a reasonably small dataset (2,500 examples)?

 

Thanks

Alan

Answers

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Hi Alan,

     

    I tried to reproduce your error, but it worked for me.

     

    I generated an artificial dataset with R (size: 240 MB):

     

    words <- paste(rep("Longword ", 10000), "")
    words <- Reduce(paste, words)
    cat1 <- c("A", "B", "C")
    cat2 <- 1:10

    df <- data.frame(words, sample(cat1, 2500, replace = T),
                     sample(cat2, 2500, replace = T))

    colnames(df) <- paste("Att", 1:3)

    write.csv(df, "test.csv", row.names = F)

     

    I loaded it into RapidMiner and processed it with Python:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
    <parameter key="csv_file" value="C:\Users\SebastianGolbert\Documents\R\test.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="447" y="34">
    <parameter key="script" value="import pandas&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; print('Hello, world!')&#10; # output can be found in Log View&#10; print(type(data))&#10;&#10; data.iloc[:, 2] = 999&#10;&#9;&#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Execute Python" to_port="input 1"/>
    <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

     

    The process finishes and I can see the third attribute modified in RapidMiner:

     

    test.png

     

    Can you share your process so we can see what's wrong?

  • Edin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist

    Hi Alan,

     

    Just so I understand you correctly: the Write CSV operator fails, right?

    In that case, perhaps the value type of the column with the many words needs to be converted from Polynominal to Text.

     

    Best,

    Edin

     

  • alan_jeffares Member Posts: 20 Contributor I

    Hi @Edin_Klapic and @SGolbert ,

     

    Thanks for taking the time to respond. I have included the process in this post. 

     

    When I run the R-generated data on my machine it works fine, so it must be an issue local to my dataset; the problem seems to be on my end.

     

    As for the data type, the attribute of interest was defined as text, and I've tried playing around with changing it, but to no avail. Interestingly, I noticed that in the generated dataset (which works perfectly) the attribute is defined as polynominal. 

     

    One possible issue I was thinking of: maybe there are some funky characters in my text that are causing the problem? Or is it an error carried forward in the process? 

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve language2" width="90" x="179" y="136">
    <parameter key="repository_entry" value="//Local Repository/data/language2"/>
    </operator>
    <operator activated="false" class="text_to_nominal" compatibility="7.5.003" expanded="true" height="82" name="Text to Nominal" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="136">
    <parameter key="script" value="import pandas&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; print('Hello, world!')&#10; # output can be found in Log View&#10; print(type(data))&#10;&#10; data.iloc[:, 2] = 999&#10;&#9;&#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <connect from_op="Retrieve language2" from_port="output" to_op="Execute Python" to_port="input 1"/>
    <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

  • alan_jeffares Member Posts: 20 Contributor I

    Here's the rest of the folder (could not fit it in one post).

  • pschlunder Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 96 RM Research

    Hi Alan,

    looking at your input data, it seems something went wrong during the import process. If you used the Read CSV operator, try setting the encoding parameter to "UTF-8". Another option would be to load the CSV file within the Python script. To do that, use

    data = pandas.read_csv("/path/to/your/file.csv")

    and make sure the gibberish in the data isn't present anymore. By the way, you can set an encoding with `pandas.read_csv` as well; just add the parameter `encoding="utf-8"`.
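    As a minimal illustration of that encoding parameter (using an in-memory buffer in place of the real file, with made-up sample content):

    ```python
    import io
    import pandas as pd

    # A tiny UTF-8 "file" standing in for the real CSV on disk
    raw = "text\ncinturón\nórbita\n".encode("utf-8")

    # Reading with an explicit encoding keeps the accented characters intact
    data = pd.read_csv(io.BytesIO(raw), encoding="utf-8")
    print(data["text"].tolist())  # ['cinturón', 'órbita']
    ```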

     

  • alan_jeffares Member Posts: 20 Contributor I

    Hi @pschlunder,

     

    Thanks for the response. 

     

    So the problem is that this is a small process that is part of a bigger pipeline, so I'd prefer to pass the data through RapidMiner and not save and read documents from my local machine during the process. 

     

    The process I have attached is an example of what I am looking to do for this part of the pipeline: basically, loop through 4,000 text files in a directory, convert them to a single dataset, create/remove some attributes, and then use the Python operator. The problem occurs at the end. If I view the output of the Generate Attributes operator, everything is how I would expect it; however, as soon as I use Write CSV or Execute Python, everything goes wrong. I have tried UTF-8, which unfortunately does not fix it, and even if it did, I don't think that would solve the problem, as I would still have to save a CSV locally. 

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="concurrency:loop_files" compatibility="7.5.003" expanded="true" height="82" name="Loop Files" width="90" x="45" y="85">
    <parameter key="directory" value="C:\Users\alan.jeffares\Desktop\data\Last Version\Language_Detector"/>
    <process expanded="true">
    <operator activated="true" class="text:read_document" compatibility="7.5.000" expanded="true" height="68" name="Read Document" width="90" x="313" y="85"/>
    <connect from_port="file object" to_op="Read Document" to_port="file"/>
    <connect from_op="Read Document" from_port="output" to_port="output 1"/>
    <portSpacing port="source_file object" spacing="0"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data (2)" width="90" x="179" y="85">
    <parameter key="text_attribute" value="text"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="85">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="text|metadata_file"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.5.003" expanded="true" height="82" name="Generate Attributes" width="90" x="447" y="85">
    <list key="function_descriptions">
    <parameter key="test" value="if(ends(metadata_file, &quot;en.txt&quot;), &quot;en&quot;, if(ends(metadata_file, &quot;es.txt&quot;), &quot;es&quot;, if(ends(metadata_file, &quot;de.txt&quot;), &quot;de&quot;, if(ends(metadata_file, &quot;fr.txt&quot;), &quot;fr&quot;, &quot;Unknown&quot;)) ))"/>
    </list>
    </operator>
    <operator activated="false" class="write_csv" compatibility="7.5.003" expanded="true" height="82" name="Write CSV" width="90" x="380" y="238">
    <parameter key="csv_file" value="C:\Users\alan.jeffares\Documents\test1.csv"/>
    <parameter key="encoding" value="UTF-8"/>
    </operator>
    <operator activated="false" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="246" y="238">
    <parameter key="csv_file" value="C:\Users\alan.jeffares\Documents\test1.csv"/>
    <list key="annotations"/>
    <parameter key="encoding" value="UTF-8"/>
    <list key="data_set_meta_data_information"/>
    </operator>
    <connect from_op="Loop Files" from_port="output 1" to_op="Documents to Data (2)" to_port="documents 1"/>
    <connect from_op="Documents to Data (2)" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
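    For reference, the loop-and-label portion of this process can be sketched in plain pandas (a hypothetical equivalent, not a drop-in replacement; the suffix rules follow the Generate Attributes expression, and `folder` stands in for the Language_Detector directory):

    ```python
    import pandas as pd
    from pathlib import Path

    def load_texts(folder):
        """Rough pandas analogue of Loop Files -> Documents to Data -> Generate Attributes."""
        rows = []
        for path in sorted(Path(folder).glob("*.txt")):
            text = path.read_text(encoding="utf-8")  # choose the encoding explicitly
            # same filename-suffix rule as the Generate Attributes expression
            lang = next((c for c in ("en", "es", "de", "fr")
                         if path.name.endswith(c + ".txt")), "Unknown")
            rows.append({"text": text, "metadata_file": path.name, "test": lang})
        return pd.DataFrame(rows)
    ```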

     

    Thanks

    Alan

  • alan_jeffares Member Posts: 20 Contributor I

    I've gone through the entire pipeline again, and I really feel the issue comes down to the length of the string in a particular cell. Could that be it? 

  • alan_jeffares Member Posts: 20 Contributor I

    OK, I believe I have confirmed the issue. When dealing with CSV files, the limit on the number of characters in any given cell is 32,767. Thus, when you write a CSV from RapidMiner with more than this many characters in a cell, the resulting file is messed up. Now, back to my problem of using Execute Python on this data: I believe the data is saved to a CSV somewhere internally within the operator, so it is messed up when it comes out the other end. 

     

    I've edited the R script from earlier to demonstrate this: 

    If you examine the resulting CSV file and do a character count on any cell, you will find they all have exactly 32,759 characters, whereas if it had saved correctly there would be many more.

     

    rep <- rep("longword ", 10000)
    words1 <- paste(rep, " ")
    words1 <- Reduce(paste, words1)
    df <- data.frame(words1)
    df1 <- data.frame(words1)

    for (i in 1:10) {
      df1 <- rbind(df1, df)
    }

    write.csv(df1, "test.csv", row.names = F)
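    For what it's worth, 32,767 is Excel's documented per-cell character limit; the CSV format itself, and Python's csv/pandas stack, impose no such limit. An all-Python round trip (a sketch mirroring the R script above, entirely in memory) keeps the full cell, which can help tell whether the truncation happens at write time or in whatever tool is displaying the file:

    ```python
    import io
    import pandas as pd

    # One cell holding far more than 32,767 characters
    long_text = "longword " * 10000               # 90,000 characters
    df = pd.DataFrame({"words1": [long_text] * 11})

    # Round-trip through CSV entirely in Python
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    buf.seek(0)
    back = pd.read_csv(buf)

    print(len(back))                             # 11
    print(int(back["words1"].str.len().max()))   # 90000 -- nothing truncated
    ```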

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hi @alan_jeffares - I'm trying to view your process but it does not validate for me for some reason. Could you please re-copy and paste it? (You can use the "insert code </>" icon for easier readability. :) )

     

    Thanks.


    Scott

  • alan_jeffares Member Posts: 20 Contributor I

    Ah, I was looking for the insert code symbol!

     

    I think I've narrowed down the problem a bit more since then, so I'll explain here: 

     

    I've attached a CSV file. If I open Python and run this, I have no problem:

    import pandas
    data = pandas.read_csv("pathtofile\\test1.csv") #, encoding="utf-8")
    print(data)

    However, if I use the Execute Python operator with these lines, the problems occur in the output:

    import pandas

    def rm_main():
        data = pandas.read_csv("pathtofile\\test1.csv", encoding="utf-8")
        return data

    I cannot figure it out! Also, apologies for spamming the thread.

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    no worries about the queries, @alan_jeffares.  That's what the forum is for.

     

    I ran your new Python script and did not get any problems reading the CSV file.  Here's my code:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="45" y="34">
    <parameter key="script" value="import pandas&#10;&#10;def rm_main():&#10; data = pandas.read_csv(&quot;/Users/GenzerConsulting/Desktop/test1.csv&quot;, encoding=&quot;utf-8&quot;)&#10; &#10; return data"/>
    </operator>
    <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    and here's my output:

     

    Screen Shot 2017-08-08 at 4.58.34 PM.png

     

    Question - have you ensured that RapidMiner is running Python properly?  Go to RapidMiner -> Preferences -> Python Scripting and use the "Test" button.  It looks like this:

     

    Screen Shot 2017-08-08 at 4.59.54 PM.png

     

    Scott

  • alan_jeffares Member Posts: 20 Contributor I

    Hi @sgenzer , 

     

    What you've posted there is exactly my problem! There should only be 22 examples, not 276! If you run the Python code that I sent, it works fine and outputs 22 examples.

     

    The Python configuration is set up fine. 

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    I can confirm that the length of the string is not an issue. If I generate an attribute with the formula:

     

    length([Att 1])

     

    I get the full values (159,000 in my case; I used another dataset with longer words), even after executing Python.

     


    There is surely a data import somewhere along the pipeline that doesn't use the right encoding; that's why you see

     

    cinturÃ³n instead of cinturón

     

    Ã³rbita instead of órbita

     

    etc.

     

    Priority should be to get the encoding right.
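    That substitution pattern is the classic UTF-8-read-as-Latin-1 mojibake. A two-line Python illustration of the mechanism (not from the original thread):

    ```python
    # The UTF-8 bytes of "ó" (0xC3 0xB3) decoded as Latin-1 become "Ã³"
    s = "cinturón"
    garbled = s.encode("utf-8").decode("latin-1")
    print(garbled)  # cinturÃ³n

    # If no bytes were dropped along the way, the damage is reversible
    print(garbled.encode("latin-1").decode("utf-8"))  # cinturón
    ```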

  • alan_jeffares Member Posts: 20 Contributor I

    Hi @SGolbert ,

     

    I'm a little bit confused by the solution. Just to restate the problem in a simplified way:

     

    I have the text data file that I attached four posts back in this thread. If I run the Python script on it, the output is usable: a dataframe of size (22, 1) which I can pass on for further operations. However, if I run it in RapidMiner, the output is no longer usable; it creates 276 examples. 

     

    Are you saying that this is an issue on my end? Am I using the Python operator incorrectly? Or is this a limitation of RapidMiner?

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Hi Alan,

     

    I think it is neither. The dataset "language2" has some problems of its own. It doesn't seem to have been loaded correctly (probably due to a wrong encoding). And once you load a file with the wrong encoding, the non-ANSI character information is lost (i.e. you don't know whether a character was á, é, í, ó or ú).

     

    So I think the error is somewhere "upstream" in your workflow. You can specify the encoding in both Python and RapidMiner; you would have to isolate the part of the workflow that loads or gets the data and check whether the encoding is right.

  • alan_jeffares Member Posts: 20 Contributor I

    Hi @SGolbert ,

     

    I'll have a look and see if I can sort out the encoding issues, but I'm still a bit confused as to why it causes so many issues in my RapidMiner workflow. I mean, if RapidMiner were to interpret 'á' as 'a' or as something else, it should still be possible to tokenize the text and build a classifier on it. As we see in the Python output, some of the letters are interpreted wrongly, but it's still 22 rows of data and 1 column. However, in RapidMiner, on the same dataset, certain letters being interpreted wrongly turns my dataset into 262 lines of garbage that are no longer useful. I mean, when you're web mining, it's likely you will find all sorts of issues like this, but surely it shouldn't have such a terminal effect on the pipeline.

     

    Alan 

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    On the contrary, the result should be terminal. In my opinion, it would have been better if your process had failed earlier, prompting you to specify the encoding.

     

    If you get a more or less acceptable result, it is only by chance. Imagine that your source data was in Chinese: if you don't specify the correct encoding, there is no way the process would work. The same applies to any web scraping; you have to have a way of determining the encoding (which could be automatic, though I haven't done that yet).

     

    And also, "á" doesn't translate to "a"; it changes to something like "#Character_mismatch" (a system-dependent symbol in general, though it should always be the same in Java). 

     

    As an example, try to read the attached file with different encodings; the results are diametrically different.

  • alan_jeffares Member Posts: 20 Contributor I

    @SGolbert ,

     

    Have a look at this example! I create a text document manually and write three lines into it. I then save it to my machine and load it via RapidMiner (Read Document). I then use Documents to Data to make a dataset of one cell. Finally, I put it through Execute Python (doing nothing), and all of a sudden it's two cells! I think the blank lines mess up the operator.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:read_document" compatibility="7.5.000" expanded="true" height="68" name="Read Document (2)" width="90" x="112" y="85">
    <parameter key="file" value="C:\Users\alan.jeffares\Desktop\folder\texttest.txt"/>
    </operator>
    <operator activated="true" breakpoints="after" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data (2)" width="90" x="246" y="85">
    <parameter key="text_attribute" value="text"/>
    <parameter key="add_meta_information" value="false"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python (2)" width="90" x="380" y="85">
    <parameter key="script" value="import pandas&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; &#10; return data"/>
    </operator>
    <connect from_op="Read Document (2)" from_port="output" to_op="Documents to Data (2)" to_port="documents 1"/>
    <connect from_op="Documents to Data (2)" from_port="example set" to_op="Execute Python (2)" to_port="input 1"/>
    <connect from_op="Execute Python (2)" from_port="output 1" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="61" resized="true" width="184" x="212" y="181">We read in a text file and put the string in one cell</description>
    <description align="center" color="purple" colored="true" height="105" resized="false" width="180" x="498" y="131">However after going through the Execute Python operator it is now split over two cells</description>
    </process>
    </operator>
    </process>

     Note:

    I cannot attach a text file, so just create your own with the following text:

    this is text followed by blank lines

    this is more text


    and even more here
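
    For comparison, outside RapidMiner the standard csv module keeps such a multi-line value in a single cell, because the field gets quoted. A minimal check using the sample text above:

    ```python
    import io
    import csv

    # The three-line example from the post, blank lines included
    text = ("this is text followed by blank lines\n\n"
            "this is more text\n\n\n"
            "and even more here")

    buf = io.StringIO()
    csv.writer(buf).writerow([text])  # the field is quoted because it contains newlines

    buf.seek(0)
    rows = list(csv.reader(buf))
    print(len(rows))           # 1 -- still a single record
    print(rows[0][0] == text)  # True -- the blank lines survived the round trip
    ```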
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Good morning @SGolbert,

     

    OK, back on this puzzle. :) Going back to the test1.csv file and the 22 vs. 276 rows issue: I found pretty quickly that if you save the test1.csv file as an .xlsx file (in Excel) and then just use the Read Excel operator instead of Read CSV, you import the 22 rows with no problem:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="false" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="Execute Python" width="90" x="45" y="34">
    <parameter key="script" value="import pandas&#10;&#10;def rm_main():&#10; data = pandas.read_csv(&quot;/Users/GenzerConsulting/Desktop/test1.csv&quot;, encoding=&quot;utf-8&quot;)&#10; &#10; return data"/>
    </operator>
    <operator activated="false" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="136">
    <list key="annotations"/>
    <list key="data_set_meta_data_information"/>
    </operator>
    <operator activated="true" class="read_excel" compatibility="7.5.003" expanded="true" height="68" name="Read Excel" width="90" x="45" y="238">
    <parameter key="excel_file" value="/Users/genzerconsulting/Desktop/test1.xlsx"/>
    <parameter key="imported_cell_range" value="A1:A23"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="text.true.polynominal.attribute"/>
    </list>
    </operator>
    <connect from_op="Read Excel" from_port="output" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    You still have some funky accented-character issues, which are due to encoding problems as noted earlier, but I think it's clear that a key problem is that RapidMiner's default CSV row parsing (escape character) is not cutting off in the right places for your file. I am not a Python programmer, but I will play a bit more with RapidMiner's built-in features to see if I can find the right RegEx for this file:

     

    Screen Shot 2017-08-10 at 10.58.52 AM.png

     

    This has happened to me many many times.  Stay tuned...

     

    Scott

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    ok, I have a better handle on it. The issue is, as I suspected, that the column separator is not being caught correctly: RM is reading the intended column separator as "missing" rather than as a recognizable Unicode character. I played around with RegEx expressions for a while to try to catch it but could not do so. However, I was able to create a hack that does basically the same thing:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
    <parameter key="csv_file" value="/Users/genzerconsulting/Desktop/test1.csv"/>
    <parameter key="column_separators" value="\n"/>
    <list key="annotations"/>
    <list key="data_set_meta_data_information"/>
    </operator>
    <operator activated="true" class="replace_missing_values" compatibility="7.5.003" expanded="true" height="103" name="Replace Missing Values" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    <parameter key="default" value="value"/>
    <list key="columns"/>
    <parameter key="replenishment_value" value="&amp;&amp;&amp;&amp;&amp;"/>
    </operator>
    <operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents (2)" width="90" x="313" y="34">
    <parameter key="select_attributes_and_weights" value="true"/>
    <list key="specify_weights">
    <parameter key="text" value="1.0"/>
    </list>
    </operator>
    <operator activated="true" class="text:combine_documents" compatibility="7.5.000" expanded="true" height="82" name="Combine Documents (2)" width="90" x="447" y="34"/>
    <operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data (2)" width="90" x="581" y="34">
    <parameter key="text_attribute" value="text"/>
    <parameter key="add_meta_information" value="false"/>
    </operator>
    <operator activated="true" class="split" compatibility="7.5.003" expanded="true" height="82" name="Split" width="90" x="715" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    <parameter key="split_pattern" value="&amp;&amp;&amp;&amp;&amp;"/>
    </operator>
    <operator activated="true" class="transpose" compatibility="7.5.003" expanded="true" height="82" name="Transpose" width="90" x="849" y="34"/>
    <connect from_op="Read CSV" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
    <connect from_op="Replace Missing Values" from_port="example set output" to_op="Data to Documents (2)" to_port="example set"/>
    <connect from_op="Data to Documents (2)" from_port="documents" to_op="Combine Documents (2)" to_port="documents 1"/>
    <connect from_op="Combine Documents (2)" from_port="document" to_op="Documents to Data (2)" to_port="documents 1"/>
    <connect from_op="Documents to Data (2)" from_port="example set" to_op="Split" to_port="example set input"/>
    <connect from_op="Split" from_port="example set output" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    It's not pretty, but it does work. :) I'll keep poking around to see if I can get to the bottom of it. As mentioned earlier, fixing the encoding upstream so the data comes into RM nice and clean would probably be a smoother way to go.
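
    A rough Python analogue of the same hack, for readers following along (the sample lines are invented; the "&&&&&" marker mirrors the process's replacement value): treat the file as plain text, mark the blank lines with a placeholder, then split records on it.

    ```python
    # Lines as they would come out of reading the file one physical line at a time;
    # empty strings stand in for the blank lines RM reported as "missing" values.
    raw_lines = ["first piece of text", "", "second piece", "", "", "third piece"]

    PLACEHOLDER = "&&&&&"
    combined = " ".join(line if line else PLACEHOLDER for line in raw_lines)
    records = [r.strip() for r in combined.split(PLACEHOLDER) if r.strip()]
    print(records)  # ['first piece of text', 'second piece', 'third piece']
    ```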

     

    Scott

  • alan_jeffares Member Posts: 20 Contributor I

    Hi @sgenzer ,

     

    That is a pretty impressive solution you put together there. I'm still trying to figure out how it does it! 

     

    I think my original problem can be explained as follows (correct me if I'm wrong):

     

    1) A CSV file can only keep about 32,000 characters in a single cell (see my R code in the 9th post on this thread for a demonstration). This is a CSV problem, not a RapidMiner issue.

     

    2) This is the main issue: the Execute Python operator is not able to handle blank lines (see the Python example in my previous post). This is a RapidMiner issue, I think.

     

    3) I have some encoding issues in my data, which certainly haven't helped with the mess and may also be causing problems within RapidMiner operators. 

     

    Is this a fair assessment of the issue so far? I would strongly urge you to check out the example I highlighted in point 2).

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    thanks, @alan_jeffares - below is the same code with some annotations so you can see what I was doing with this "hack":

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" breakpoints="after" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="136">
    <parameter key="csv_file" value="/Users/genzerconsulting/Desktop/test1.csv"/>
    <parameter key="column_separators" value="\n"/>
    <list key="annotations"/>
    <list key="data_set_meta_data_information"/>
    </operator>
    <operator activated="true" breakpoints="after" class="replace_missing_values" compatibility="7.5.003" expanded="true" height="103" name="Replace Missing Values" width="90" x="179" y="136">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    <parameter key="default" value="value"/>
    <list key="columns"/>
    <parameter key="replenishment_value" value="&amp;&amp;&amp;&amp;&amp;"/>
    <description align="center" color="transparent" colored="false" width="126">take those rows that are marked missing (?) and replace them with &amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp; as a placeholder</description>
    </operator>
    <operator activated="true" breakpoints="after" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents (2)" width="90" x="313" y="136">
    <parameter key="select_attributes_and_weights" value="true"/>
    <list key="specify_weights">
    <parameter key="text" value="1.0"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">convert the dataset to a series of documents (one per row)</description>
    </operator>
    <operator activated="true" breakpoints="after" class="text:combine_documents" compatibility="7.5.000" expanded="true" height="82" name="Combine Documents (2)" width="90" x="447" y="136">
    <description align="center" color="transparent" colored="false" width="126">combine all the documents into one</description>
    </operator>
    <operator activated="true" breakpoints="after" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data (2)" width="90" x="581" y="136">
    <parameter key="text_attribute" value="text"/>
    <parameter key="add_meta_information" value="false"/>
    <description align="center" color="transparent" colored="false" width="126">put this one merged document back into a dataset with only one example and one attribute</description>
    </operator>
    <operator activated="true" breakpoints="after" class="split" compatibility="7.5.003" expanded="true" height="82" name="Split" width="90" x="715" y="136">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    <parameter key="split_pattern" value="&amp;&amp;&amp;&amp;&amp;"/>
    <description align="center" color="transparent" colored="false" width="126">split this one &amp;quot;cell&amp;quot; into 22 attributes using the &amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp; placeholder as the marker</description>
    </operator>
    <operator activated="true" breakpoints="after" class="transpose" compatibility="7.5.003" expanded="true" height="82" name="Transpose" width="90" x="849" y="136">
    <description align="center" color="transparent" colored="false" width="126">flip the dataset around so that it has one attribute and 22 rows</description>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
    <connect from_op="Replace Missing Values" from_port="example set output" to_op="Data to Documents (2)" to_port="example set"/>
    <connect from_op="Data to Documents (2)" from_port="documents" to_op="Combine Documents (2)" to_port="documents 1"/>
    <connect from_op="Combine Documents (2)" from_port="document" to_op="Documents to Data (2)" to_port="documents 1"/>
    <connect from_op="Documents to Data (2)" from_port="example set" to_op="Split" to_port="example set input"/>
    <connect from_op="Split" from_port="example set output" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
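    For anyone who prefers to read the idea in code rather than XML, here is a rough Python sketch of the same placeholder trick (not what the process above actually executes, just the logic of Replace Missing Values, Combine Documents, and Split in one function):

```python
def split_records(raw_text, placeholder="&&&&&"):
    """Rebuild one text record per original row, using blank lines as markers."""
    # Mark every blank line with the placeholder, as Replace Missing Values does.
    lines = [line if line.strip() else placeholder
             for line in raw_text.splitlines()]
    # Combine everything into one "document", then split on the placeholder.
    combined = "\n".join(lines)
    records = [r.strip() for r in combined.split(placeholder)]
    # Drop empty fragments (consecutive blank lines collapse into one separator).
    return [r for r in records if r]

sample = "row one text\n\nrow two text\n\n\nrow three text"
print(split_records(sample))
# ['row one text', 'row two text', 'row three text']
```

    Note that, like the process above, consecutive blank lines collapse into a single separator.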

    As for the Python vs. Execute Python issue, again I'm not a Python guy so I can't really go much further myself. I will try to pass this along to folks higher up the food chain and see if we can get more info.

     

    Scott

  • SGolbertSGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Hi guys,

     

    I still cannot identify the issue, sorry. I tried the text file with blank lines and I see no problem using Read CSV; it treats several consecutive blank lines as one blank line.

     

    Regarding the length of a CSV cell, there is no limit. I provided an example with a field of over 150k characters and it works, even when setting the polynominal type.
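    This is easy to check with Python's standard csv module. The 32k figure people usually run into is most likely Excel's per-cell limit (32,767 characters), not anything in the CSV format itself. One thing to keep in mind: Python's csv reader does impose its own configurable field size limit (131,072 characters by default), which has to be raised for very long fields:

```python
import csv
import io

# Python's csv reader caps fields at 131072 chars by default; raise it first.
csv.field_size_limit(1_000_000)

big = "Longword " * 16700   # roughly 150k characters, like the R example
buf = io.StringIO()
csv.writer(buf).writerow([big, "A", "1"])
buf.seek(0)
row = next(csv.reader(buf))
print(len(row[0]))          # 150300: nothing was truncated
```

    So the format itself places no limit on cell length; truncation comes from whatever tool is displaying or re-parsing the file.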

     

    I've looked at the test1.csv file in my text editor, and I cannot see where the separation between records is supposed to be, so I don't expect the operator to be able to find it either.

     

    So once again my conclusion is that the problem lies in the data preparation.

     

    Kind regards,

    Sebastian
