
"Write Database troubles"

colocolo Member Posts: 236 Maven
edited May 2019 in Help
Hi,

since the operator "Append" (for combining results from "Loop Examples") seems to need more memory than is available, I changed my output strategy from one final big Excel file to incremental database writing inside the loop. I took "Write Database", set the overwrite mode to 'overwrite first, append then', configured the DB connection (MySQL), set the table name and started the process. One attribute ('description') generated content too long for varchar, so I converted it to text and the output started.

After writing some rows I got the next error message, saying that a certain attribute value was too long for its column. It turned out that all varchar columns are declared with the length of the first example's values. The 'set default varchar length' parameter looked like a promising solution, so I set the default length to 250 and tried another run. Oops, now the error message said that the 'description' attribute value was too long for the default varchar length, although that attribute should be of type text anyway. On top of that, after setting the default length, the value for 'table name' was cleared every time I revisited the "Write Database" operator, and I found no way to get it back without removing the operator and adding a fresh one.

So this whole thing seems a bit buggy. For the current run I am using type 'text' for all my attributes, but that solution isn't too satisfying... :(
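A workaround I am considering for the next run is to create the target table by hand with generous column types and then let "Write Database" only append to it, so the operator never has to guess varchar lengths from the first example. A rough JDBC sketch (table and column names are purely made up, not from my real process):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PrepareTargetTable {
    public static void main(String[] args) throws Exception {
        // connection details are placeholders for a local MySQL instance
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/rm_results", "user", "password");
             Statement stmt = con.createStatement()) {
            // generous VARCHAR for short strings, MEDIUMTEXT for long ones,
            // so later examples cannot exceed the column length
            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS loop_results ("
                    + "id INT, "
                    + "title VARCHAR(250), "
                    + "description MEDIUMTEXT)");
        }
    }
}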

Can someone confirm these problems? Any suggestions?
I am using the latest development version available via Subversion.

Thanks and best regards,
Matthias

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    thanks for the hint. I cannot reproduce this behavior of the operator with the current developer version, so you will have to wait for the next update. If it still occurs then, please file a bug report.

    Greetings,
      Sebastian
  • adaman Member Posts: 17 Contributor II
    I can also confirm this bug with version 5.1.6.

    Greetings
  • colocolo Member Posts: 236 Maven
    Hi,

    sorry for digging up this old topic, but the issue is related to the previous bug.

    The initial problem seems to be solved, but now that the parameter "set default varchar length" is accepted, it seems to be ignored. I just tried it in a small process chain of "Generate Nominal Data" and "Write Database". It doesn't make any difference whether the mentioned parameter is checked or not; the varchar columns always get the length of the longest example value for that column.
    Since I am picking up my old task and writing single examples in a loop, I will face the same problems again: as soon as a value is longer than the previous ones, the data won't fit into the column. It would be great if you could check this issue. I guess solving it isn't too hard, but if you experience the same problems, maybe I won't have to fix it myself ;)

    Regards
    Matthias
  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,995 RM Engineering
    Hi Matthias,

    Problem confirmed and now fixed ;)
    I must say I love bugs which are so easy to fix.. :D
    Should be included in the next release.

    Regards,
    Marco
  • dan_agape Member Posts: 106 Maven
    Hi Marco, Matthias, All,

    Marco, I wonder whether you've had a chance to look at how the parameter "set default varchar length" of the Write Database operator currently works with other DBMSs, in particular PostgreSQL, since your fix. So far this operator has not worked properly with PostgreSQL in my processes for dataset attributes containing strings of more than 9 characters. This problem was posted on the forum some time ago ( http://rapid-i.com/rapidforum/index.php/topic,2008.0.html ), but unfortunately none of the solutions suggested there worked. Has the RM team tested RM with PostgreSQL successfully? Have other users perhaps employed RM with PostgreSQL successfully? There was another issue with the Stream Database operator used in conjunction with this DBMS - see past postings.

    Matthias (and other interested users), the question below is not related to the problem mentioned above, but it is relevant to the performance of writing to databases from RM. I wonder what performance you got when writing large datasets to your databases; some input would be appreciated. In my applications, writing data to a MySQL database was rather slow. Concretely, a simple test process like the following, which generates a 5 million row dataset and writes it to a table, takes about 55 minutes to complete. I got this time on two quite resourceful setups, so hardware was not a significant limitation: first a local desktop (Win7 64bit, i7, 16GB) running both RM and MySQL, and secondly the same desktop running RM against a separate database server (Linux, Xeon, 16GB) running MySQL.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.008">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
       <process expanded="true" height="145" width="346">
         <operator activated="true" class="generate_churn_data" compatibility="5.1.008" expanded="true" height="60" name="Generate Churn Data" width="90" x="45" y="30">
           <parameter key="number_examples" value="5000000"/>
         </operator>
         <operator activated="true" class="write_database" compatibility="5.1.008" expanded="true" height="60" name="Write Database" width="90" x="179" y="30">
           <parameter key="connection" value="linserver_mysql"/>
           <parameter key="table_name" value="RM_testinsertspeed"/>
           <parameter key="overwrite_mode" value="overwrite first, append then"/>
         </operator>
         <connect from_op="Generate Churn Data" from_port="output" to_op="Write Database" to_port="input"/>
         <connect from_op="Write Database" from_port="through" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
    I suggested a possible optimisation at http://rapid-i.com/rapidforum/index.php/topic,3944.0.html , but it would be interesting to see what other users have experienced in terms of database writing time for large datasets.
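    Independently of the details in that post, and purely for illustration (plain JDBC, made-up column names for the churn data, and certainly not the operator's real code), batched inserts of the kind sketched below are typically much faster than sending one INSERT per row:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class BatchedInsertSketch {
        public static void main(String[] args) throws Exception {
            // rewriteBatchedStatements lets Connector/J rewrite the batch
            // into multi-row INSERT statements on the wire
            String url = "jdbc:mysql://localhost:3306/rm_results"
                    + "?rewriteBatchedStatements=true";
            try (Connection con = DriverManager.getConnection(url, "user", "password")) {
                con.setAutoCommit(false);
                String sql = "INSERT INTO RM_testinsertspeed (label, var1) VALUES (?, ?)";
                try (PreparedStatement ps = con.prepareStatement(sql)) {
                    for (int i = 0; i < 5000000; i++) {
                        ps.setString(1, i % 2 == 0 ? "loyal" : "churn");
                        ps.setDouble(2, Math.random());
                        ps.addBatch();
                        if (i % 10000 == 0) {
                            ps.executeBatch(); // flush every 10k rows
                        }
                    }
                    ps.executeBatch();
                    con.commit();
                }
            }
        }
    }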

    Regards,
    Dan
  • colocolo Member Posts: 236 Maven
    Hi Dan,

    I'm afraid I can't give you any useful information here. My tasks generate well under 100k rows each, written to a single table, so I'm fine with a simple MySQL database. I haven't experienced any problems at that scale, but it is pretty far away from your amount of data.

    @Marco: Thanks for fixing the issue quickly. I also wish it always was that easy ;)

    Best regards
    Matthias
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Hi Dan,

    Yes, I can confirm similar timings to the ones you experienced. Importing 15 million rows from a CSV file into a MySQL table took just over 3 hours.
    God knows how long it'll take when I get to importing my 150+ million row tables!  :o
    (Yeah, Execute SQL operator combined with LOAD DATA LOCAL INFILE is probably wiser in my case)
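    For the record, a standalone sketch of the kind of thing I mean (file path, table name and CSV layout are all placeholders); the same LOAD DATA statement could just as well go into the Execute SQL operator:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class BulkLoadSketch {
        public static void main(String[] args) throws Exception {
            // newer Connector/J versions need allowLoadLocalInfile=true for LOCAL
            String url = "jdbc:mysql://localhost:3306/rm_results?allowLoadLocalInfile=true";
            try (Connection con = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = con.createStatement()) {
                stmt.execute("LOAD DATA LOCAL INFILE '/data/big_table.csv' "
                        + "INTO TABLE big_table "
                        + "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
                        + "LINES TERMINATED BY '\\n' "
                        + "IGNORE 1 LINES");
            }
        }
    }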

    Your suggestion sounds like it would improve things, but I don't claim to be any kind of database expert, so I am also wondering what other forum members have experienced.

    Best regards,
    JEdward.

  • colocolo Member Posts: 236 Maven
    Hi Marco,

    I have now run into a new issue related to the problem fixed before.

    As far as I know, the column types of the database table are chosen according to the attribute types in RapidMiner; type text results in MEDIUMTEXT for a MySQL database. But when the default varchar length is set, all attributes are checked against that length, including attributes that do not map to varchar columns. My example set contains an attribute of type text with longer contents. It is claimed to violate the length restriction, but when I cut the content down to a shorter value, the column is in fact created as MEDIUMTEXT, as it should be. So I guess the length check has to be restricted to varchar columns only.

    This is the error message generated:
    com.rapidminer.operator.UserError: Database error occurred: Attribute Content contains values with length >255 which is the requested default varchar length.
    at com.rapidminer.operator.io.DatabaseExampleSetWriter.write(DatabaseExampleSetWriter.java:112)
    It would be nice if you had a look at this.
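    Logically I would expect the fix to go roughly in this direction (just a self-contained sketch with made-up types, not the actual DatabaseExampleSetWriter code):

    public class VarcharLengthCheckSketch {

        enum ColumnType { VARCHAR, MEDIUMTEXT, INTEGER, DOUBLE }

        static void checkLength(String attributeName, ColumnType targetType,
                                int maxValueLength, int defaultVarcharLength) {
            // skip the check entirely for columns that are not VARCHAR
            if (targetType != ColumnType.VARCHAR) {
                return;
            }
            if (maxValueLength > defaultVarcharLength) {
                throw new IllegalArgumentException("Attribute " + attributeName
                        + " contains values with length >" + defaultVarcharLength
                        + " which is the requested default varchar length.");
            }
        }

        public static void main(String[] args) {
            // a MEDIUMTEXT column with long values must pass without complaint
            checkLength("Content", ColumnType.MEDIUMTEXT, 12000, 255);
            // a VARCHAR column that is too long should still be rejected
            try {
                checkLength("Title", ColumnType.VARCHAR, 300, 255);
            } catch (IllegalArgumentException expected) {
                System.out.println(expected.getMessage());
            }
        }
    }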

    Best regards
    Matthias
  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,995 RM Engineering
    Hi,

    please post a bug report here.
    That's something I'd rather not fix in a hurry..

    Regards,
    Marco
  • colocolo Member Posts: 236 Maven
    Hi Marco,

    a bug report has been created. I will probably try to fix this or build a workaround myself, since I will need it within the next two weeks. If I find a reasonable solution, I will add it to the report.

    Best regards
    Matthias
  • dan_agape Member Posts: 106 Maven
    Hi there,

    I was just wondering whether Marco or anyone else from the RM team has had a chance to look into the previously asked questions on RM + PostgreSQL.

    Thanks,
    Dan
    Quoting the earlier questions:
    "Has the RM team tested RM with PostgreSQL successfully? Have other users perhaps employed RM with PostgreSQL successfully? There was another issue with the Stream Database operator used in conjunction with this DBMS - see past postings."
  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,995 RM Engineering
    Hi dan_,

    I'm sorry I actually did overlook your previous post  :-[
    You are correct, there seems to be some kind of issue with PostgreSQL and RapidMiner, as my quick test of writing to a PostgreSQL db also failed for columns with >9 chars. We will look into the issue.
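    Until then, one quick way to check which column types the operator actually created on the PostgreSQL side is to read them from information_schema (a plain JDBC sketch; the table name is a placeholder):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class InspectColumnsSketch {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/rm_results", "user", "password");
                 PreparedStatement ps = con.prepareStatement(
                     "SELECT column_name, data_type, character_maximum_length "
                     + "FROM information_schema.columns WHERE table_name = ?")) {
                ps.setString(1, "rm_test_table");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // print each column with its type and declared length
                        System.out.println(rs.getString(1) + "  " + rs.getString(2)
                                + "  " + rs.getString(3));
                    }
                }
            }
        }
    }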

    Regards,
    Marco