Due to recent updates, all users are required to create an Altair One account to login to the RapidMiner community. Click the Register button to create your account using the same email that you have previously used to login to the RapidMiner community. This will ensure that any previously created content will be synced to your Altair One account. Once you login, you will be asked to provide a username that identifies you to other Community users. Email us at Community with questions.
number of rows in write database operator
Hi,
Perhaps this has been considered, but just in case:
Having tried to write large scored example sets in a database using the write database operator, took quite long in some application I have. For instance writing 5 million rows in a MySQL database took about one hour. In addition to tuning the database itself, some optimisation on RM's side may be possible and very welcome. For instance writing an example set in a database table may be speeded up if rows/examples are written in groups (i.e. each group of examples is written via the same INSERT SQL command). If there is one row per INSERT only, time for connection, time for sending the query to the database server, and time for parsing the query are unnecessarily spent. If several rows are grouped per one INSERT command, comparable times are spent per group of rows instead of per individual row.
So it would be useful to add a parameter "number of rows" to the write database operator showing how many rows are to be grouped per INSERT command. If some DBMSs do not support several rows per INSERT, then at least this number can consist of how many individual INSERT commands (with one row per command) are to be grouped in the same SQL transaction, so anyway an optimisation would result.
Dan
Perhaps this has been considered, but just in case:
Having tried to write large scored example sets in a database using the write database operator, took quite long in some application I have. For instance writing 5 million rows in a MySQL database took about one hour. In addition to tuning the database itself, some optimisation on RM's side may be possible and very welcome. For instance writing an example set in a database table may be speeded up if rows/examples are written in groups (i.e. each group of examples is written via the same INSERT SQL command). If there is one row per INSERT only, time for connection, time for sending the query to the database server, and time for parsing the query are unnecessarily spent. If several rows are grouped per one INSERT command, comparable times are spent per group of rows instead of per individual row.
So it would be useful to add a parameter "number of rows" to the write database operator showing how many rows are to be grouped per INSERT command. If some DBMSs do not support several rows per INSERT, then at least this number can consist of how many individual INSERT commands (with one row per command) are to be grouped in the same SQL transaction, so anyway an optimisation would result.
Dan
0
Answers
For users that want to employ this, see the "batch size" parameter, whose value is the maximum number of rows to write at once in the database via the same INSERT command. The feature considerably increases the speed of writing a large/very many row dataset in a database table.
Dan
When this parameter is set to anything but 1, I get this error message: "Database error occurred: ORA-00933: SQL command not properly ended".
It seems the code might not add the semi-colons at the end of the lines or something.
Great work with this tool though!
I successfully tested the feature for MySQL. It would be expected for some DBMS the feature not to work for now,
but I guess the RM team will fix it for Oracle (or other very popular DBMS if needed) at some point.
Anyway this feature is a relief for the guys working with very large databases (so I guess other optimisations as suggested above may follow).
http://rapid-i.com/rapidforum/index.php?PHPSESSID=8df44e17a8757da9149552265c34f044&;topic=2818.msg14886#msg14886
Dan