Should I use a database or not?

siamak_want Member Posts: 98 Contributor II
edited November 2018 in Help
Hi RM forum,

I am working on a relatively large text data set (a CSV file) with about 300,000 records and about 40,000 columns. I read the data with the "CSV Reader" operator and then process it with classification and clustering operators. The problem is that, as far as I know, the "CSV Reader" operator first loads the entire CSV file into main memory. The file is several gigabytes and occupies more than half of my main memory, so many algorithms starve for lack of memory.

As a workaround, I converted my data set to the fantastic "RM sparse format". I don't know exactly how the "sparse reader" delivers the data, but I hoped this would alleviate the shortage of main memory. Nevertheless, the problem still exists.

Now, to handle the problem of insufficient main memory, I have a new idea: using a database (such as PostgreSQL) :)! I think a database could fetch an appropriate number of examples into main memory and deliver them to the other operators (e.g. the classification operators). Please tell me whether I am right about this. I would also like to know the advantages and disadvantages of using a database versus plain files.
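
For illustration, this is roughly what I have in mind, sketched in plain Python with psycopg2 rather than RM operators; the table name "documents" and the process() function are just placeholders:

    # Sketch: stream rows from PostgreSQL in fixed-size batches instead of
    # loading the whole table at once.
    import psycopg2

    def process(batch):
        ...  # stand-in for the real classification / clustering step

    conn = psycopg2.connect("dbname=mydata user=me")
    # A named cursor is a server-side cursor: rows stay in the database
    # until they are fetched, so only one batch is in memory at a time.
    cur = conn.cursor(name="doc_stream")
    cur.execute("SELECT * FROM documents")

    while True:
        batch = cur.fetchmany(10000)   # 10,000 examples at a time
        if not batch:
            break
        process(batch)

    cur.close()
    conn.close()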

Any help would be appreciated.

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    I don't see any disadvantages of databases vs. text files, and you already stated the disadvantages of text files. So for any reasonable amount of data I would suggest importing the data into a database.

    Btw, the sparse data format only works well if you have a lot of zeros in your data - otherwise it will use even more memory than the standard data format.
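
    As a rough back-of-the-envelope check (plain Python, using a generic CSR-style layout rather than RapidMiner's internal format, and an assumed density of 0.1%):

        # Dense vs. sparse storage for a 300,000 x 40,000 matrix.
        rows, cols, density = 300_000, 40_000, 0.001     # density is assumed

        dense_bytes = rows * cols * 8                    # 8 bytes per double cell
        nnz = int(rows * cols * density)                 # number of non-zero cells
        sparse_bytes = nnz * (8 + 4) + (rows + 1) * 4    # value + column index + row pointers

        print(f"dense : {dense_bytes / 1e9:.1f} GB")     # ~96 GB
        print(f"sparse: {sparse_bytes / 1e9:.2f} GB")    # ~0.15 GB at 0.1% density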

    Best, Marius
  • siamak_want Member Posts: 98 Contributor II
    Hi Marius,

    Thanks for your straightforward answer (as always). So I will definitely use a database, as you suggested. But I think there is an important issue here:

    A DBMS limits the number of columns per table (PostgreSQL, for example, allows at most around 1,600). I am working on a text data set, so I need to deal with about 80,000 features, which does not fit into a regular database table. Do you know any workaround?

    Another thing I am curious about: what should I do when the table is so big that it does not fit into main memory at all? Is the DBMS or RM able to load the data incrementally?

    Thanks again Marius.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    If you have fewer examples than features, you could transpose the dataset before storing it in the DB and re-transpose it after loading.
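
    In plain Python the idea looks roughly like this (a pandas sketch; the in-memory SQLite connection and the table name features_t are only stand-ins for your real database):

        # Store a wide example set transposed, re-transpose it after loading.
        import sqlite3
        import pandas as pd

        conn = sqlite3.connect(":memory:")     # stand-in for the real DB

        # Toy data: 3 examples, 5 features.
        wide = pd.DataFrame({f"term_{i}": [0.0, 1.0, 2.0] for i in range(5)})

        # Store transposed: each of the (many) columns becomes a row.
        transposed = wide.T
        transposed.columns = [f"doc_{c}" for c in transposed.columns]
        transposed.to_sql("features_t", conn, if_exists="replace")

        # Later: load the table and transpose back to the original orientation.
        restored = pd.read_sql("SELECT * FROM features_t", conn, index_col="index").T
        print(restored.shape)                  # (3, 5) -- original shape again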

    For big tables you can of course read only parts of the table from the DB and work on the subset (which is the usual way of dealing with large datasets: learn on a subset which fits into memory, then apply the model to the large dataset incrementally).
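
    For example, you could let the database draw the sample, so that only a subset which fits into memory ever reaches the client (a pandas sketch; the connection string and the table name "documents" are made up):

        import pandas as pd
        from sqlalchemy import create_engine

        engine = create_engine("postgresql://me@localhost/mydata")
        # PostgreSQL draws 10,000 random rows; only these travel to the client.
        sample = pd.read_sql(
            "SELECT * FROM documents ORDER BY random() LIMIT 10000", engine)
        # ... learn on `sample` here, then apply the model to the rest in chunks.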

    For reading data incrementally from the DB there is the Read Database (Incrementally) operator in RapidMiner, but I have to admit that I have not used it so far. Usually I just load the subset of data I need manually with the normal Read Database operator.

    Best, Marius
  • siamak_want Member Posts: 98 Contributor II
    Hi Marius,

    I think you raised a very interesting point:
    "learn on a subset which fits into memory, then apply to large datasets incrementally"
    Would you please explain this a bit more? Do you mean I should use a sampling method to select some examples, build the model on them, and then apply it to the whole data set? And what about clustering?

    Thanks in advance.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    siamak_want wrote:

    would you please explain more? Do you mean I should use a sampling method to select some examples manually and then I should apply it on all of the data set? Then how about clustering?
    Basically yes. If you choose a clustering algorithm that produces a cluster model (e.g. k-Means or any other centroid-based clustering), you can (see the sketch after this list):
    - create a representative sample of your data which fits into main memory
    - create the cluster model and store it in the repository
    - create a process which loads the model, iteratively reads parts of your data, applies the model and writes the result back to the database
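
    Outside of RapidMiner, the same three steps look roughly like this with scikit-learn (connection string and table names are made up, and all columns are assumed to be numeric feature columns):

        import joblib
        import pandas as pd
        from sklearn.cluster import KMeans
        from sqlalchemy import create_engine

        engine = create_engine("postgresql://me@localhost/mydata")

        # 1) a representative sample that fits into main memory
        sample = pd.read_sql(
            "SELECT * FROM documents ORDER BY random() LIMIT 10000", engine)

        # 2) build the cluster model on the sample and store it
        model = KMeans(n_clusters=20, random_state=0).fit(sample)
        joblib.dump(model, "kmeans_model.joblib")

        # 3) read the data chunk by chunk, apply the model, write the result back
        for chunk in pd.read_sql("SELECT * FROM documents", engine, chunksize=10_000):
            chunk["cluster"] = model.predict(chunk)
            chunk.to_sql("documents_clustered", engine,
                         if_exists="append", index=False)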

    Hope this helps,
    Marius
  • siamak_want Member Posts: 98 Contributor II
    Thanks, Marius!