Best input data format for large data sets?

harri678 Member Posts: 34 Contributor II
edited November 2018 in Help

I wanted to ask what's the recommended import format for large datasets?

My dataset has the following specs:
- 36000 samples altogether, split into 5 groups of 7200 samples each
- timestamp = id, integer label
- theoretical maximum of 1,200,000 integer attributes (for now a subset of about 5000 has been chosen, but more would be better)
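
For a sense of scale, here is a rough back-of-the-envelope estimate of what these specs would occupy as a fully dense in-memory table. It is only a sketch: the 8 bytes per value is an assumption (one double per cell), and `dense_size_gb` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
def dense_size_gb(n_samples, n_attributes, bytes_per_value=8):
    """Estimate the size of a dense table in decimal gigabytes.

    bytes_per_value=8 is an assumption (one double per cell);
    the real per-value overhead depends on the tool and storage type.
    """
    return n_samples * n_attributes * bytes_per_value / 1e9


# Current subset: 36000 samples x about 5000 attributes
print(dense_size_gb(36_000, 5_000))

# One CSV file (7200 samples) with all 1,200,000 attributes
print(dense_size_gb(7_200, 1_200_000))

# Full data set with all 1,200,000 attributes
print(dense_size_gb(36_000, 1_200_000))
```

Under that assumption, even a single 7200-sample file at full width would be far larger than 7 GB when stored densely.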

Currently I am using an "import" process which does:
- CSV import (one CSV file for 7200 samples)
- define roles
- some normalization
- "write binary"

The binary files are re-read in the classification process, because that is faster than parsing all the CSVs every time. My problem is that if I increase the number of attributes in the CSV, the "import" process eats up all the memory (7 GB) and dies. I also experimented with "Free Memory", but it didn't help.

My question now is: is there a better format than CSV for large datasets that can still be processed directly at a decent speed, so that I can maybe drop this import step? What would you recommend?



  • RalfKlinkenberg Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member, Unconfirmed, University Professor Posts: 68 RM Founder
    Hi Harri,

    If your data is sparse (many zero values and significantly fewer non-zero attribute values), you may want to try the sparse file and data formats. They store only the non-zero values and hence are the preferred representation for sparse data sets such as large text collections.
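
To illustrate the idea of storing only the non-zero values, here is a minimal sketch in plain Python. The function names and the `index:value` text layout are assumptions chosen for illustration; they are not the exact syntax of RapidMiner's sparse file format.

```python
def to_sparse(row):
    """Keep only the non-zero entries of a dense row as {column_index: value}."""
    return {i: v for i, v in enumerate(row) if v != 0}


def to_dense(sparse_row, width):
    """Rebuild the dense row; every column not stored is implicitly zero."""
    row = [0] * width
    for i, v in sparse_row.items():
        row[i] = v
    return row


def format_sparse_line(sparse_row):
    """Render a sparse row as 'index:value' pairs (illustrative layout only)."""
    return " ".join(f"{i}:{v}" for i, v in sorted(sparse_row.items()))


dense = [0, 0, 7, 0, 0, 0, 3, 0]
sparse = to_sparse(dense)

print(sparse)                      # {2: 7, 6: 3}
print(format_sparse_line(sparse))  # 2:7 6:3

# The round trip loses nothing, but only 2 of the 8 values were stored.
assert to_dense(sparse, len(dense)) == dense
```

The saving grows with the share of zeros: a row with a million attributes and a few hundred non-zero values stores a few hundred pairs instead of a million cells.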

    Best regards,
  • harri678 Member Posts: 34 Contributor II

    I got it working with the Read AML operator and sparse storage. Thanks!

    Greetings, Harald
  • wessel Member Posts: 537 Maven
    There seems to be a big improvement in version 5 compared to version 4 when reading data.
    Version 5 is much faster. So download version 5.