best input data format for large data sets?

harri678 · March 2010

Hi,

I wanted to ask what's the recommended import format for large datasets?

My dataset has the following specs:
- 36000 samples altogether, split into 5 groups of 7200 samples each
- timestamp = id, integer label
- theoretical maximum of 1,200,000 integer attributes (for now a subset of about 5000 has been chosen, but more would be better)

Currently I am using an "import" process which does:
- CSV import (one CSV file for 7200 samples)
- define roles
- some normalization
- "write binary"

The binary files are re-read in the classification process, because that's faster than parsing all the CSVs every time. My problem is that if I increase the number of attributes in the CSV, the "import" process eats up all the memory (7 GB) and dies. I also experimented with "Free Memory", but it didn't help.

My question now is: is there a better format than CSV for large datasets that can still be processed directly at decent speed, so that I can maybe drop this import step? What would you recommend?

Thanks,
Harald

RalfKlinkenberg · March 2010

Hi Harri,

If your data is sparse (many zero and far fewer non-zero attribute values), you may want to try the sparse file and data formats. They store only the non-zero values and hence are the preferred representation for sparse data sets such as large text collections.
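
To illustrate the idea of storing only the non-zero values, here is a minimal Python sketch that streams a dense CSV row by row and emits one sparse line per sample. The `index:value` layout, the column order (id first, label second), and the function names `dense_row_to_sparse` and `csv_to_sparse_lines` are assumptions made for this example; this is not the exact sparse file format that RapidMiner reads, so check the operator documentation for the real layout.

```python
import csv
import io


def dense_row_to_sparse(values):
    """Return 'index:value' tokens for the non-zero entries (1-based indices)."""
    return [f"{i}:{v}" for i, v in enumerate(values, start=1) if v != 0]


def csv_to_sparse_lines(csv_file, id_col=0, label_col=1):
    """Stream a dense CSV and yield one sparse text line per sample.

    Only one row is held in memory at a time, so memory use does not
    grow with the number of samples.
    """
    reader = csv.reader(csv_file)
    for row in reader:
        sample_id, label = row[id_col], row[label_col]
        # Every column that is not the id or the label is an integer attribute.
        attrs = [int(x) for i, x in enumerate(row) if i not in (id_col, label_col)]
        yield " ".join([sample_id, label] + dense_row_to_sparse(attrs))


# Tiny made-up example: timestamp id, integer label, then five attributes.
dense = io.StringIO("1267401600,1,0,0,7,0,3\n1267401601,0,0,0,0,0,0\n")
sparse_lines = list(csv_to_sparse_lines(dense))
for line in sparse_lines:
    print(line)
```

With mostly-zero rows, each output line holds only a handful of tokens, regardless of how many attribute columns the dense CSV has.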

Best regards,
Ralf

harri678 · March 2010

Hi,

I managed it with the Read AML Operator and sparse storage. Thanks!

Greetings, Harald

wessel · March 2010

Reading data seems to have improved a lot in version 5 compared to version 4.
Version 5 is much faster, so download version 5.
