
How to work with a very large .csv file?

ncjanes Member Posts: 3 Contributor I
edited December 2018 in Help

Hi folks, it looks like there have been a similar post or two in the past, but they're years old at this point, so I thought it would be helpful to refresh the topic...

 

I need to load a huge .csv file (4.72GB, ~23MM lines) and break it up into smaller .csv files according to one polynominal attribute. This is public State of Texas data, so the attribute I want to split on is "County", and I want those new .csv files written out to my local disk.

 

What's the best, most computationally efficient way to do this? 

 

Thanks!!

 

cc @sgenzer


Answers

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist

    Hi @ncjanes,

     

    This is an interesting topic. You have several options; here are some ideas off the top of my head.

    1. Shell script: split the file by the value of the county column, a2 in this example (a Python equivalent is sketched after this list)
      # appends each row of file.csv to a file named after its second field;
      # the header line is treated like any other row, so it ends up in a file of its own
      while IFS=, read -r a1 a2 a3 a4; do
        echo "$a1,$a2,$a3,$a4" >> "$a2".csv
      done < file.csv
      Please refer to this post for examples
    2. RDB (relational database): load the CSV data into a MySQL or PostgreSQL table and run in-database SQL queries to select the subsets
    3. Scale up: if you have access to an environment with plenty of RAM, you can run a RapidMiner process that loads the big CSV and splits the tabular data with (a) the "Loop Values" operator (enable parallel execution to make it run faster); inside the loop, use the iteration macro with "Filter Examples" to find each subset and then "Write CSV" with a macro-defined file name to store the individual small CSV files; or (b) the "Group Into Collection" operator with "Loop Collection", writing each example set to a separate CSV inside the loop
    4. Scale out: HDFS or Spark RDDs are popular solutions nowadays for really huge data (4.72GB is not that big)
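
    For illustration, a minimal Python/pandas sketch of the same split done with chunked reads, so the whole 4.72GB never has to sit in memory at once; the file name "texas.csv", the column name "County" and the chunk size are assumptions you would adjust:
      # minimal sketch, assuming pandas; file name, column name and chunk size are placeholders
      import pandas as pd

      written = set()  # counties that already have an output file with a header

      for chunk in pd.read_csv("texas.csv", chunksize=500_000):
          for county, rows in chunk.groupby("County"):
              # append this chunk's rows; write the header only the first time
              rows.to_csv(f"{county}.csv", mode="a", index=False,
                          header=county not in written)
              written.add(county)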

    HTH!

     

    YY

     

  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi, 

     

    Just my 5 cents... I would say the most computationally efficient way is the second one suggested by YY.

    You may break it down into two parts actually: 

     

    1) Use a simple Python (or whatever else) script to read the arbitrarily large file in chunks and save them into a SQL table (a rough sketch follows after these two steps); see the Python example here: http://odo.pydata.org/en/latest/perf.html

    2) Then run SQL queries on the subsets, using any SQL-related tool or RapidMiner.
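
    Purely as an illustration, a rough sketch of both steps using SQLite instead of a full database server; the file, table and column names ("texas.csv", "permits", "County") are placeholders:
      # rough sketch, assuming pandas and the standard-library sqlite3 module
      import sqlite3
      import pandas as pd

      con = sqlite3.connect("texas.db")

      # 1) stream the large CSV into a SQL table chunk by chunk
      for chunk in pd.read_csv("texas.csv", chunksize=500_000):
          chunk.to_sql("permits", con, if_exists="append", index=False)

      # 2) query one subset at a time (from any SQL tool or RapidMiner's Read Database)
      harris = pd.read_sql_query("SELECT * FROM permits WHERE County = 'Harris'", con)
      harris.to_csv("Harris.csv", index=False)
      con.close()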

     

    This approach also keeps the data available for any future processing, including use in different RM processes.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    Hi,

     

    this reminds me of an operator I wanted to implement: Stream Lines.

    Basically, it reads a file line by line and gives it to you as a document. You could then filter it and move forward.
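
    In plain Python the idea would look roughly like the sketch below (the operator itself does not exist yet, and the file name and column position are assumptions):
      # concept only: stream the file and keep just the rows for one county,
      # assuming the county sits in the second comma-separated field
      def stream_matching_lines(path, county):
          """Yield lines whose second CSV field equals the given county."""
          with open(path, encoding="utf-8") as handle:
              next(handle)  # skip the header line
              for line in handle:
                  if line.split(",")[1] == county:
                      yield line

      # e.g. write the Travis rows without ever loading the whole file
      with open("Travis.csv", "w", encoding="utf-8") as out:
          for row in stream_matching_lines("texas.csv", "Travis"):
              out.write(row)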

     

    Now I just need to find the time to implement this.

     

    @kypexin: Do you think this is useful?

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi Martin, 

     

    Well, some time ago I was asking more or less the same question about processing a long CSV: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Partial-retrieve-of-example-set/m-p/42320

     

    So the kind of operator you describe would be really helpful in my opinion. At least it would give an alternative option for handling such data.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    Hi @kypexin,

    well, doing something like this on a repository item is harder, since those are either DB tables or serialized Java objects. Doing it on flat files is easier.

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    But this is still a good alternative solution for the problem :) So I think I would use such an operator!

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I think it would be a very useful operator to add! I still miss the similar "Stream DB" operator, which used to exist but was deprecated for some unknown reason back in v7 somewhere...

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    The most elegant and RapidMiner-like solution would be a two-part operator like Handle Exception.

    In the left part you could put an operator returning an example set (like Read Database, Read CSV, Read Hive). The output would be processed by the right part in batches of selectable size.

    The input operators could be updated to know about the batch size (internal parameter?) and would adhere to it, saving memory in the process. Operators that aren't yet updated would just return everything, so this would be like Loop Batches with the data source outside. 
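
    Just to illustrate the batching idea in plain code (this has nothing to do with the actual operator API, which does not exist yet): a source yields rows and the second part consumes them in batches of a selectable size.
      # hypothetical sketch of batch-wise consumption; the batch size is arbitrary
      from itertools import islice

      def in_batches(rows, batch_size):
          """Group an iterable of rows into lists of at most batch_size rows."""
          rows = iter(rows)
          while batch := list(islice(rows, batch_size)):
              yield batch

      # e.g. handle 10,000 lines at a time without holding the whole file in memory
      with open("texas.csv", encoding="utf-8") as handle:
          for batch in in_batches(handle, 10_000):
              pass  # process the batch here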

     

    Balázs

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    @BalazsBarany,

    that would require that the user specifies the batching manually on the left-hand side?

    If yes, can't you do this with a normal Loop operator?

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    As long as the input operators don't have the optional batch size, this is just an elegant frontend for our existing loops or Loop Batches.

     

    My idea is more futuristic.
