
How to work with a very large .csv file?

ncjanes Member Posts: 3 Contributor I
edited December 2018 in Help

Hi folks, it looks like there have been a similar post or two in the past, but they're years old at this point, so I thought it would be helpful to refresh the topic...

 

I need to load a huge .csv file (4.72GB, ~23MM lines) and break it up into smaller .csv files according to one polynominal attribute. This is public State of Texas data, so the attribute I want to split on is "County", and I want those new .csv files written out to my local disk.

 

What's the best, most computationally efficient way to do this? 

 

Thanks!!

 

cc @sgenzer


Answers

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist

    Hi @ncjanes,

     

    This is an interesting topic. You have several options; here are some ideas off the top of my head.

    1. Shell script: split the file by the value of the county column, a2 in this example (a Python equivalent is sketched after this list)
      # appends each row of file.csv to a file named after its second field;
      # the header line is treated like any other row, so it ends up in a file of its own
      while IFS=, read -r a1 a2 a3 a4; do
        echo "$a1,$a2,$a3,$a4" >> "$a2".csv
      done < file.csv
      Please refer to this post for examples
    2. RDB (relational database): load the CSV data into a MySQL or PostgreSQL table and run in-database SQL queries to select the subsets
    3. Scale up: if you have access to an environment with plenty of RAM, you can run a RapidMiner process that loads the big CSV and splits the tabular data with (a) the "Loop Values" operator (enable parallel execution to make it run faster); inside the loop, use the iteration macro with "Filter Examples" to find each subset and then "Write CSV" with a macro-defined file name to store the individual small CSV files; or (b) the "Group Into Collection" operator with "Loop Collection", writing each example set to a separate CSV inside the loop
    4. Scale out: HDFS or Spark RDDs are popular solutions nowadays for really huge data (4.72GB is not that big)
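
    For illustration, a minimal Python/pandas sketch of the same split done with chunked reads, so the whole 4.72GB never has to sit in memory at once; the file name "texas.csv", the column name "County" and the chunk size are assumptions you would adjust:
      # minimal sketch, assuming pandas; file name, column name and chunk size are placeholders
      import pandas as pd

      written = set()  # counties that already have an output file with a header

      for chunk in pd.read_csv("texas.csv", chunksize=500_000):
          for county, rows in chunk.groupby("County"):
              # append this chunk's rows; write the header only the first time
              rows.to_csv(f"{county}.csv", mode="a", index=False,
                          header=county not in written)
              written.add(county)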

    HTH!

     

    YY

     

  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi, 

     

    Just my 5 cents... I would say the most computationally efficient way is the second one suggested by YY.

    You may break it down into two parts actually: 

     

    1) Use a simple Python (or whatever else) script to read the arbitrarily large file in chunks and save them into a SQL table (a rough sketch follows after these two steps); see the Python example here: http://odo.pydata.org/en/latest/perf.html

    2) Then run SQL queries on the subsets, using any SQL-related tool or RapidMiner.
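
    Purely as an illustration, a rough sketch of both steps using SQLite instead of a full database server; the file, table and column names ("texas.csv", "permits", "County") are placeholders:
      # rough sketch, assuming pandas and the standard-library sqlite3 module
      import sqlite3
      import pandas as pd

      con = sqlite3.connect("texas.db")

      # 1) stream the large CSV into a SQL table chunk by chunk
      for chunk in pd.read_csv("texas.csv", chunksize=500_000):
          chunk.to_sql("permits", con, if_exists="append", index=False)

      # 2) query one subset at a time (from any SQL tool or RapidMiner's Read Database)
      harris = pd.read_sql_query("SELECT * FROM permits WHERE County = 'Harris'", con)
      harris.to_csv("Harris.csv", index=False)
      con.close()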

     

    This approach also keeps the data available for any future processing, including use in different RM processes.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    Hi,

     

    this reminds me of an operator I wanted to implement: Stream Lines.

    Basically, it reads a file line by line and gives it to you as a document. You could then filter it and move forward.
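
    In plain Python the idea would look roughly like the sketch below (the operator itself does not exist yet, and the file name and column position are assumptions):
      # concept only: stream the file and keep just the rows for one county,
      # assuming the county sits in the second comma-separated field
      def stream_matching_lines(path, county):
          """Yield lines whose second CSV field equals the given county."""
          with open(path, encoding="utf-8") as handle:
              next(handle)  # skip the header line
              for line in handle:
                  if line.split(",")[1] == county:
                      yield line

      # e.g. write the Travis rows without ever loading the whole file
      with open("Travis.csv", "w", encoding="utf-8") as out:
          for row in stream_matching_lines("texas.csv", "Travis"):
              out.write(row)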

     

    Now I just need to find the time to implement this.

     

    @kypexin: Do you think this is useful?

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi Martin, 

     

    Well, some time ago I was asking more or less the same question about processing a long CSV: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Partial-retrieve-of-example-set/m-p/42320

     

    So the kind of operator you describe would be really helpful in my opinion. At least it would give an alternative option for handling such data.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    Hi @kypexin,

    well, doing something like this on a repository item is harder, since those are either DB tables or serialized Java objects. Doing it on flat files is easier.

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    But this is still a good alternative solution for the problem :) So I think I would use such an operator!

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I think it would be a very useful operator to add! I still miss the similar "Stream DB" operator, which used to exist but was deprecated for some unknown reason back in v7 somewhere...

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    The most elegant and RapidMiner-like solution would be a two-part operator like Handle Exception.

    In the left part you could put an operator returning an example set (like Read Database, Read CSV, Read Hive). The output would be processed by the right part in batches of selectable size.

    The input operators could be updated to know about the batch size (internal parameter?) and would adhere to it, saving memory in the process. Operators that aren't yet updated would just return everything, so this would be like Loop Batches with the data source outside. 
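
    Just to illustrate the batching idea in plain code (this has nothing to do with the actual operator API, which does not exist yet): a source yields rows and the second part consumes them in batches of a selectable size.
      # hypothetical sketch of batch-wise consumption; the batch size is arbitrary
      from itertools import islice

      def in_batches(rows, batch_size):
          """Group an iterable of rows into lists of at most batch_size rows."""
          rows = iter(rows)
          while batch := list(islice(rows, batch_size)):
              yield batch

      # e.g. handle 10,000 lines at a time without holding the whole file in memory
      with open("texas.csv", encoding="utf-8") as handle:
          for batch in in_batches(handle, 10_000):
              pass  # process the batch here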

     

    Balázs

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    @BalazsBarany,

    that would require that the user specifies the batching manually on the left-hand side?

    If yes, can't you do this with a normal Loop operator?

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    As long as the input operators don't have the optional batch size, this is just an elegant frontend for our existing loops or Loop Batches.

     

    My idea is more futuristic.
