Trouble Loading a Large CSV File

JieDu Member Posts: 6 Contributor II
edited November 2018 in Help
Hi,

I am having trouble loading a 1.5 GB CSV file (1.2M records with 30 variables). I used the "Import CSV File..." function in the "Repositories" drop-down list. It ran for a while and then stopped without any error message, as if it had finished successfully. However, I cannot find the loaded data file at the specified location. It acted as if the process had timed out. I tried a 200+ MB file of the same format, and it worked fine. Does RM have a limitation on file size or time when loading a file? Has anyone experienced a similar problem?

Your advice is highly appreciated.

Jie

Answers

  • JieDu Member Posts: 6 Contributor II
    Haddock,

    Thank you very much for the response. I tried loading the file on two computers with different configurations. It seems that the failure is due to resource limits. Here are the tests:

    1) A 64-bit Windows 7 laptop with 8GB of memory and plenty of hard disk space: the memory usage was 2.7GB after RM was launched. It went up to ~7.5GB while loading the CSV file, then went down to 6.6-7.0 GB. The loading process then ended without producing any dataset in the repository.

    2) A 64-bit Windows 7 desktop with 16GB of memory and plenty of hard disk space: the memory usage was 2.7GB after RM was launched. It went up to ~12.2GB while loading the CSV file. The loading succeeded.

    I am surprised to find that RM is this resource-demanding. Loading a 1.2 GB file, which can only be called a medium-sized file at most these days, needs 12.2 GB of memory. The clickstream data we are working on can easily reach double that size. The trouble I have is with loading; from the link you sent over, it seems that resources are an issue at the analysis stage as well. Do you have any suggestions on how to deal with it?

    Thanks a lot,

    Jie
  • haddock Member Posts: 849 Maven
    Hi there Jie,

    To some extent I agree with you, that is a large overhead; but to some extent I don't, as that's a lot of admin and housekeeping that Java does to keep the machine from cooking itself! I know that Ingo and his mob are big on Radoop (check out the Rapid-I site), so perhaps you could look in that direction.

    For me the question was easier: I only deal with binominals, so I can work at the bit level and get my data and programs to zip along on a CUDA card with just 2GB of memory, whereas the same data takes up 7GB of Java memory. But if I put a foot wrong on the card, kerpow; performance and instability go hand in hand, as ever.
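
    To show what I mean by working at the bit level, here is a rough Python sketch (the sizes are made up to roughly match the file in this thread, and NumPy stands in for what I actually do on the CUDA card) of packing binominal values eight to a byte:

        import numpy as np

        # Made-up sizes, roughly matching the file in this thread: 1.2M rows, 30 binominal columns.
        n_rows, n_cols = 1_200_000, 30

        flags = np.random.rand(n_rows, n_cols) > 0.5              # boolean matrix, one byte per value
        print(flags.nbytes / 1e6, "MB as plain booleans")          # ~36 MB

        packed = np.packbits(flags, axis=1)                        # eight values per byte
        print(packed.nbytes / 1e6, "MB packed at the bit level")   # ~4.8 MB

        # Unpack (and trim the padding bits) whenever the full matrix is needed again.
        restored = np.unpackbits(packed, axis=1)[:, :n_cols].astype(bool)
        assert np.array_equal(flags, restored)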

    On a practical level, I've nearly always found that disproportionate RM performance benefits are to be had by chunking up your examples and looping over the chunks. Just an idea.
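
    The same chunk-and-loop idea, sketched in Python rather than RM operators just to show the shape of it (the file name, chunk size and "date" column are placeholders):

        import pandas as pd

        CSV_PATH = "clickstream.csv"      # placeholder path to the big file

        partial = []
        # Read 100,000 rows at a time instead of pulling the whole file into memory.
        for chunk in pd.read_csv(CSV_PATH, chunksize=100_000):
            # Per-chunk work goes here; counting rows per "date" as a stand-in example.
            partial.append(chunk.groupby("date").size())

        # Stitch the per-chunk results back together at the end.
        totals = pd.concat(partial).groupby(level=0).sum()
        print(totals.head())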

    Good hunting!
  • JieDu Member Posts: 6 Contributor II
    Hi Haddock,

    Thanks again for the insights. I am looking for a good low-cost data mining tool for my team to use. Knowing the limitations is as important as knowing the functionality.

    Jie
  • haddock Member Posts: 849 Maven
    Hi there,

    If you're looking for a team platform, you could look at RapidAnalytics servers and RapidMiner clients. The downside to both is the same: thin documentation. But, in an amazingly balanced way, the advantage is the same as well: zero cost. Open source is, however, the real decider for me, and once you get into the Lego mindset the documentation loses importance.

    Regards

  • RayJhong RapidMiner Certified Analyst, Member Posts: 11 Contributor II

    I've encountered the same issue when importing a 1.2GB CSV-format dataset into RM Studio; it won't succeed with less than 8GB of RAM.

    The data source is from Kaggle, linked here:

    https://www.kaggle.com/backblaze/hard-drive-test-data

    Files of this size are pretty common, I think. For comparison, a BI tool like Tableau extracts the same dataset using no more than 900MB of memory. Maybe RM could improve those operators for reading or importing datasets of this size.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hi @RayJhong - so to be honest, I think that if you're working with 1GB+ data files, you should either upgrade your RAM (8GB is really baseline for Studio) or use a database.
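
    If you go the database route, one low-effort sketch (purely illustrative; the file, table and database names are made up) is to stream the CSV into something like SQLite in chunks and then point the Read Database operator at it, so Studio only pulls in what a query asks for rather than the whole file:

        import sqlite3
        import pandas as pd

        conn = sqlite3.connect("clickstream.db")               # placeholder database file

        # Stream the CSV into a table chunk by chunk so peak memory stays small.
        for chunk in pd.read_csv("clickstream.csv", chunksize=100_000):
            chunk.to_sql("clicks", conn, if_exists="append", index=False)

        conn.close()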


    Scott


  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Hi Jie,


    I agree with you that the memory use is excessive. This is somewhat common in Java software.


    Did you try loading the data using a Python or R script? This could free up some memory, depending on how efficient the conversion from data.table / pandas to RapidMiner is; something along the lines of the sketch below.
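
    A minimal sketch of that idea, assuming the Python Scripting extension (which, as far as I remember, expects a function called rm_main) and with made-up column names and dtypes:

        import pandas as pd

        def rm_main():
            # Narrow dtypes keep pandas from defaulting to 64-bit numbers and generic
            # object strings, which is where a lot of the memory usually goes.
            dtypes = {
                "user_id": "int32",        # column names and types are made up; adjust to your file
                "page": "category",
                "duration_s": "float32",
            }
            data = pd.read_csv("clickstream.csv", dtype=dtypes, parse_dates=["timestamp"])
            return data                    # handed back to RapidMiner as an ExampleSet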


    Of course, you will probably still need more RAM to train models on the dataset. You can also try a combination of the Store and Free Memory operators, so as not to have a lot of tables hanging around in memory.


    Best,

    Sebastian

