"huge csv file to read from"

frazman Member Posts: 10 Contributor II
edited May 2019 in Help
I have collected a huge database of products (with their names, descriptions, prices, and labels).
Now I am trying to create a multiclass classifier to automatically classify these products.
So, right now I am using a k-NN classifier and reading a subset of that data.
1) Now, a basic question: how do you build a good classifier with data this sparse? There are a lot of categories but only a few attributes. In document classification the word lists are usually huge, but here each example has only a product name, a description, and a price, hence not many words. How do we solve that? Any suggestions or advice would be greatly appreciated.
I used the inputs from the Vancouver Data blogspot.
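To make the setup concrete (this is a generic sketch, not RapidMiner-specific): even with short texts, k-NN over bag-of-words vectors with cosine similarity works in principle. A minimal pure-Python illustration, where the tiny training set, labels, and `knn_predict` helper are all hypothetical:

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector for a short product text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_predict(train, query, k=3):
    """Majority vote among the k training examples most similar to query.
    train is a list of (text, label) pairs."""
    qv = vectorize(query)
    scored = sorted(train, key=lambda tl: cosine(vectorize(tl[0]), qv),
                    reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: product name + description concatenated.
train = [
    ("acme laptop 15 inch notebook computer", "electronics"),
    ("usb mouse wireless optical", "electronics"),
    ("cotton t-shirt blue mens", "clothing"),
    ("denim jeans slim fit", "clothing"),
]
print(knn_predict(train, "wireless keyboard usb", k=3))
```

With so few words per example, concatenating name and description (and binning the numeric price into a token such as "price_low") is one common way to give the vectors a little more signal.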
2) How do I feed it this huge training set?
I always hit the memory limit :(


  • frazman Member Posts: 10 Contributor II
    Can anyone give me advice on how to read from a huge CSV file?
    It takes about an hour and even then it doesn't finish.
    And if the file is too big, it throws a memory limitation error.
    Please advise me on this.
  • homburg Moderator, Employee, Member Posts: 114 RM Data Scientist

    The problem you mentioned occurs when there is not enough memory available to build the entire example set. When you open a data file in RapidMiner and start reading its content, all the information is stored as Java data types in heap memory, which can take several times more space than the file occupies on your hard drive. Basically, there are some options to proceed:

    1. If possible, let RapidMiner acquire more RAM (e.g. by adjusting the memory variable in the RM start script).
    2. If the amount of data is still too high, try to work on a smaller sample and/or use a database for storage.
    3. If you want to use RapidMiner for ETL purposes, you may try to split the data into several files in order to accomplish your task.
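For option 2, a uniform sample can be drawn without ever loading the whole file into memory. A sketch outside RapidMiner (the file path, sample size, and `sample_csv` helper are assumptions), using one-pass reservoir sampling:

```python
import csv
import random

def sample_csv(path, n=10_000, seed=42):
    """Draw a uniform random sample of n data rows from a CSV that is
    too large to load, via reservoir sampling (one pass, O(n) memory)."""
    rng = random.Random(seed)
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        reservoir = []
        for i, row in enumerate(reader):
            if i < n:
                reservoir.append(row)
            else:
                # Each row survives with probability n / (i + 1).
                j = rng.randint(0, i)
                if j < n:
                    reservoir[j] = row
    return header, reservoir
```

The sampled rows can then be written back out as a small CSV that RapidMiner loads comfortably.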
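For option 3, the split itself can also be done outside RapidMiner before importing. A hypothetical Python sketch (file names and chunk size are assumptions) that cuts a large CSV into fixed-size chunk files, repeating the header row in each so every chunk can be imported on its own:

```python
import csv

def split_csv(path, rows_per_file=100_000, prefix="chunk"):
    """Split a large CSV into several smaller files, copying the header
    into each chunk so every file is loadable independently."""
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        out, writer, count, part = None, None, 0, 0
        for row in reader:
            if count % rows_per_file == 0:
                if out:
                    out.close()
                part += 1
                out = open(f"{prefix}_{part}.csv", "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
            writer.writerow(row)
            count += 1
        if out:
            out.close()
    return part  # number of chunk files written
```

Each chunk can then be read and processed in its own pass, which keeps the per-run heap requirement bounded by the chunk size rather than the full file.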
