"huge csv file to read from"

frazman Member Posts: 10 Contributor II
edited May 2019 in Help
Hi,
I have collected a huge database of products (with their descriptions, names, prices, and labels).
Now I am trying to create a multiclass classifier to automatically classify these products.
So, right now I am using a k-NN classifier and reading a subset of that data.
1) Now, a lame question: how do you build a good classifier when the data is this sparse? There are a lot of categories versus only a few attributes, and while the word lists in document classification are usually huge, here there are only the product name, the product description, and the price, so not many words. How do we solve that? Any suggestions or advice would be greatly appreciated.
I used the inputs from the Vancouver Data blogspot.
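To make question 1 concrete, here is a rough sketch in Python/scikit-learn of the kind of pipeline I mean, rather than an actual RapidMiner process (the file and column names, i.e. products.csv, name, description, price, and label, are made-up assumptions, not my real schema):

```python
# Hedged sketch: combine sparse TF-IDF text features from the short
# name/description fields with the scaled numeric price, then feed
# everything into a multiclass k-NN classifier.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("products.csv")  # placeholder file and column names

features = ColumnTransformer([
    ("name", TfidfVectorizer(), "name"),
    ("desc", TfidfVectorizer(), "description"),
    ("price", StandardScaler(), ["price"]),
])

model = Pipeline([
    ("features", features),
    # Cosine distance tends to behave better than Euclidean on
    # high-dimensional, sparse TF-IDF vectors.
    ("knn", KNeighborsClassifier(n_neighbors=5, metric="cosine")),
])
model.fit(df.drop(columns=["label"]), df["label"])
```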
2) How do I feed in this huge training data?
I always hit the memory limitations :(

Answers

  • frazman Member Posts: 10 Contributor II
    Hi,
    Can anyone give me advice on how to read from a huge CSV file?
    It takes about an hour and even then it doesn't finish.
    And if the file is too big, it gives a memory limitation error.
    Please advise me on this.
    Thanks
  • homburg Employee, Member Posts: 114 RM Data Scientist
    Hi.

    The problem you mentioned occurs when there is not enough memory available to build the entire example set. When you open a data file in RapidMiner and start reading its content, all the information is stored as Java data types in the heap memory. This can take several times more memory than the file occupies on your hard drive. Basically, there are a few options to proceed:

    1. If possible, let RapidMiner acquire more RAM (e.g. by adjusting the memory variable in the RM start script).
    2. In case the amount of data remains too high, try to work on a smaller sample and / or use a database for storage.
    3. When you want to use RapidMiner for ETL purposes, you may try to split the data into several files in order to accomplish your task. A sketch covering both sampling and splitting follows this list.
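    Outside of RapidMiner, options 2 and 3 can both be handled in one streaming pass over the file. This is a minimal sketch in Python/pandas, not a RapidMiner process; the file names, chunk size, and sample fraction are placeholder assumptions:

    ```python
    # Stream the CSV in fixed-size chunks so memory use stays flat,
    # regardless of the total file size.
    import pandas as pd

    sample_parts = []
    for i, chunk in enumerate(pd.read_csv("products.csv", chunksize=100_000)):
        # Option 3: write each chunk back out as a smaller file that
        # can be loaded on its own.
        chunk.to_csv(f"products_part_{i:03d}.csv", index=False)
        # Option 2: keep only a 5% random sample of each chunk.
        sample_parts.append(chunk.sample(frac=0.05, random_state=42))

    # The concatenated sample is small enough to fit in memory.
    pd.concat(sample_parts, ignore_index=True).to_csv("products_sample.csv", index=False)
    ```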

    Cheers,
       Helge