
"Rapidminer Studio Crashes and says not responding"

pkkarri Member Posts: 3 Contributor I
edited June 2019 in Help
Hello,
    I have a large training set of around 32 million records and 10 attributes, all of which take discrete integer values. The training data is about 4GB in CSV format.
    I'm trying to build a random forest model on this data, but every time I try, it runs for about an hour, uses more than 60GB of main memory, and then stops responding. There is no further progress after that. I ran it with the default parameters for the Random Forest operator.
    Please help me with this. Are there any guidelines on how much memory a modeling operator needs for a given training data size? If not, please help me solve my current problem.

Best Regards,
Praveen Karri.

Answers

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Hi Praveen,

    I generally work to a rule of 2GB of RAM per 1M examples (with around 50-100 attributes) so it sounds as though you are
    Are you sure the application isn't still calculating in the background and merely unresponsive? How long did you leave it running after that first hour?
    Also, how fast does it run on different data samples?

    You might want to create a process that feeds increasingly large data samples to the random forest and logs how long each training run takes. (You can write the log to disk as you go.) This way you can see how the training time is affected by the amount of data.
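
The sample-size experiment described above can be sketched outside RapidMiner in plain Python. This is only an illustration under assumptions: `time_training`, `placeholder_train`, and the sample sizes are hypothetical names and values, and the placeholder learner simply sorts its input where a real random forest trainer would go. In practice the log would be a file on disk and the data would be shuffled before taking prefixes.

```python
import csv
import io
import random
import time


def time_training(rows, sizes, train_fn, log_file):
    """Train on growing prefixes of `rows` and log one CSV line per size."""
    writer = csv.writer(log_file)
    writer.writerow(["n_examples", "seconds"])
    timings = []
    for n in sizes:
        sample = rows[:n]
        start = time.perf_counter()
        train_fn(sample)
        elapsed = time.perf_counter() - start
        writer.writerow([len(sample), f"{elapsed:.6f}"])
        log_file.flush()  # keep the log usable even if a later run hangs
        timings.append((len(sample), elapsed))
    return timings


def placeholder_train(sample):
    # Hypothetical stand-in for the real learner; it only sorts the rows.
    return sorted(sample)


random.seed(0)
# Toy data: 10 discrete integer attributes per row, as in the question.
rows = [tuple(random.randint(0, 9) for _ in range(10)) for _ in range(20000)]

log = io.StringIO()  # swap for open("timings.csv", "w", newline="") on disk
timings = time_training(rows, [1000, 5000, 20000], placeholder_train, log)
print(log.getvalue())
```

Plotting `seconds` against `n_examples` from such a log should show whether training time grows roughly linearly or much faster, which helps estimate whether the full data set is feasible at all.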
  • pkkarri Member Posts: 3 Contributor I
    Thank you for the suggestions, Edward. I will try an incremental training set approach.

    I let the process run for more than 5 hours after it was stuck. There was almost 0% CPU usage and around 60GB RAM usage throughout the 5 hours. I am currently trying it with smaller samples.
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Also try experimenting with settings such as a higher leaf size parameter. The default of 2 is probably too fine-grained for your case.

    Once you get the optimal settings on a smaller training sample you can then loop in batches to create several Random Forests on different subsets of the full data.  You can then try combining those together into a single model.  That's a bit more advanced though so let's see how it goes on the smaller samples first and help you further from there.
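
One way to picture the batch idea above is the following Python sketch. It is not RapidMiner's implementation, and the post does not say how the forests would be merged; pooling all trees and taking a majority vote is an assumption made here. `train_stump_forest` is a hypothetical placeholder learner (one lookup-table "tree" per attribute), and the toy data labels a row 1 when at least two of its three bits are set.

```python
from collections import Counter
from itertools import product


def batches(rows, batch_size):
    """Yield consecutive chunks of at most `batch_size` rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def train_stump_forest(batch, n_trees=3):
    """Placeholder learner: tree i maps the value of attribute i to the
    most common label seen with that value in this batch."""
    default = Counter(label for _, label in batch).most_common(1)[0][0]
    forest = []
    for attr in range(n_trees):
        counts = {}
        for features, label in batch:
            counts.setdefault(features[attr], Counter())[label] += 1
        table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Default arguments bind this tree's own attribute, table and fallback.
        forest.append(lambda x, a=attr, t=table, d=default: t.get(x[a], d))
    return forest


def predict(trees, features):
    """Majority vote over every tree in the combined model."""
    return Counter(tree(features) for tree in trees).most_common(1)[0][0]


# Toy data: all 3-bit rows, repeated so each batch sees every combination.
rows = [(bits, int(sum(bits) >= 2)) for bits in product((0, 1), repeat=3)] * 6

combined = []  # the single merged model: one flat list of trees
for batch in batches(rows, batch_size=16):
    combined.extend(train_stump_forest(batch))

print(len(combined), predict(combined, (1, 1, 0)), predict(combined, (0, 0, 1)))
```

Each batch only needs to fit in memory on its own, which is the point of the approach. The batches should be drawn from shuffled data so that each one is representative of the full set; a batch that contains only one class would contribute trees that always vote for that class.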
