
Challenge with RM Server - Running out of memory

Ramesh_T · Member Posts: 2 · Contributor I

Hi there!

I am a newbie and this is my first post in the community. We have a RapidMiner Server installation running on top of an MS SQL Server box, with a job container that has 64 GB of RAM. I built some workflows against sample data in the Studio environment and am now trying to run those processes, after the necessary changes, in the server environment, connected to the original SQL tables. These workflows mainly involve basic data joins and summarization after applying a few domain-specific business rules.

When I try to run a flow, I quickly run out of memory. Even the first part of my flow, which involves reading a few variables from a 40 GB dataset, does not complete. Due to the nature of the data and the business knowledge involved, I am not in a position to share the XML flow or log files here.

I have a few specific questions for the community:

1. How does RM Server handle memory internally? Is the whole source data file read and kept in memory while processing?

2. What is the maximum database size at source that can be handled by a single 64 GB container?

3. Would you recommend RM Server for very large data processing operations (e.g., data approaching a TB in size)?

Thanks,

Ramesh

Tagged: sgenzer, dbabrauskaite, SGolbert

Answers

  • David_A · Administrator, Moderator, Employee, RMResearcher, Member Posts: 179 · RM Research

    If you try to load a whole 40 GB table into RapidMiner, you may well run into memory issues with 64 GB of RAM. If you only need a few variables from the data set, you can either query them directly in the Read Database operator or, if you have a more complex ETL workflow, take a look at the In-Database Processing extension on the RapidMiner Marketplace.

    With that you can shift most of the pre-processing workload from RapidMiner to the database.
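    For illustration only, here is a minimal Python/SQLite sketch of that idea (table and column names are invented; the real source would be the MS SQL database): project just the columns the workflow needs and let the database filter rows, so only a small result set ever reaches the client.

    ```python
    import sqlite3

    # Hypothetical stand-in for the source database; schema is made up.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL, notes TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                     [(1, "EU", 10.0, "x"), (2, "US", 20.0, "y"), (3, "EU", 30.0, "z")])

    # Instead of SELECT * (whole table into memory), select only the
    # needed columns and push the row filter down to the database.
    rows = conn.execute(
        "SELECT region, amount FROM sales WHERE region = ?", ("EU",)
    ).fetchall()
    print(rows)  # [('EU', 10.0), ('EU', 30.0)]
    ```

    The same principle applies to a query typed into the Read Database operator: the narrower the SELECT, the less data RapidMiner has to hold.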

    Best,
    David

  • sgenzer · Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,512 · Community Manager
    Hi @Ramesh_T - I of course agree with everything @David_A has said above. If you are doing basic ETL like joins on large DB tables, you are almost always going to be better off doing those in-database rather than in RapidMiner. The In-Database Extension is quite good especially if you don't want/like writing SQL. :smile:
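    As a rough sketch of the join-in-database idea (Python/SQLite here purely for illustration; all names are invented), the database performs the join and aggregation and only the small summarized result is fetched:

    ```python
    import sqlite3

    # Toy stand-in for two large source tables; schema is hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, segment TEXT);
    INSERT INTO orders VALUES (1, 10, 5.0), (2, 11, 7.5), (3, 10, 2.5);
    INSERT INTO customers VALUES (10, 'retail'), (11, 'wholesale');
    """)

    # Join and summarize inside the database; only a few rows come back,
    # instead of two full tables being joined as ExampleSets in memory.
    summary = conn.execute("""
        SELECT c.segment, SUM(o.amount)
        FROM orders o JOIN customers c ON o.customer_id = c.customer_id
        GROUP BY c.segment ORDER BY c.segment
    """).fetchall()
    print(summary)  # [('retail', 7.5), ('wholesale', 7.5)]
    ```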

    Another nice tool to use for cases like this is the caching operators from Old World Computing in their Jackhammer extension. They have just published some new blog articles showing how this is done - you can find part 1 here: https://oldworldcomputing.com/tutorial-introduction-to-caching-functions-of-the-jackhammer-extension-by-old-world-computing/  It is designed almost exactly for your use case. I'm cc'ing @land in case he has something more to add.

    Scott
    ----------------------
    Don't forget to submit your great ideas for Wisdom 2020! Deadline is November 15.

    Wisdom 2020 – Call for Speakers Form 

  • Ramesh_T · Member Posts: 2 · Contributor I

    @David_A and @sgenzer:

    Thank you for taking time to respond and for your inputs.

    I am using a Read Database operator with a query to pull a few variables. I will look into the In-Database Processing and Jackhammer extensions and keep this thread updated.

    Thanks,

    Ramesh

  • SGolbert · RapidMiner Certified Analyst, Member Posts: 341 · Unicorn
    Hi,

    @David_A I didn't know about the In-Database Processing Extension, it is quite useful, thanks!

    @Ramesh_T if you don't need to summarize across all rows (or if you can reconstruct the summary of all rows from sub-summaries, for example for the mean of a column) you can also fetch the rows in batches. Otherwise, it makes sense to preprocess the data in the database and then use RM for the machine learning parts.
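    A small sketch of that sub-summary trick (synthetic data, plain Python): the global mean is rebuilt from per-batch counts and sums, so no more than one batch is ever held in memory.

    ```python
    # Reconstruct a column mean from per-batch partial sums.
    def batched(values, batch_size):
        """Yield successive slices of at most batch_size elements."""
        for i in range(0, len(values), batch_size):
            yield values[i:i + batch_size]

    values = list(range(1, 101))  # stands in for one numeric column

    total, count = 0.0, 0
    for batch in batched(values, batch_size=10):
        total += sum(batch)   # per-batch sub-summary
        count += len(batch)

    mean = total / count
    print(mean)  # 50.5
    ```

    The same decomposition works for counts, sums, min/max, and (with a bit more bookkeeping) variance; it does not work for order statistics like the median, which is why those cases are better pushed into the database.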

    Regards,
    Sebastian