
Challenge with RM Server - Running out of memory

Ramesh_T · Member Posts: 2 · Contributor I

Hi there!

I am a newbie and this is my first post in the community. We have a RapidMiner Server installation running on top of an MS SQL Server box, with a job container that has 64 GB of RAM. I built some workflows against sample data in the Studio environment and am now trying to run those processes, after the necessary changes, in the server environment, connected to the original SQL tables. These workflows mainly involve basic data joins and summarization after applying a few domain-specific business rules.

When I try to run a flow, I quickly run out of memory. Even the first part of my flow, which involves reading a few variables from a 40 GB dataset, does not complete. Due to the nature of the data and the business knowledge involved, I am not in a position to share the XML flow or log files here.

I have a few specific questions for the community:

1. How does RM Server handle memory internally? Is the whole source data file read and kept in memory while processing?

2. What is the maximum database size at source that can be handled by a single 64 GB container?

3. Would you recommend RM Server for very large data processing operations (e.g., data approaching a TB in size)?

Thanks,

Ramesh

Tagged: sgenzer, dbabrauskaite, SGolbert

Answers

  • David_A · Administrator, Moderator, Employee, RMResearcher, Member Posts: 179 · RM Research

    If you try to load a whole 40 GB table into RapidMiner, you may well run into memory issues with 64 GB of RAM. If you only need a few variables from the data set, you can either query them directly in the Read Database operator or, if you have a more complex ETL workflow, take a look at the In-Database Processing extension on the RapidMiner Marketplace.

    With that you can shift most of the pre-processing workload from RapidMiner to the database.
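    For illustration only, here is a minimal Python/SQLite sketch of that idea (table and column names are invented; the real source would be the MS SQL database): project just the columns the workflow needs and let the database filter rows, so only a small result set ever reaches the client.

    ```python
    import sqlite3

    # Hypothetical stand-in for the source database; schema is made up.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL, notes TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                     [(1, "EU", 10.0, "x"), (2, "US", 20.0, "y"), (3, "EU", 30.0, "z")])

    # Instead of SELECT * (whole table into memory), select only the
    # needed columns and push the row filter down to the database.
    rows = conn.execute(
        "SELECT region, amount FROM sales WHERE region = ?", ("EU",)
    ).fetchall()
    print(rows)  # [('EU', 10.0), ('EU', 30.0)]
    ```

    The same principle applies to a query typed into the Read Database operator: the narrower the SELECT, the less data RapidMiner has to hold.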

    Best,
    David

  • sgenzer · Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,512 · Community Manager
    Hi @Ramesh_T - I of course agree with everything @David_A has said above. If you are doing basic ETL like joins on large DB tables, you are almost always going to be better off doing those in-database rather than in RapidMiner. The In-Database Extension is quite good especially if you don't want/like writing SQL. :smile:
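    As a rough sketch of the join-in-database idea (Python/SQLite here purely for illustration; all names are invented), the database performs the join and aggregation and only the small summarized result is fetched:

    ```python
    import sqlite3

    # Toy stand-in for two large source tables; schema is hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, segment TEXT);
    INSERT INTO orders VALUES (1, 10, 5.0), (2, 11, 7.5), (3, 10, 2.5);
    INSERT INTO customers VALUES (10, 'retail'), (11, 'wholesale');
    """)

    # Join and summarize inside the database; only a few rows come back,
    # instead of two full tables being joined as ExampleSets in memory.
    summary = conn.execute("""
        SELECT c.segment, SUM(o.amount)
        FROM orders o JOIN customers c ON o.customer_id = c.customer_id
        GROUP BY c.segment ORDER BY c.segment
    """).fetchall()
    print(summary)  # [('retail', 7.5), ('wholesale', 7.5)]
    ```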

    Another nice tool to use for cases like this is the caching operators from Old World Computing in their Jackhammer extension. They have just published some new blog articles showing how this is done - you can find part 1 here: https://oldworldcomputing.com/tutorial-introduction-to-caching-functions-of-the-jackhammer-extension-by-old-world-computing/  It is designed almost exactly for your use case. I'm cc'ing @land in case he has something more to add.

    Scott
    ----------------------
    Don't forget to submit your great ideas for Wisdom 2020! Deadline is November 15.

    Wisdom 2020 – Call for Speakers Form 

  • Ramesh_T · Member Posts: 2 · Contributor I

    @David_A and @sgenzer:

    Thank you for taking time to respond and for your inputs.

    I am using a Read Database operator with a query to pull a few variables. I will look into the In-Database Processing and Jackhammer extensions and keep this thread updated.

    Thanks,

    Ramesh

  • SGolbert · RapidMiner Certified Analyst, Member Posts: 341 · Unicorn
    Hi,

    @David_A I didn't know about the In-Database Processing Extension, it is quite useful, thanks!

    @Ramesh_T if you don't need to summarize across all rows (or if you can reconstruct the summary of all rows from sub-summaries, for example for the mean of a column) you can also fetch the rows in batches. Otherwise, it makes sense to preprocess the data in the database and then use RM for the machine learning parts.
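    A small sketch of that sub-summary trick (synthetic data, plain Python): the global mean is rebuilt from per-batch counts and sums, so no more than one batch is ever held in memory.

    ```python
    # Reconstruct a column mean from per-batch partial sums.
    def batched(values, batch_size):
        """Yield successive slices of at most batch_size elements."""
        for i in range(0, len(values), batch_size):
            yield values[i:i + batch_size]

    values = list(range(1, 101))  # stands in for one numeric column

    total, count = 0.0, 0
    for batch in batched(values, batch_size=10):
        total += sum(batch)   # per-batch sub-summary
        count += len(batch)

    mean = total / count
    print(mean)  # 50.5
    ```

    The same decomposition works for counts, sums, min/max, and (with a bit more bookkeeping) variance; it does not work for order statistics like the median, which is why those cases are better pushed into the database.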

    Regards,
    Sebastian