Memory usage in EvolutionWeighting

keithkeith Member Posts: 157 Maven
edited November 2018 in Help
What data does RM need to retain as it processes multiple generations in an EvolutionaryWeighting node?  I have a process that is trying to optimize attribute weights for a NearestNeighbor model, and I'm finding that after a relatively small number of generations (as few as 15), all the memory allocated to RM has been used, and the Java process freezes.

This is on 32-bit Win XP, with a few thousand records in an example set, and about a dozen attributes.  The relevant process snippet is:

EvolutionaryWeighting (50 generations, 5 population_size, using intermediate weights file, tournament selection, keep_best_individual)
-MemoryCleanUp (attempt to limit memory usage)
-OperatorChain
--XValidationParallel (5 validations, shuffled sampling)
---NearestNeighbor (k=10, weighted vote)
---OperatorChain
----ModelApplier
----Performance
---ProcessLog (logging generation, best, performance)

My hope was that I'd be able to leave the process running for a long time, potentially days, to allow it to evolve throughout the search space for a best answer, but the reality right now is that RM stalls out.  No error message, just a lack of processing.  Even the system time display fails to update.  My guess is that it is some sort of Java memory management issue, but I don't know whether the amount of memory in use should grow so high or not.

Any ideas?

Thanks,
Keith

Answers

  • Legacy UserLegacy User Member Posts: 0 Newbie
    Did you try this process without the process log? If you jitter the logged values you will probably see that there are thousands of logged values stored during the process (and not only one per generation as you might think). The reason is that the ProcessLog collects the values each time it is executed, i.e. for each evaluation.

    In my experience, my feature selection / weighting processes need much less memory without the ProcessLog operator  ;D
  • keithkeith Member Posts: 157 Maven
    Thanks for the suggestion.

    The table view of the Process Log shows 3 values recorded for every member of the population within a generation.  I.e. with 10 generations, and a population of 5, it records a total of 10*5*3 = 150 values.  Doesn't seem like it's the cause of the memory growth that I'm seeing, unless it is actually creating a lot more data behind the scenes that isn't made visible.
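    To make the arithmetic concrete, here is a rough sketch in plain Java. It is not RM code: the class and method names are made up for illustration, and it assumes one log row per individual per generation, which is what the table view suggests. If the ProcessLog actually records once per evaluation, as suggested above, the real count would be higher than this estimate.

    ```java
    public class LogCount {
        // Hypothetical helper: number of values visible in the log table,
        // assuming one row per individual per generation.
        static long loggedValues(int generations, int populationSize, int valuesPerRow) {
            return (long) generations * populationSize * valuesPerRow;
        }

        public static void main(String[] args) {
            // 10 generations, population of 5, 3 logged columns
            System.out.println(loggedValues(10, 5, 3)); // 150
            // the full 50-generation run configured in the process
            System.out.println(loggedValues(50, 5, 3)); // 750
        }
    }
    ```

    Either way, a few hundred numbers on their own should be negligible next to the heap that is being exhausted.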

    I will certainly try turning off logging to see if that helps.  However, I'd prefer not to disable process logging, as it's one of the few ways to get any visibility into how RM is progressing during a long process run.
  • keithkeith Member Posts: 157 Maven
    Update: Removing the ProcessLog node slowed down the rate of memory usage growth, allowing me to get another 10-15 generations completed, but eventually the process still maxed out the available memory and stalled.  That still leaves me wondering what results RM stores from previous generations, so that I can understand whether this is expected behavior that I need to account for, or whether it is a memory leak.
  • keithkeith Member Posts: 157 Maven
    Further update: This may be more related to cross-validation than to evolutionary weighting.  I removed the XValidationParallel operator from the process, so that the model is just trained and applied once on the entire dataset on each pass through the loop.  The process has been running for several hours, with only modest growth in memory usage.

    I'm not sure if it is the XValidationParallel operator itself, or the fact that the learner/applier/performance combination ends up being run so many more times that causes the memory usage to grow as high as it does.  And I'm still not sure if this is expected behavior or a bug.  But at least I know more about what's causing it, and for now I can run the model without cross-validation.

  • IngoRMIngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Ah, thanks for the info. The parallel versions do have to copy the view on the data (which is some overhead by itself) for each parallel branch. If you have large data sets with many attributes, the additional memory usage can be very large if you specify, for example, 4 or 8 cores. Please keep in mind that, in addition to the views, we of course also need additional memory to keep track of the different subprocesses.

    Nevertheless, there might also be a second reason for the high memory usage: there seems to be a GUI-related memory leak in cases where results are displayed at a breakpoint or at the end of the process. In many cases, the resources are not freed (at least not for a long time) after the results have been displayed once. The whole team is currently profiling RapidMiner and searching for this leak, but I have to admit we have not had any success yet. I will let you know as soon as we have found the reason, and we will deliver an updated version as soon as possible.

    For now, I assume that the improvement you saw after removing the ProcessLog was probably related to this GUI memory leak. The additional memory used by the parallel cross-validation is quite normal.

    We will keep you updated about that.

    Cheers,
    Ingo
  • keithkeith Member Posts: 157 Maven
    Thanks, Ingo.  That does help explain what I've been seeing.

    One further clarification: RM is running on a single-CPU, dual-core system, so the parallel xval should be broken up into 2 subprocesses.  After the XValidationParallel node has completed, should the memory usage return to its pre-XVal level, or will there be some portion of that memory that RM retains?  What I was seeing was a gradual increase in memory usage as XValidationParallel was called multiple times (even after removing ProcessLog), not just a temporary spike during the execution of the node.  And this was true even though I included a MemoryCleanUp node in the EvolutionaryWeighting inner loop.
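    For reference, this is roughly how a temporary spike can be told apart from retained memory in plain Java. It is only a sketch using the standard `Runtime` API, not anything from RM: the class name, the helper, and the 16 MB scratch allocation standing in for "one XVal run" are all made up for illustration. `System.gc()` is only a request to the JVM, so the numbers are approximate.

    ```java
    public class HeapProbe {
        // Heap currently in use, in bytes, after asking the JVM to collect garbage.
        // System.gc() is a hint, so treat the result as an approximation.
        static long usedHeapAfterGc() {
            Runtime rt = Runtime.getRuntime();
            System.gc();
            return rt.totalMemory() - rt.freeMemory();
        }

        public static void main(String[] args) {
            long before = usedHeapAfterGc();

            // Stand-in for one expensive step: hold about 16 MB temporarily.
            byte[][] scratch = new byte[16][];
            for (int i = 0; i < scratch.length; i++) {
                scratch[i] = new byte[1 << 20];
            }
            long during = usedHeapAfterGc();

            // Drop the only reference so the memory becomes collectable.
            scratch = null;
            long after = usedHeapAfterGc();

            System.out.printf("before=%d KB, during=%d KB, after=%d KB, max=%d KB%n",
                    before / 1024, during / 1024, after / 1024,
                    Runtime.getRuntime().maxMemory() / 1024);
        }
    }
    ```

    If "after" drops back close to "before", the step only caused a spike. If it keeps creeping up over repeated runs, something is still holding references, which is the pattern I seem to be seeing.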

    Thanks again,
    Keith