Slow iterations? Try MaterializeDataInMemory !!

haddock Member Posts: 849 Maven
edited November 2018 in Help
I have a parameter iteration which includes an iterating operator chain, in effect to perform multiple label sliding window validations. In order to keep things clean I've been careful to consume IO objects and clear memory after each loop. Nevertheless the process took three hours to complete, until I put in the magic ingredient. Now it takes 25 minutes!

If anyone has a spare moment could they explain what "MaterializeDataInMemory" actually does? I stumbled across it in the documentation for memory cleanup, and am as a consequence none the wiser. What is certain is that for me the following proved to be quite some understatement...
"Might be very useful in combination with the MemoryCleanUp (see section 5.2.9) operator after large preprocessing trees using lot of views or data copies."


    IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hola Captain,

    yes, this one is a real gem in many cases - and can mean the death for your memory chips in others.

    The MaterializeDataInMemory operator actually does a very simple thing: it gets an example set as input and creates a completely fresh copy of the data values in main memory. It iterates over the data, stores all values in a new table, and creates a new example set, which is just a simple view on this fresh table.

    Most preprocessing operators, as well as many modeling schemes and other operators, create a new view on the input example set and wrap it around the existing views. Here is an example:

    View 3: Normalize Values on the fly (<-- just created by a normalization operator)
    View 2: select only a subset of features
    View 1: original data view (this was the view directly created after loading)

    Hence, the bottom view (View 1) was the first one and the others were put on top of it. Of course, this view concept reduces the amount of memory used (compared to always creating a copy of your data before each step), but on the other hand the on-the-fly calculations and the view handling cost some runtime on every data access.

    The MaterializeDataInMemory operator takes the complete view stack and creates a fresh table with a single view on it, so data access is again as simple and fast as with the original data view (View 1 above). The new data is, of course, calculated from the input view stack. A side effect is that the new table lives in main memory, which speeds things up further if you were working on a database before. Be aware, though, that huge data sets from databases will lead to out-of-memory exceptions faster than you can count the characters in "MaterializeDataInMemory".
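
To make the view stack idea concrete, here is a rough Python sketch. It is only a toy model and not RapidMiner code: the names Table, SelectColumns, Normalize and materialize are made up for illustration, and the real operator certainly differs in detail. It rebuilds the three views from the example above and then collapses them into one fresh table.

```python
# Toy model of a view stack; all names here are hypothetical,
# this is NOT the RapidMiner API.

class Table:
    """Plain in-memory data: every read is a direct lookup (like View 1)."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def shape(self):
        n_rows = len(self.rows)
        n_cols = len(self.rows[0]) if self.rows else 0
        return n_rows, n_cols

    def value(self, r, c):
        return self.rows[r][c]

    def depth(self):
        return 1


class SelectColumns:
    """View that exposes only a subset of the parent's columns (like View 2)."""

    def __init__(self, parent, cols):
        self.parent = parent
        self.cols = list(cols)

    def shape(self):
        return self.parent.shape()[0], len(self.cols)

    def value(self, r, c):
        # Every read is forwarded to the parent with a remapped column index.
        return self.parent.value(r, self.cols[c])

    def depth(self):
        return self.parent.depth() + 1


class Normalize:
    """View that rescales each column to [0, 1] on the fly (like View 3)."""

    def __init__(self, parent):
        self.parent = parent
        n_rows, n_cols = parent.shape()
        columns = [[parent.value(r, c) for r in range(n_rows)]
                   for c in range(n_cols)]
        self.lo = [min(col) for col in columns]
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        self.span = [(max(col) - min(col)) or 1.0 for col in columns]

    def shape(self):
        return self.parent.shape()

    def value(self, r, c):
        # The arithmetic is repeated on every single read.
        return (self.parent.value(r, c) - self.lo[c]) / self.span[c]

    def depth(self):
        return self.parent.depth() + 1


def materialize(view):
    """Read every value once through the stack and store it in a new Table."""
    n_rows, n_cols = view.shape()
    return Table([[view.value(r, c) for c in range(n_cols)]
                  for r in range(n_rows)])


raw = Table([[1.0, 10.0, 100.0],
             [2.0, 20.0, 200.0],
             [3.0, 30.0, 300.0]])
stack = Normalize(SelectColumns(raw, [0, 2]))  # three layers per read
fresh = materialize(stack)                     # one layer, same values
```

In this sketch every read from stack goes through three layers, while fresh holds the same values behind a single lookup. The price is a second full copy of the values in memory, which is exactly the trade-off described above.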

    Hope that clarifies things a bit.

    haddock Member Posts: 849 Maven
    Ahoy there Ingo!

    Aha... No wonder it zips along so nicely now, many thanks for the heads up on that, and clear enough even for me!

    PS. I've sent you a PM on a different matter.