"R Extension VS RM Memory Sharing Problems"

dragoljub Member Posts: 241 Contributor II
edited May 2019 in Help
Hi Everyone,

I have spent a few weeks learning R to develop a simple IQR normalization script for use within RM. I can successfully apply this script to example sets that contain 60k examples with 765 attributes (350MB), but only ONCE. If I try to run the process again, R runs out of memory and complains:

"Oct 27, 2010 2:21:55 PM SEVERE: IQR Norm: Error: cannot allocate vector of size 360.7 Mb"

It seems that RM is allocating too much unused memory from the system (Windows 7 x64 Ultimate). During the first run there is enough free memory available to run the R script. On subsequent runs R cannot allocate memory because RM has used around 6.95GB of 8GB, which leaves only around 12MB free once the OS and other apps are taken into account.

So what we need is a way to easily control how much memory we allow RM to take up.

Or we need some way to ENABLE active garbage collection, so that as soon as a process has finished executing, the closed results are freed from memory.

Here is the simple script:
memory.limit(8*4000)
y <- as.data.frame(apply(x,2,function(col) (col-median(col))*1.349/IQR(col)))
warnings()
I raise the memory limit to begin, then run the script on each column of the example set, then print any warnings. (The constant 1.349 is the IQR of a standard normal distribution, so dividing by IQR(col)/1.349 gives a robust estimate of the standard deviation.)
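
One workaround worth trying, assuming the R session (and its variables) persists between runs, is a defensive variant of the script that drops the input and forces a collection before it exits; this is a sketch of the idea, not a confirmed fix:

memory.limit(4000)
# robust (IQR-based) normalization, applied column by column
y <- as.data.frame(apply(x, 2, function(col) (col - median(col)) * 1.349 / IQR(col)))
warnings()
# defensively drop the input and force garbage collection so copies
# do not accumulate across runs (assumes the session is kept alive)
rm(x)
invisible(gc())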

This works the first time, and then not again until I restart RM. The worst part is that even when I set RM to use at most 4GB, it still commits more than 7GB!

Any ideas?
-Gagi

Answers

  • dragoljub Member Posts: 241 Contributor II
    Just a quick update.

    It looks like RM allows the R extension a maximum of 2GB of memory to start. I extend that to 4GB to perform my analysis.

    Run 1:
    I load 350MB of data and run the R script to perform the analysis, then run garbage collection with the gc() function. After this, R reports that 700MB of memory is used, which makes sense: the 350MB input plus the 350MB output.

    Run 2:
    I run the same process a second consecutive time. After the second iteration, and after a second round of garbage collection, R reports that 1.5GB of memory is used! This is the problem: new variables seem to be created each time I run the R script, rather than the old ones being overwritten.

    Run 3:
    Process fails because R runs out of memory.

    As you can see, the available memory quickly fills up with each iteration because it is not freed after the R script exits.
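
    If the extension really does leave the script's variables behind, a cleanup along these lines at the end of each execution should release them (a sketch of the idea, not the extension's actual behavior):

    # hypothetical cleanup after each execution: drop everything
    # defined in the session, then collect the freed memory
    rm(list = ls(all.names = TRUE))
    invisible(gc())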

    How can I avoid this?
    -Gagi
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Gagi,
    could you send an example process with the R script and data generation included? I will try to reproduce it and see where the memory leak is. It might be that the variables in R aren't deleted properly...

    Greetings,
      Sebastian
  • dragoljub Member Posts: 241 Contributor II
    Hi Sebastian,

    Here is the code that will reproduce this. I generate 60K examples with 765 attributes. (I have 8GB of RAM.)

    Before you run the process, check memory.size() in the R perspective. You will get something like:
    memory.size()
    [1] 20.87

    Then run the process once and check memory.size(); you will get:
    memory.size()
    [1] 783.69
    Then run it again; the memory usage increases:
    memory.size()
    [1] 1518.28

    On the third run R crashes with a memory limit problem.
    memory.size()
    [1] 3344.75
    And here is the RM output:
    Oct 29, 2010 5:21:26 PM SEVERE: IQR Norm: Error: cannot allocate vector of size 350.6 Mb

    Oct 29, 2010 5:21:26 PM SEVERE: IQR Norm: In addition: Warning messages:

    Oct 29, 2010 5:21:26 PM SEVERE: IQR Norm: 1: In getDependencies(pkgs, dependencies, available, lib) :
      package 'mlr' is not available

    Oct 29, 2010 5:21:26 PM SEVERE: IQR Norm: 2: In unlist(ans, recursive = FALSE) :
      Reached total allocation of 4000Mb: see help(memory.size)

    Oct 29, 2010 5:21:26 PM SEVERE: IQR Norm: 3: In unlist(ans, recursive = FALSE) :
      Reached total allocation of 4000Mb: see help(memory.size)

    Oct 29, 2010 5:21:26 PM SEVERE: IQR Norm: 4: In unlist(ans, recursive = FALSE) :
      Reached total allocation of 4000Mb: see help(memory.size)

    Oct 29, 2010 5:21:26 PM SEVERE: IQR Norm: 5: In unlist(ans, recursive = FALSE) :
      Reached total allocation of 4000Mb: see help(memory.size)

    Oct 29, 2010 5:21:26 PM SEVERE: IQR Norm: Error: object 'y' not found
    Here is the process:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.10" expanded="true" name="Process">
        <process expanded="true" height="505" width="1765">
          <operator activated="true" class="generate_data" compatibility="5.0.11" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
            <parameter key="number_examples" value="60000"/>
            <parameter key="number_of_attributes" value="765"/>
          </operator>
          <operator activated="true" class="r:execute_script_r" compatibility="5.0.1" expanded="true" height="76" name="IQR Norm" width="90" x="179" y="30">
            <parameter key="script" value="memory.limit(4000)&#13;&#10;y &lt;- as.data.frame(apply(x,2,function(col) (col-median(col))*1.349/IQR(col)))&#13;&#10;warnings()&#13;&#10;gc()"/>
            <enumeration key="inputs">
              <parameter key="name_of_variable" value="x"/>
            </enumeration>
            <list key="results">
              <parameter key="y" value="Data Table"/>
            </list>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="IQR Norm" to_port="input 1"/>
          <connect from_op="IQR Norm" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Thanks for the help,
    -Gagi
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    thanks a lot. I have located and removed the issue causing the memory leak. Along with some other improvements, I will upload a fixed version.

    Greetings,
      Sebastian
  • dragoljub Member Posts: 241 Contributor II
    Awesome!

    The next important thing is to allow passing of special attributes through the R Script operator. For example, a check box in the operator asking whether it should be applied to all attributes or only regular attributes. This would really help with meta data passing.  ;D

    -Gagi
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Gagi,
    would you like to join our SIG for the R Extension? We are looking for users of the extension who can share practical experience. We don't use R that often ourselves...

    Actually, you CAN pass special roles to and from the R Script operator. If you export an ExampleSet as a variable eSet, then variables named eSet.<special role name> will exist, containing the names of the special attributes. You can use them to set the target in your R script.
    When you import a data frame called eSet, it is checked for every predefined special role: if a variable named eSet.<special role name> exists and an attribute with that name exists in the data frame, that attribute is assigned the role.
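
    A minimal sketch of what that could look like inside a script, assuming the ExampleSet was exported as eSet and carries a label role (the eSet.label and result.label names follow the convention described above and are not verified against the extension):

    # eSet.label is assumed to hold the NAME of the label attribute
    target <- eSet[[eSet.label]]                           # label values
    predictors <- eSet[, setdiff(names(eSet), eSet.label)]
    # to hand the role back, define the companion variable for the
    # returned data frame before the script finishes
    result <- eSet
    result.label <- eSet.label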

    Greetings,
      Sebastian
  • dragoljub Member Posts: 241 Contributor II
    Hi Sebastian,

    I would love to help out. I don't use R that much because of the memory issues I encountered; I just resorted to writing operators directly in Java. Still, having R scripts work transparently in the RM flow will be very useful for custom data manipulation. I think the key is to automatically pass special attributes through the R script without user intervention. It's nice to have access to them within R, but in general users rarely, if ever, change their special attributes. They want to focus on data manipulation and keep the special attributes available later in the flow for plotting or table viewing.

    One quick thing I would like to implement is an L2-norm function, which is very easy to code in R, so perhaps that could be the next thing to test to see how RM handles normalizing large datasets.
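
    For instance, a minimal sketch of such a script, assuming the example set is imported as x and each example (row) should be scaled to unit Euclidean length (untested against the extension):

    # Euclidean (L2) norm of every row
    norms <- sqrt(rowSums(x^2))
    # leave all-zero rows unchanged to avoid dividing by zero
    norms[norms == 0] <- 1
    # dividing the data frame by the vector recycles it down each
    # column, i.e. row-wise, giving unit-length examples
    y <- as.data.frame(x / norms)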

    -Gagi