I'm trying to perform some (relatively) extensive computations
with RapidMiner 4.2 under Linux.

In the RapidMiner script I set MAX_JAVA_MEMORY=4000
(I have 4 GB RAM).

The model I'm using is basically from the 01_ParameterOptimization
example from the meta-sample directory:

<?xml version="1.0" encoding="US-ASCII"?>
<process version="4.2">

  <operator name="Root" class="Process" expanded="yes">
      <description text="#ylt#p#ygt# Often the different operators have many parameters and it is not clear which parameter values are best for the learning task at hand. The parameter optimization operator helps to find an optimal parameter set for the used operators. #ylt#/p#ygt#  #ylt#p#ygt# The inner crossvalidation estimates the performance for each parameter set. In this experiment two parameters of the SVM are tuned. The result can be plotted in 3D (using gnuplot) or in color mode. #ylt#/p#ygt#  #ylt#p#ygt# Try the following: #ylt#ul#ygt# #ylt#li#ygt#Start the experiment. The result is the best parameter set and the performance which was achieved with this parameter set.#ylt#/li#ygt# #ylt#li#ygt#Edit the parameter list of the ParameterOptimization operator to find another parameter set.#ylt#/li#ygt# #ylt#/ul#ygt# #ylt#/p#ygt# "/>
      <operator name="Input" class="ExampleSource">
          <parameter key="attributes" value="../data/polynomial.aml"/>
      </operator>
      <operator name="ParameterOptimization" class="GridParameterOptimization" expanded="yes">
          <list key="parameters">
            <parameter key="RandomTree.minimal_leaf_size" value="[1.0;2.147483647E9;10;linear]"/>
            <parameter key="RandomTree.maximal_depth" value="[-1.0;2.147483647E9;10;linear]"/>
            <parameter key="RandomTree.subset_ratio" value="[-1.0;1.0;10;linear]"/>
            <parameter key="RandomTree.criterion" value="gain_ratio,information_gain,gini_index,accuracy"/>
            <parameter key="RandomTree.confidence" value="[1.0E-7;0.5;10;linear]"/>
          </list>
          <operator name="Validation" class="XValidation" expanded="yes">
              <parameter key="sampling_type" value="shuffled sampling"/>
              <operator name="RandomTree" class="RandomTree">
                  <parameter key="confidence" value="0.050000090000000004"/>
                  <parameter key="maximal_depth" value="1503238553"/>
                  <parameter key="minimal_leaf_size" value="1717986918"/>
                  <parameter key="subset_ratio" value="0.0"/>
              </operator>
              <operator name="ApplierChain" class="OperatorChain" expanded="yes">
                  <operator name="Test" class="ModelApplier">
                      <list key="application_parameters">
                      </list>
                  </operator>
                  <operator name="Evaluation" class="RegressionPerformance">
                      <parameter key="absolute_error" value="true"/>
                      <parameter key="normalized_absolute_error" value="true"/>
                      <parameter key="root_mean_squared_error" value="true"/>
                      <parameter key="squared_error" value="true"/>
                  </operator>
              </operator>
          </operator>
          <operator name="Log" class="ProcessLog" activated="no">
              <parameter key="filename" value="paraopt.log"/>
              <list key="log">
                <parameter key="C" value="operator.Training.parameter.C"/>
                <parameter key="degree" value="operator.Training.parameter.degree"/>
                <parameter key="absolute" value="operator.Validation.value.performance"/>
              </list>
          </operator>
      </operator>
      <operator name="ParameterSetWriter" class="ParameterSetWriter" activated="no">
          <parameter key="parameter_file" value="parameters.par"/>
      </operator>
      <operator name="GnuplotWriter" class="GnuplotWriter" activated="no">
          <parameter key="additional_parameters" value="set grid"/>
          <parameter key="name" value="Log"/>
          <parameter key="output_file" value="parameter_optimization.gnu"/>
          <parameter key="values" value="absolute"/>
          <parameter key="x_axis" value="C"/>
          <parameter key="y_axis" value="degree"/>
      </operator>
  </operator>

</process>

The computation of that model never succeeds since I get an
"OutOfMemoryError caught: Java heap space".

I'm a little bit surprised since I just want to find the best combination
among 5 parameters (see GridParameterOptimization), which obviously entails
lots of computations but is a realistic problem with an acceptable
complexity. So why does RapidMiner/Java consume that much memory?
I mean, 4 GB should be OK, and I would expect RapidMiner/Java to
make full use of this memory without running into an exception.

How can I solve this problem? Are there any ways to manually run a
garbage collection after some iterations of the validation? Or can
Java be set up such that it restricts its process memory usage at
the cost of speed without running into an exception?

I can hardly imagine that my problem is too complex for a machine
learning tool and that problems of that complexity have not been
solved by RapidMiner in the past. :-)
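
For a sense of scale, the size of the grid defined in the process above can be estimated with a few lines of Python. This is only a rough sketch: it assumes that a range of the form "[min;max;steps;scale]" yields steps + 1 values (both end points included), and the helper names `count_values` and `grid_size` are made up for this illustration. It counts parameter combinations only and says nothing about the cause of the heap error.

```python
from math import prod

# Grid definitions copied from the ParameterOptimization operator above.
GRID = {
    "RandomTree.minimal_leaf_size": "[1.0;2.147483647E9;10;linear]",
    "RandomTree.maximal_depth": "[-1.0;2.147483647E9;10;linear]",
    "RandomTree.subset_ratio": "[-1.0;1.0;10;linear]",
    "RandomTree.criterion": "gain_ratio,information_gain,gini_index,accuracy",
    "RandomTree.confidence": "[1.0E-7;0.5;10;linear]",
}


def count_values(spec: str) -> int:
    """Return the number of grid points for one parameter definition.

    Assumption: a range "[min;max;steps;scale]" produces steps + 1 values.
    A comma-separated list produces one value per entry.
    """
    if spec.startswith("[") and spec.endswith("]"):
        _low, _high, steps, _scale = spec[1:-1].split(";")
        return int(steps) + 1
    return len(spec.split(","))


def grid_size(grid: dict) -> int:
    """Multiply the per-parameter counts to get the number of combinations."""
    return prod(count_values(spec) for spec in grid.values())


if __name__ == "__main__":
    print("parameter combinations:", grid_size(GRID))
```

Under that assumption the four numeric ranges contribute 11 values each and the criterion list contributes 4, so the grid holds 11 * 11 * 11 * 11 * 4 = 58,564 combinations, and each of them is evaluated by a full cross-validation.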


Re: OutOfMemoryError

Hi Martin,
I admit, this behavior is rather strange. But if you take a look at the log, you will see a warning: "[Warning] ParameterOptimization: Cannot evaluate performance for current parameter combination: This learning scheme does not have sufficient capabilities for the given data set: numerical label not supported".
You tried to use a RandomTree for regression, which is not possible, since RandomTree is only capable of classification tasks.
If you replace the RandomTree with, for example, the linear regression, then it should run. If you want to optimize the RandomTree, use a data set with a nominal (polynominal) label.


Re: OutOfMemoryError

Hi Sebastian,

Thank you for your hint. With your suggestions the heap error
does not occur any more. :-)