Optimize Selection (FORWARDS): what is it that fills up the memory?

mafern76 Member Posts: 45 Contributor II
edited November 2018 in Help
I'm running a forward selection with a linear regression on 300/300 (T/F) examples with 800 attributes, using a 3-fold X-Validation (Parallel) on 8 threads.

Within a couple hundred tryouts with just 1 attribute, memory maxes out at about 10 GB and execution time increases exponentially. Maybe these pictures help to understand:

image
image

I was wondering why this is happening, as some kind of garbage data seems to keep filling up my memory at an ever-increasing rate.

I thought maybe the X-Val subsets were left in memory and that was the problem, but it wasn't: without the X-Val the issue is the same.

I tried a Free Memory operator after the X-Val (inside Optimize Selection), but that seems to make it worse.

I can't see why this memory consumption would be normal. Something is wrong, right? There must be something I can do about it.

Thanks a lot for your insight.

Regards.

Answers

  • mafern76 Member Posts: 45 Contributor II
    Another example with the Forward Selection operator, plotted with some jitter to show memory consumption.

    I wonder what happens at around 4750 validations.

    image
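
The validation counter mentioned above can be related to forward-selection rounds with a little arithmetic. The following is a minimal sketch, assuming a plain greedy forward selection that evaluates every remaining attribute once per round, that each evaluation increments the Validation apply count by one, and that no early stopping occurs. The function names are hypothetical and this is not RapidMiner's implementation. It does not explain the memory growth; it only shows how much work the setup implies.

```python
def candidates_per_round(n_attributes, max_rounds):
    """Candidate attribute sets evaluated in each round of a plain greedy
    forward selection (assumption: one candidate per remaining attribute)."""
    return [n_attributes - k for k in range(max_rounds)]


def round_for_validation(n_attributes, max_rounds, validation_count):
    """Return the 1-based round in which the given validation falls,
    or None if the count exceeds the total number of evaluations."""
    seen = 0
    rounds = candidates_per_round(n_attributes, max_rounds)
    for round_no, n_candidates in enumerate(rounds, start=1):
        seen += n_candidates
        if validation_count <= seen:
            return round_no
    return None


# Values taken from the thread: 800 attributes, at most 50 selected, 3 folds.
total_candidates = sum(candidates_per_round(800, 50))
print(total_candidates)                      # upper bound on candidate sets
print(total_candidates * 3)                  # model fits with 3-fold X-Val
print(round_for_validation(800, 50, 4750))   # round containing validation 4750
```

Under these assumptions, 800 attributes and at most 50 rounds give an upper bound of 38775 candidate sets (116325 model fits with 3 folds), and validation 4750 falls in round 6, which would end at validation 4785. Whether that round boundary is related to the change in the plot is not something the thread confirms.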
  • Marco_Boeck Team Lead Software Engineering Administrator, Moderator, Employee, Member, University Professor Posts: 1,938   RM Engineering
    Hi,

    can you please copy your process setup (xml) here?

    Regards,
    Marco
  • mafern76 Member Posts: 45 Contributor II
    Sure thing!! Thanks for your help!!

    Some metadata:

    800 attributes, binominal classification, 300/300 examples.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="read_csv" compatibility="5.3.015" expanded="true" height="60" name="Read CSV" width="90" x="45" y="120">
           <parameter key="column_separators" value=","/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations">
             <parameter key="0" value="Name"/>
           </list>
           <parameter key="encoding" value="windows-1252"/>
         </operator>
         <operator activated="true" class="sample" compatibility="5.3.015" expanded="true" height="76" name="Sample" width="90" x="447" y="120">
           <parameter key="balance_data" value="true"/>
           <list key="sample_size_per_class">
             <parameter key="F" value="300"/>
             <parameter key="T" value="300"/>
           </list>
           <list key="sample_ratio_per_class"/>
           <list key="sample_probability_per_class"/>
         </operator>
         <operator activated="true" class="free_memory" compatibility="5.3.015" expanded="true" height="76" name="Free Memory" width="90" x="581" y="120"/>
         <operator activated="true" class="optimize_selection_forward" compatibility="5.3.015" expanded="true" height="94" name="Forward Selection" width="90" x="581" y="345">
           <parameter key="maximal_number_of_attributes" value="50"/>
           <process expanded="true">
             <operator activated="true" class="parallel:x_validation_parallel" compatibility="5.3.000" expanded="true" height="112" name="Validation" width="90" x="447" y="210">
               <parameter key="number_of_validations" value="3"/>
               <parameter key="number_of_threads" value="8"/>
               <process expanded="true">
                 <operator activated="true" class="linear_regression" compatibility="5.3.015" expanded="true" height="94" name="Linear Regression" width="90" x="216" y="30">
                   <parameter key="feature_selection" value="none"/>
                   <parameter key="eliminate_colinear_features" value="false"/>
                   <parameter key="ridge" value="0.200171731253423"/>
                 </operator>
                 <connect from_port="training" to_op="Linear Regression" to_port="training set"/>
                 <connect from_op="Linear Regression" from_port="model" to_port="model"/>
                 <portSpacing port="source_training" spacing="0"/>
                 <portSpacing port="sink_model" spacing="0"/>
                 <portSpacing port="sink_through 1" spacing="0"/>
               </process>
               <process expanded="true">
                 <operator activated="true" class="apply_model" compatibility="5.3.015" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                   <list key="application_parameters"/>
                 </operator>
                 <operator activated="true" class="performance_binominal_classification" compatibility="5.3.015" expanded="true" height="76" name="Performance" width="90" x="330" y="30">
                   <parameter key="main_criterion" value="AUC"/>
                   <parameter key="accuracy" value="false"/>
                   <parameter key="AUC" value="true"/>
                 </operator>
                 <connect from_port="model" to_op="Apply Model" to_port="model"/>
                 <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
                 <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
                 <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
                 <portSpacing port="source_model" spacing="0"/>
                 <portSpacing port="source_test set" spacing="0"/>
                 <portSpacing port="source_through 1" spacing="0"/>
                 <portSpacing port="sink_averagable 1" spacing="0"/>
                 <portSpacing port="sink_averagable 2" spacing="0"/>
               </process>
             </operator>
             <operator activated="true" class="log" compatibility="5.3.015" expanded="true" height="76" name="Log" width="90" x="581" y="75">
               <list key="log">
                 <parameter key="time" value="operator.Process.value.time"/>
                 <parameter key="memory" value="operator.Process.value.memory"/>
                 <parameter key="val_count" value="operator.Validation.value.applycount"/>
                 <parameter key="perfo" value="operator.Validation.value.performance"/>
                 <parameter key="feat_names" value="operator.Forward Selection.value.feature_names"/>
                 <parameter key="number_of_attributes" value="operator.Forward Selection.value.number of attributes"/>
               </list>
             </operator>
             <connect from_port="example set" to_op="Validation" to_port="training"/>
             <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
             <connect from_op="Log" from_port="through 1" to_port="performance"/>
             <portSpacing port="source_example set" spacing="0"/>
             <portSpacing port="sink_performance" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
         <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
         <connect from_op="Set Role" from_port="example set output" to_op="Sample" to_port="example set input"/>
         <connect from_op="Sample" from_port="example set output" to_op="Free Memory" to_port="through 1"/>
         <connect from_op="Free Memory" from_port="through 1" to_op="Forward Selection" to_port="example set"/>
         <connect from_op="Forward Selection" from_port="attribute weights" to_port="result 1"/>
         <connect from_op="Forward Selection" from_port="performance" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>
    This is the forward selection via the Optimize Selection operator. It uses the default settings, except that show stop dialog and user result individual selection are enabled. Everything else is the same as before.
    <operator activated="true" class="optimize_selection" compatibility="5.3.015" expanded="true" height="94" name="Optimize Selection" width="90" x="581" y="345">
           <parameter key="show_stop_dialog" value="true"/>
           <parameter key="user_result_individual_selection" value="true"/>
    </operator>
    On the other hand, I'm running an evolutionary selection with roughly 400 attributes:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="parallel:optimize_selection_evolutionary_parallel" compatibility="5.3.000" expanded="true" height="94" name="Optimize Selection (2)" width="90" x="648" y="255">
       <parameter key="use_exact_number_of_attributes" value="false"/>
       <parameter key="restrict_maximum" value="false"/>
       <parameter key="min_number_of_attributes" value="5"/>
       <parameter key="max_number_of_attributes" value="1"/>
       <parameter key="exact_number_of_attributes" value="1"/>
       <parameter key="initialize_with_input_weights" value="false"/>
       <parameter key="population_size" value="30"/>
       <parameter key="maximum_number_of_generations" value="300"/>
       <parameter key="use_early_stopping" value="true"/>
       <parameter key="generations_without_improval" value="30"/>
       <parameter key="normalize_weights" value="true"/>
       <parameter key="use_local_random_seed" value="false"/>
       <parameter key="local_random_seed" value="1992"/>
       <parameter key="show_stop_dialog" value="true"/>
       <parameter key="user_result_individual_selection" value="true"/>
       <parameter key="show_population_plotter" value="false"/>
       <parameter key="plot_generations" value="10"/>
       <parameter key="constraint_draw_range" value="false"/>
       <parameter key="draw_dominated_points" value="true"/>
       <parameter key="maximal_fitness" value="Infinity"/>
       <parameter key="selection_scheme" value="tournament"/>
       <parameter key="tournament_size" value="0.25"/>
       <parameter key="start_temperature" value="1.0"/>
       <parameter key="dynamic_selection_pressure" value="true"/>
       <parameter key="keep_best_individual" value="false"/>
       <parameter key="save_intermediate_weights" value="false"/>
       <parameter key="intermediate_weights_generations" value="10"/>
       <parameter key="p_initialize" value="0.5"/>
       <parameter key="p_mutation" value="-1.0"/>
       <parameter key="p_crossover" value="0.5"/>
       <parameter key="crossover_type" value="uniform"/>
       <parameter key="number_of_threads" value="6"/>
       <parameter key="parallelize_evaluation_process" value="false"/>
       <process expanded="true">
         <operator activated="true" class="x_validation" compatibility="5.3.015" expanded="true" height="112" name="Validation" width="90" x="313" y="75">
           <parameter key="create_complete_model" value="false"/>
           <parameter key="average_performances_only" value="true"/>
           <parameter key="leave_one_out" value="false"/>
           <parameter key="number_of_validations" value="3"/>
           <parameter key="sampling_type" value="stratified sampling"/>
           <parameter key="use_local_random_seed" value="false"/>
           <parameter key="local_random_seed" value="1992"/>
           <parameter key="parallelize_training" value="false"/>
           <parameter key="parallelize_testing" value="false"/>
           <process expanded="true">
             <operator activated="true" class="linear_regression" compatibility="5.3.015" expanded="true" height="94" name="Linear Regression" width="90" x="45" y="30">
               <parameter key="feature_selection" value="none"/>
               <parameter key="alpha" value="0.05"/>
               <parameter key="max_iterations" value="10"/>
               <parameter key="forward_alpha" value="0.05"/>
               <parameter key="backward_alpha" value="0.05"/>
               <parameter key="eliminate_colinear_features" value="false"/>
               <parameter key="min_tolerance" value="0.05"/>
               <parameter key="use_bias" value="true"/>
               <parameter key="ridge" value="0.200171731253423"/>
             </operator>
             <connect from_port="training" to_op="Linear Regression" to_port="training set"/>
             <connect from_op="Linear Regression" from_port="model" to_port="model"/>
             <portSpacing port="source_training" spacing="0"/>
             <portSpacing port="sink_model" spacing="0"/>
             <portSpacing port="sink_through 1" spacing="0"/>
           </process>
           <process expanded="true">
             <operator activated="true" class="apply_model" compatibility="5.3.015" expanded="true" height="76" name="Apply Model" width="90" x="246" y="30">
               <list key="application_parameters"/>
               <parameter key="create_view" value="false"/>
             </operator>
             <operator activated="true" class="performance_binominal_classification" compatibility="5.3.015" expanded="true" height="76" name="Performance" width="90" x="447" y="30">
               <parameter key="main_criterion" value="first"/>
               <parameter key="accuracy" value="false"/>
               <parameter key="classification_error" value="false"/>
               <parameter key="kappa" value="false"/>
               <parameter key="AUC (optimistic)" value="false"/>
               <parameter key="AUC" value="true"/>
               <parameter key="AUC (pessimistic)" value="false"/>
               <parameter key="precision" value="false"/>
               <parameter key="recall" value="false"/>
               <parameter key="lift" value="false"/>
               <parameter key="fallout" value="false"/>
               <parameter key="f_measure" value="false"/>
               <parameter key="false_positive" value="false"/>
               <parameter key="false_negative" value="false"/>
               <parameter key="true_positive" value="false"/>
               <parameter key="true_negative" value="false"/>
               <parameter key="sensitivity" value="false"/>
               <parameter key="specificity" value="false"/>
               <parameter key="youden" value="false"/>
               <parameter key="positive_predictive_value" value="false"/>
               <parameter key="negative_predictive_value" value="false"/>
               <parameter key="psep" value="false"/>
               <parameter key="skip_undefined_labels" value="true"/>
               <parameter key="use_example_weights" value="true"/>
             </operator>
             <connect from_port="model" to_op="Apply Model" to_port="model"/>
             <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
             <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
             <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
             <portSpacing port="source_model" spacing="0"/>
             <portSpacing port="source_test set" spacing="0"/>
             <portSpacing port="source_through 1" spacing="0"/>
             <portSpacing port="sink_averagable 1" spacing="0"/>
             <portSpacing port="sink_averagable 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="log" compatibility="5.3.015" expanded="true" height="76" name="Log" width="90" x="581" y="120">
           <list key="log">
             <parameter key="time" value="operator.Process.value.time"/>
             <parameter key="memo" value="operator.Process.value.memory"/>
             <parameter key="val_count" value="operator.Validation.value.applycount"/>
             <parameter key="perfo" value="operator.Validation.value.performance"/>
             <parameter key="feat_names" value="operator.Optimize Selection (2).value.feature_names"/>
             <parameter key="gen" value="operator.Optimize Selection (2).value.generation"/>
           </list>
           <parameter key="sorting_type" value="none"/>
           <parameter key="sorting_k" value="100"/>
           <parameter key="persistent" value="false"/>
         </operator>
         <connect from_port="example set" to_op="Validation" to_port="training"/>
         <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
         <connect from_op="Log" from_port="through 1" to_port="performance"/>
         <portSpacing port="source_example set" spacing="0"/>
         <portSpacing port="source_through 1" spacing="0"/>
         <portSpacing port="sink_performance" spacing="0"/>
       </process>
     </operator>
    </process>
    And it has no memory problems at all: TOTAL stays at about 700 MB, while the forward selection filled memory up to 11 GB. Also, time per val_count is linear.

    Thanks!!
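
To check claims such as "time per val_count is linear" or "execution time increases exponentially" against the Log operator's output, one option is to fit a slope on a log-log scale. The sketch below uses made-up numbers rather than data from this thread, and `loglog_slope` is a hypothetical helper.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).

    For y proportional to x**p this returns approximately p."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    den = sum((a - mean_x) ** 2 for a in log_x)
    return num / den


# Hypothetical logged values, NOT taken from the thread.
val_count = [100, 200, 400, 800, 1600]
linear_time = [2.0 * v for v in val_count]           # time grows like v
quadratic_time = [0.01 * v * v for v in val_count]   # time grows like v**2

print(round(loglog_slope(val_count, linear_time), 2))
print(round(loglog_slope(val_count, quadratic_time), 2))
```

A slope close to 1 indicates linear growth and a slope close to 2 quadratic growth. Truly exponential growth does not produce a constant slope, so the estimate keeps rising as more points are added.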

  • Marco_Boeck Team Lead Software Engineering Administrator, Moderator, Employee, Member, University Professor Posts: 1,938   RM Engineering
    Hi,

    I have created an issue in our internal tracker. Unfortunately there is not much else I can do at this point :-\

    Regards,
    Marco
  • mafern76 Member Posts: 45 Contributor II
    Thanks Marco, regards.
  • Dominik_Halfkan Member Posts: 2 Contributor I
    Hi, I'm the student developer in charge of fixing this bug, but I'm having problems reproducing it. I set a max memory of 1 GB, generated some data with 1200 examples (600 true, 600 false) and 800 attributes, and fed it into your process, but it never seems to consume more than the configured 1 GB of memory. Do you ever get an OutOfMemoryException or some other exception when you run your process, or does it just fill up your memory while the application keeps working?

    Here is the process I used, which doesn't seem to cause memory problems:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.004">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="generate_data" compatibility="6.0.004" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
           <parameter key="number_examples" value="600"/>
           <parameter key="number_of_attributes" value="800"/>
         </operator>
         <operator activated="true" class="select_attributes" compatibility="6.0.004" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="30">
           <parameter key="attribute_filter_type" value="single"/>
           <parameter key="attribute" value="label"/>
           <parameter key="invert_selection" value="true"/>
           <parameter key="include_special_attributes" value="true"/>
         </operator>
         <operator activated="true" class="multiply" compatibility="6.0.004" expanded="true" height="94" name="Multiply" width="90" x="112" y="210"/>
         <operator activated="true" class="generate_attributes" compatibility="6.0.004" expanded="true" height="76" name="Generate Attributes" width="90" x="246" y="165">
           <list key="function_descriptions">
             <parameter key="nominal_label" value="&quot;false&quot;"/>
           </list>
         </operator>
         <operator activated="true" class="set_role" compatibility="6.0.004" expanded="true" height="76" name="Set Role" width="90" x="380" y="165">
           <parameter key="attribute_name" value="nominal_label"/>
           <parameter key="target_role" value="label"/>
           <list key="set_additional_roles"/>
         </operator>
         <operator activated="true" class="generate_attributes" compatibility="6.0.004" expanded="true" height="76" name="Generate Attributes (2)" width="90" x="246" y="255">
           <list key="function_descriptions">
             <parameter key="nominal_label" value="&quot;true&quot;"/>
           </list>
         </operator>
         <operator activated="true" class="set_role" compatibility="6.0.004" expanded="true" height="76" name="Set Role (2)" width="90" x="380" y="255">
           <parameter key="attribute_name" value="nominal_label"/>
           <parameter key="target_role" value="label"/>
           <list key="set_additional_roles"/>
         </operator>
         <operator activated="true" class="append" compatibility="6.0.004" expanded="true" height="94" name="Append" width="90" x="514" y="210"/>
         <operator activated="true" class="nominal_to_binominal" compatibility="6.0.004" expanded="true" height="94" name="Nominal to Binominal" width="90" x="648" y="210">
           <parameter key="attribute_filter_type" value="single"/>
           <parameter key="attribute" value="nominal_label"/>
           <parameter key="include_special_attributes" value="true"/>
         </operator>
         <operator activated="true" class="sample" compatibility="6.0.004" expanded="true" height="76" name="Sample" width="90" x="447" y="390">
           <parameter key="balance_data" value="true"/>
           <list key="sample_size_per_class">
             <parameter key="false" value="300"/>
           </list>
           <list key="sample_ratio_per_class"/>
           <list key="sample_probability_per_class"/>
         </operator>
         <operator activated="true" class="free_memory" compatibility="6.0.004" expanded="true" height="76" name="Free Memory" width="90" x="581" y="390"/>
         <operator activated="true" class="optimize_selection_forward" compatibility="6.0.004" expanded="true" height="94" name="Forward Selection" width="90" x="782" y="345">
           <parameter key="maximal_number_of_attributes" value="50"/>
           <process expanded="true">
             <operator activated="true" class="parallel:x_validation_parallel" compatibility="5.3.000" expanded="true" height="112" name="Validation" width="90" x="447" y="210">
               <parameter key="number_of_validations" value="3"/>
               <parameter key="number_of_threads" value="8"/>
               <process expanded="true">
                 <operator activated="true" class="linear_regression" compatibility="6.0.004" expanded="true" height="94" name="Linear Regression" width="90" x="216" y="30">
                   <parameter key="feature_selection" value="none"/>
                   <parameter key="eliminate_colinear_features" value="false"/>
                   <parameter key="ridge" value="0.200171731253423"/>
                 </operator>
                 <connect from_port="training" to_op="Linear Regression" to_port="training set"/>
                 <connect from_op="Linear Regression" from_port="model" to_port="model"/>
                 <portSpacing port="source_training" spacing="0"/>
                 <portSpacing port="sink_model" spacing="0"/>
                 <portSpacing port="sink_through 1" spacing="0"/>
               </process>
               <process expanded="true">
                 <operator activated="true" class="apply_model" compatibility="6.0.004" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                   <list key="application_parameters"/>
                 </operator>
                 <operator activated="true" class="performance_binominal_classification" compatibility="6.0.004" expanded="true" height="76" name="Performance" width="90" x="330" y="30">
                   <parameter key="main_criterion" value="AUC"/>
                   <parameter key="accuracy" value="false"/>
                   <parameter key="AUC" value="true"/>
                 </operator>
                 <connect from_port="model" to_op="Apply Model" to_port="model"/>
                 <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
                 <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
                 <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
                 <portSpacing port="source_model" spacing="0"/>
                 <portSpacing port="source_test set" spacing="0"/>
                 <portSpacing port="source_through 1" spacing="0"/>
                 <portSpacing port="sink_averagable 1" spacing="0"/>
                 <portSpacing port="sink_averagable 2" spacing="0"/>
               </process>
             </operator>
             <operator activated="true" class="log" compatibility="6.0.004" expanded="true" height="76" name="Log" width="90" x="581" y="75">
               <parameter key="filename" value="/home/halfkann/Documents/memory_leak.log"/>
               <list key="log">
                 <parameter key="time" value="operator.Process.value.time"/>
                 <parameter key="memory" value="operator.Process.value.memory"/>
                 <parameter key="val_count" value="operator.Validation.value.applycount"/>
                 <parameter key="perfo" value="operator.Validation.value.performance"/>
                 <parameter key="feat_names" value="operator.Forward Selection.value.feature_names"/>
                 <parameter key="number_of_attributes" value="operator.Forward Selection.value.number of attributes"/>
               </list>
             </operator>
             <connect from_port="example set" to_op="Validation" to_port="training"/>
             <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
             <connect from_op="Log" from_port="through 1" to_port="performance"/>
             <portSpacing port="source_example set" spacing="0"/>
             <portSpacing port="sink_performance" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Generate Data" from_port="output" to_op="Select Attributes" to_port="example set input"/>
         <connect from_op="Select Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
         <connect from_op="Multiply" from_port="output 1" to_op="Generate Attributes" to_port="example set input"/>
         <connect from_op="Multiply" from_port="output 2" to_op="Generate Attributes (2)" to_port="example set input"/>
         <connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
         <connect from_op="Set Role" from_port="example set output" to_op="Append" to_port="example set 1"/>
         <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
         <connect from_op="Set Role (2)" from_port="example set output" to_op="Append" to_port="example set 2"/>
         <connect from_op="Append" from_port="merged set" to_op="Nominal to Binominal" to_port="example set input"/>
         <connect from_op="Nominal to Binominal" from_port="example set output" to_op="Sample" to_port="example set input"/>
         <connect from_op="Sample" from_port="example set output" to_op="Free Memory" to_port="through 1"/>
         <connect from_op="Free Memory" from_port="through 1" to_op="Forward Selection" to_port="example set"/>
         <connect from_op="Forward Selection" from_port="attribute weights" to_port="result 1"/>
         <connect from_op="Forward Selection" from_port="performance" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>