Incremental Results in Optimize Parameters?

srt19170 Member Posts: 44 Contributor II
edited November 2018 in Help
I have an Optimize Parameters process with 80 features that runs a long time and invariably crashes with memory problems.  Is there a way to have the process incrementally report the best solution, so that I could (for example) restart using the last reported solution?  Thanks for any help!

-- Scott Turner

Answers

  • homburg Moderator, Employee, Member Posts: 114 RM Data Scientist
    Hi Scott.

    There are a few ways to address this. First, you can try allocating more Java heap memory with the well-known -Xmx JVM switch. You can also try to reduce the complexity of your process. Finally (and this is what you asked for), you can use the Log operator to enable incremental logging. Just drag a Log operator into your optimization operator and define which values you want to log. Very important: do not forget to enable the persistent check box. This ensures that the file is written immediately after each optimization step. Here is an example of what such a process may look like:
    <process version="5.1.009">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.1.009" expanded="true" name="Root">  
       <process expanded="true" height="584" width="962">
         <operator activated="true" class="retrieve" compatibility="5.1.009" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
           <parameter key="repository_entry" value="../../data/Polynomial"/>
         </operator>
         <operator activated="true" class="optimize_parameters_grid" compatibility="5.1.009" expanded="true" height="94" name="ParameterOptimization" width="90" x="246" y="30">
           <list key="parameters">
             <parameter key="Training.C" value="50,100,150,200,250"/>
             <parameter key="Training.degree" value="1,2,3,4,5"/>
           </list>
           <process expanded="true" height="626" width="806">
             <operator activated="true" class="x_validation" compatibility="5.1.009" expanded="true" height="112" name="Validation" width="90" x="45" y="30">
               <parameter key="sampling_type" value="shuffled sampling"/>
               <process expanded="true" height="272" width="334">
                 <operator activated="true" class="support_vector_machine_libsvm" compatibility="5.1.009" expanded="true" height="76" name="Training" width="90" x="126" y="30">
                   <parameter key="svm_type" value="epsilon-SVR"/>
                   <parameter key="kernel_type" value="poly"/>
                   <parameter key="degree" value="5"/>
                   <parameter key="C" value="250"/>
                   <parameter key="epsilon" value="0.01"/>
                   <list key="class_weights"/>
                 </operator>
                 <connect from_port="training" to_op="Training" to_port="training set"/>
                 <connect from_op="Training" from_port="model" to_port="model"/>
                 <portSpacing port="source_training" spacing="0"/>
                 <portSpacing port="sink_model" spacing="0"/>
                 <portSpacing port="sink_through 1" spacing="0"/>
               </process>
               <process expanded="true" height="272" width="334">
                 <operator activated="true" class="apply_model" compatibility="5.1.009" expanded="true" height="76" name="Test" width="90" x="45" y="30">
                   <list key="application_parameters"/>
                 </operator>
                 <operator activated="true" class="performance_regression" compatibility="5.1.009" expanded="true" height="76" name="Evaluation" width="90" x="194" y="30">
                   <parameter key="absolute_error" value="true"/>
                   <parameter key="normalized_absolute_error" value="true"/>
                   <parameter key="squared_error" value="true"/>
                 </operator>
                 <connect from_port="model" to_op="Test" to_port="model"/>
                 <connect from_port="test set" to_op="Test" to_port="unlabelled data"/>
                 <connect from_op="Test" from_port="labelled data" to_op="Evaluation" to_port="labelled data"/>
                 <connect from_op="Evaluation" from_port="performance" to_port="averagable 1"/>
                 <portSpacing port="source_model" spacing="0"/>
                 <portSpacing port="source_test set" spacing="0"/>
                 <portSpacing port="source_through 1" spacing="0"/>
                 <portSpacing port="sink_averagable 1" spacing="0"/>
                 <portSpacing port="sink_averagable 2" spacing="0"/>
               </process>
             </operator>
             <operator activated="true" class="log" compatibility="5.1.009" expanded="true" height="76" name="Log" width="90" x="246" y="75">
               <parameter key="filename" value="paraopt.log"/>
               <list key="log">
                 <parameter key="C" value="operator.Training.parameter.C"/>
                 <parameter key="degree" value="operator.Training.parameter.degree"/>
                 <parameter key="absolute" value="operator.Validation.value.performance"/>
               </list>
               <parameter key="persistent" value="true"/>
             </operator>
             <connect from_port="input 1" to_op="Validation" to_port="training"/>
             <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
             <connect from_op="Log" from_port="through 1" to_port="performance"/>
             <portSpacing port="source_input 1" spacing="0"/>
             <portSpacing port="source_input 2" spacing="0"/>
             <portSpacing port="sink_performance" spacing="0"/>
             <portSpacing port="sink_result 1" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Retrieve" from_port="output" to_op="ParameterOptimization" to_port="input 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
       </process>
     </operator>
    </process>
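If the process crashes partway through, the persistent log still holds every step completed so far, so the best logged combination can be picked out and used as the starting point for a restart. Below is a minimal Python sketch of that idea. It is not part of RapidMiner: the function name best_logged_row is hypothetical, and it assumes that paraopt.log contains optional comment lines starting with "#" followed by whitespace-separated numeric columns in the order defined in the Log operator above (C, degree, absolute). It also assumes a lower value of the logged performance is better; pass minimize=False if your criterion should be maximized.

```python
from typing import Dict, Iterable, Optional, Sequence


def best_logged_row(
    lines: Iterable[str],
    columns: Sequence[str] = ("C", "degree", "absolute"),  # assumed column order
    criterion: str = "absolute",
    minimize: bool = True,  # assumption: lower logged performance is better
) -> Optional[Dict[str, float]]:
    """Return the best complete row found in the log lines, or None."""
    best: Optional[Dict[str, float]] = None
    for line in lines:
        line = line.strip()
        # Skip blank lines and (assumed) '#'-prefixed header/comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # A crash can leave a truncated final line; ignore incomplete rows.
        if len(fields) != len(columns):
            continue
        try:
            row = dict(zip(columns, (float(f) for f in fields)))
        except ValueError:
            continue  # non-numeric debris, skip it
        if best is None:
            best = row
        elif minimize and row[criterion] < best[criterion]:
            best = row
        elif not minimize and row[criterion] > best[criterion]:
            best = row
    return best
```

To use it, open the log file and pass the file object, for example with open("paraopt.log") as f: print(best_logged_row(f)). The returned values of C and degree could then be used to narrow the grid in the next run.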
    Cheers,
       Helge
  • srt19170 Member Posts: 44 Contributor II
    Helge --

    Thanks for your response, but I'm afraid I was sloppy in my original post and wrote that I was optimizing parameters, when in fact I'm optimizing feature selection.  So a simplified version of my process looks like this:

    <process version="5.1.009">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.009" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="parallelize_main_process" value="false"/>
        <process expanded="true" height="694" width="740">
          <operator activated="true" class="read_csv" compatibility="5.1.009" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
            <parameter key="csv_file" value="H:\PRIVATE\Personal\Lisp\NCAA\Predictor12\Data\all-data-train.csv"/>
            <parameter key="column_separators" value=",\s*|;\s*"/>
            <parameter key="trim_lines" value="true"/>
            <parameter key="use_quotes" value="true"/>
            <parameter key="quotes_character" value="&quot;"/>
            <parameter key="escape_character_for_quotes" value="\"/>
            <parameter key="skip_comments" value="true"/>
            <parameter key="comment_characters" value="#"/>
            <parameter key="parse_numbers" value="true"/>
            <parameter key="decimal_character" value="."/>
            <parameter key="grouped_digits" value="false"/>
            <parameter key="grouping_character" value=","/>
            <parameter key="date_format" value="yyyy-MM-dd"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
            </list>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="Date.true.date.attribute"/>
              <parameter key="1" value="Line.true.binominal.attribute"/>
              <parameter key="2" value="Hname.true.polynominal.attribute"/>
              <parameter key="3" value="Hscore.true.real.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
            <parameter key="datamanagement" value="double_array"/>
          </operator>
          <operator activated="true" class="optimize_selection" compatibility="5.1.009" expanded="true" height="94" name="Optimize Selection" width="90" x="313" y="30">
            <parameter key="selection_direction" value="forward"/>
            <parameter key="limit_generations_without_improval" value="true"/>
            <parameter key="generations_without_improval" value="2"/>
            <parameter key="limit_number_of_generations" value="false"/>
            <parameter key="keep_best" value="1"/>
            <parameter key="maximum_number_of_generations" value="20"/>
            <parameter key="normalize_weights" value="true"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="show_stop_dialog" value="true"/>
            <parameter key="user_result_individual_selection" value="false"/>
            <parameter key="show_population_plotter" value="false"/>
            <parameter key="plot_generations" value="10"/>
            <parameter key="constraint_draw_range" value="false"/>
            <parameter key="draw_dominated_points" value="true"/>
            <parameter key="maximal_fitness" value="Infinity"/>
            <parameter key="parallelize_evaluation_process" value="false"/>
            <process expanded="true" height="759" width="650">
              <operator activated="true" class="linear_regression" compatibility="5.1.009" expanded="true" height="94" name="Linear Regression" width="90" x="112" y="30">
                <parameter key="feature_selection" value="M5 prime"/>
                <parameter key="alpha" value="0.05"/>
                <parameter key="max_iterations" value="10"/>
                <parameter key="forward_alpha" value="0.05"/>
                <parameter key="backward_alpha" value="0.05"/>
                <parameter key="eliminate_colinear_features" value="true"/>
                <parameter key="min_tolerance" value="0.05"/>
                <parameter key="use_bias" value="true"/>
                <parameter key="ridge" value="1.0E-8"/>
              </operator>
              <operator activated="true" class="apply_model" compatibility="5.1.009" expanded="true" height="76" name="Apply Model (3)" width="90" x="246" y="30">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance_regression" compatibility="5.1.009" expanded="true" height="76" name="Performance (3)" width="90" x="380" y="30">
                <parameter key="main_criterion" value="first"/>
                <parameter key="root_mean_squared_error" value="true"/>
                <parameter key="absolute_error" value="false"/>
                <parameter key="relative_error" value="false"/>
                <parameter key="relative_error_lenient" value="false"/>
                <parameter key="relative_error_strict" value="false"/>
                <parameter key="normalized_absolute_error" value="false"/>
                <parameter key="root_relative_squared_error" value="false"/>
                <parameter key="squared_error" value="false"/>
                <parameter key="correlation" value="false"/>
                <parameter key="squared_correlation" value="false"/>
                <parameter key="prediction_average" value="false"/>
                <parameter key="spearman_rho" value="false"/>
                <parameter key="kendall_tau" value="false"/>
                <parameter key="skip_undefined_labels" value="true"/>
                <parameter key="use_example_weights" value="true"/>
              </operator>
              <connect from_port="example set" to_op="Linear Regression" to_port="training set"/>
              <connect from_op="Linear Regression" from_port="model" to_op="Apply Model (3)" to_port="model"/>
              <connect from_op="Linear Regression" from_port="exampleSet" to_op="Apply Model (3)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
              <connect from_op="Performance (3)" from_port="performance" to_port="performance"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Optimize Selection" to_port="example set in"/>
          <connect from_op="Optimize Selection" from_port="example set out" to_port="result 2"/>
          <connect from_op="Optimize Selection" from_port="performance" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    I can certainly insert a Log operator into the evaluation process, but I don't see anything useful to log there.  I want to log the current set of selected features (along with the performance) but I don't see any way to get the current set of selected features.  Any ideas?

    -- Scott