Cross Validation with Smote Upsampling

Oprick Member Posts: 35 Contributor II
edited July 2019 in Help
Hi all,
I see that there are already some discussions in this community about this subject, but I still have some doubts.

I have a process with a class imbalance, where the minority class is the most important one. SMOTE upsampling seems to give good results. I say "seems" because I am not sure how to validate it correctly.

My approach was to train the model on upsampled data and test it on a 20% holdout set (partitioned before upsampling).

I guess this is the most correct thing to do, because real data is not upsampled. But what is the most correct way to validate the model? I used the 20% holdout set in the testing part of the Cross Validation operator (using the Remember and Recall operators).

What are your thoughts?
Please trash my approach if you think it deserves it :smile:

(I have enclosed a mock example data set and the RM process file.)

Thanks,
Pedro

Best Answer

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    edited February 2019 Solution Accepted
    Rather than using the holdout approach, I would recommend putting your SMOTE upsampling inside the cross-validation itself, in the training subprocess, so that in each fold only the training data is upsampled and the test data is left as it is.
    The problem with your approach is that the results are highly dependent on the initial holdout sample, which is only drawn once.
    See the revised process attached. Your process won't actually run because you didn't set the label, but once you do that you should be able to compare your original process to my revised version.
    <process version="9.2.000">
    <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="9.2.000" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
            <parameter key="excel_file" value="C:\Users\brian\Downloads\Data Source Example.xlsx"/>
            <parameter key="sheet_selection" value="sheet number"/>
            <parameter key="sheet_number" value="1"/>
            <parameter key="imported_cell_range" value="A1"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="date_format" value=""/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="V1a.true.integer.attribute"/>
              <parameter key="1" value="V1b.true.integer.attribute"/>
              <parameter key="2" value="V2a.true.integer.attribute"/>
              <parameter key="3" value="V2b.true.integer.attribute"/>
              <parameter key="4" value="CR1.true.integer.attribute"/>
              <parameter key="5" value="Score.true.real.attribute"/>
              <parameter key="6" value="Date.true.date_time.attribute"/>
              <parameter key="7" value="Week.true.integer.attribute"/>
              <parameter key="8" value="Year.true.integer.attribute"/>
              <parameter key="9" value="COUNTRYOFORIGIN.true.polynominal.attribute"/>
              <parameter key="10" value="xScore.true.real.attribute"/>
              <parameter key="11" value="yScore.true.real.attribute"/>
              <parameter key="12" value="mat.true.integer.attribute"/>
              <parameter key="13" value="Average of Solar Rad Avg.true.real.attribute"/>
              <parameter key="14" value="Average of mV.true.real.attribute"/>
              <parameter key="15" value="Average of Air ºC Avg.true.real.attribute"/>
              <parameter key="16" value="Max of Air ºC Avg Max.true.real.attribute"/>
              <parameter key="17" value="Min of Air ºC Avg Min.true.real.attribute"/>
              <parameter key="18" value="Average of Hr % Avg.true.real.attribute"/>
              <parameter key="19" value="Average of Dew Avg.true.real.attribute"/>
              <parameter key="20" value="Min of Dew Min.true.real.attribute"/>
              <parameter key="21" value="Average of Leaf Wet.true.real.attribute"/>
              <parameter key="22" value="Sum of mmpp.true.real.attribute"/>
              <parameter key="23" value="EXP.true.polynominal.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="9.2.000" expanded="true" height="82" name="Prepper" width="90" x="179" y="34">
            <process expanded="true">
              <operator activated="true" class="generate_attributes" compatibility="9.2.000" expanded="true" height="82" name="Generate Attributes (7)" width="90" x="45" y="34">
                <list key="function_descriptions">
                  <parameter key="Month" value="date_get(Date,DATE_UNIT_MONTH)+1"/>
                  <parameter key="Vn" value="(V1a+V1b)*(sqrt(yScore))*(sqrt([Average of Hr % Avg]))"/>
                </list>
                <parameter key="keep_all" value="true"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="9.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value="App Score|Average of Air ºC Avg|Average of Dew Avg|Average of Hr % Avg|Average of Leaf Wet|Average of mV|Average of Solar Rad Avg|CR1|EXP|Max of Air ºC Avg Max|Min of Air ºC Avg Min|Min of Dew Min|Month|Sum of mmpp|V1a|V1b|Vn|V2a|V2b|mat|Score|Week|xScore|Year|yScore|COUNTRYOFORIGIN"/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
              </operator>
              <operator activated="true" class="numerical_to_polynominal" compatibility="9.2.000" expanded="true" height="82" name="Numerical to Polynominal (4)" width="90" x="313" y="34">
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value="Month|Week|Year|mat"/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="numeric"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="real"/>
                <parameter key="block_type" value="value_series"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_series_end"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
              </operator>
              <operator activated="true" class="normalize" compatibility="9.2.000" expanded="true" height="103" name="Normalize" width="90" x="514" y="34">
                <parameter key="return_preprocessing_model" value="false"/>
                <parameter key="create_view" value="false"/>
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value="Average of Air ºC Avg|Average of Dew Avg|Average of Hr % Avg|Average of Leaf Wet|Average of mV|Average of Solar Rad Avg|CR1|Max of Air ºC Avg Max|Min of Air ºC Avg Min|Min of Dew Min|Sum of mmpp|V1a|V1b|V2a|V2b|xScore|Score|yScore|Vn"/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="numeric"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="real"/>
                <parameter key="block_type" value="value_series"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_series_end"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
                <parameter key="method" value="Z-transformation"/>
                <parameter key="min" value="0.0"/>
                <parameter key="max" value="1.0"/>
                <parameter key="allow_negative_values" value="false"/>
              </operator>
              <connect from_port="in 1" to_op="Generate Attributes (7)" to_port="example set input"/>
              <connect from_op="Generate Attributes (7)" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Numerical to Polynominal (4)" to_port="example set input"/>
              <connect from_op="Numerical to Polynominal (4)" from_port="example set output" to_op="Normalize" to_port="example set input"/>
              <connect from_op="Normalize" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="concurrency:cross_validation" compatibility="9.2.000" expanded="true" height="145" name="S (2)" width="90" x="447" y="34">
            <parameter key="split_on_batch_attribute" value="false"/>
            <parameter key="leave_one_out" value="false"/>
            <parameter key="number_of_folds" value="5"/>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="enable_parallel_execution" value="false"/>
            <process expanded="true">
              <operator activated="false" class="split_data" compatibility="9.2.000" expanded="true" height="82" name="Split Data" width="90" x="246" y="136">
                <enumeration key="partitions">
                  <parameter key="ratio" value="0.8"/>
                  <parameter key="ratio" value="0.2"/>
                </enumeration>
                <parameter key="sampling_type" value="automatic"/>
                <parameter key="use_local_random_seed" value="false"/>
                <parameter key="local_random_seed" value="1992"/>
              </operator>
              <operator activated="true" class="operator_toolbox:smote" compatibility="1.8.000" expanded="true" height="82" name="SMOTE Upsampling" width="90" x="112" y="34">
                <parameter key="number_of_neighbours" value="5"/>
                <parameter key="normalize" value="true"/>
                <parameter key="equalize_classes" value="true"/>
                <parameter key="upsampling_size" value="1000"/>
                <parameter key="auto_detect_minority_class" value="true"/>
                <parameter key="round_integers" value="true"/>
                <parameter key="nominal_change_rate" value="0.5"/>
                <parameter key="use_local_random_seed" value="false"/>
                <parameter key="local_random_seed" value="1992"/>
              </operator>
              <operator activated="true" class="stacking" compatibility="9.2.000" expanded="true" height="68" name="Stacking" width="90" x="313" y="34">
                <parameter key="keep_all_attributes" value="true"/>
                <process expanded="true">
                  <operator activated="true" class="h2o:deep_learning" compatibility="9.2.000" expanded="true" height="82" name="Deep Learning" width="90" x="112" y="34">
                    <parameter key="activation" value="Rectifier"/>
                    <enumeration key="hidden_layer_sizes">
                      <parameter key="hidden_layer_sizes" value="50"/>
                      <parameter key="hidden_layer_sizes" value="50"/>
                    </enumeration>
                    <enumeration key="hidden_dropout_ratios"/>
                    <parameter key="reproducible_(uses_1_thread)" value="false"/>
                    <parameter key="use_local_random_seed" value="false"/>
                    <parameter key="local_random_seed" value="1992"/>
                    <parameter key="epochs" value="10.0"/>
                    <parameter key="compute_variable_importances" value="false"/>
                    <parameter key="train_samples_per_iteration" value="-2"/>
                    <parameter key="adaptive_rate" value="true"/>
                    <parameter key="epsilon" value="1.0E-8"/>
                    <parameter key="rho" value="0.99"/>
                    <parameter key="learning_rate" value="0.005"/>
                    <parameter key="learning_rate_annealing" value="1.0E-6"/>
                    <parameter key="learning_rate_decay" value="1.0"/>
                    <parameter key="momentum_start" value="0.0"/>
                    <parameter key="momentum_ramp" value="1000000.0"/>
                    <parameter key="momentum_stable" value="0.0"/>
                    <parameter key="nesterov_accelerated_gradient" value="true"/>
                    <parameter key="standardize" value="true"/>
                    <parameter key="L1" value="1.0E-5"/>
                    <parameter key="L2" value="0.0"/>
                    <parameter key="max_w2" value="10.0"/>
                    <parameter key="loss_function" value="Automatic"/>
                    <parameter key="distribution_function" value="AUTO"/>
                    <parameter key="early_stopping" value="false"/>
                    <parameter key="stopping_rounds" value="1"/>
                    <parameter key="stopping_metric" value="AUTO"/>
                    <parameter key="stopping_tolerance" value="0.001"/>
                    <parameter key="missing_values_handling" value="MeanImputation"/>
                    <parameter key="max_runtime_seconds" value="0"/>
                    <list key="expert_parameters"/>
                    <list key="expert_parameters_"/>
                  </operator>
                  <operator activated="true" class="concurrency:parallel_random_forest" compatibility="9.2.000" expanded="true" height="103" name="Random Forest" width="90" x="112" y="136">
                    <parameter key="number_of_trees" value="100"/>
                    <parameter key="criterion" value="gain_ratio"/>
                    <parameter key="maximal_depth" value="10"/>
                    <parameter key="apply_pruning" value="false"/>
                    <parameter key="confidence" value="0.1"/>
                    <parameter key="apply_prepruning" value="false"/>
                    <parameter key="minimal_gain" value="0.01"/>
                    <parameter key="minimal_leaf_size" value="2"/>
                    <parameter key="minimal_size_for_split" value="4"/>
                    <parameter key="number_of_prepruning_alternatives" value="3"/>
                    <parameter key="random_splits" value="false"/>
                    <parameter key="guess_subset_ratio" value="true"/>
                    <parameter key="subset_ratio" value="0.2"/>
                    <parameter key="voting_strategy" value="confidence vote"/>
                    <parameter key="use_local_random_seed" value="false"/>
                    <parameter key="local_random_seed" value="1992"/>
                    <parameter key="enable_parallel_execution" value="true"/>
                  </operator>
                  <operator activated="true" class="naive_bayes_kernel" compatibility="9.2.000" expanded="true" height="82" name="Naive Bayes (Kernel)" width="90" x="112" y="289">
                    <parameter key="laplace_correction" value="true"/>
                    <parameter key="estimation_mode" value="greedy"/>
                    <parameter key="bandwidth_selection" value="heuristic"/>
                    <parameter key="bandwidth" value="0.1"/>
                    <parameter key="minimum_bandwidth" value="0.1"/>
                    <parameter key="number_of_kernels" value="10"/>
                    <parameter key="use_application_grid" value="false"/>
                    <parameter key="application_grid_size" value="200"/>
                  </operator>
                  <connect from_port="training set 1" to_op="Deep Learning" to_port="training set"/>
                  <connect from_port="training set 2" to_op="Random Forest" to_port="training set"/>
                  <connect from_port="training set 3" to_op="Naive Bayes (Kernel)" to_port="training set"/>
                  <connect from_op="Deep Learning" from_port="model" to_port="base model 1"/>
                  <connect from_op="Random Forest" from_port="model" to_port="base model 2"/>
                  <connect from_op="Naive Bayes (Kernel)" from_port="model" to_port="base model 3"/>
                  <portSpacing port="source_training set 1" spacing="0"/>
                  <portSpacing port="source_training set 2" spacing="0"/>
                  <portSpacing port="source_training set 3" spacing="0"/>
                  <portSpacing port="source_training set 4" spacing="0"/>
                  <portSpacing port="sink_base model 1" spacing="0"/>
                  <portSpacing port="sink_base model 2" spacing="0"/>
                  <portSpacing port="sink_base model 3" spacing="0"/>
                  <portSpacing port="sink_base model 4" spacing="0"/>
                </process>
                <process expanded="true">
                  <operator activated="true" class="h2o:deep_learning" compatibility="9.2.000" expanded="true" height="82" name="Deep Learning (2)" width="90" x="112" y="34">
                    <parameter key="activation" value="Rectifier"/>
                    <enumeration key="hidden_layer_sizes">
                      <parameter key="hidden_layer_sizes" value="50"/>
                      <parameter key="hidden_layer_sizes" value="50"/>
                    </enumeration>
                    <enumeration key="hidden_dropout_ratios"/>
                    <parameter key="reproducible_(uses_1_thread)" value="false"/>
                    <parameter key="use_local_random_seed" value="false"/>
                    <parameter key="local_random_seed" value="1992"/>
                    <parameter key="epochs" value="10.0"/>
                    <parameter key="compute_variable_importances" value="false"/>
                    <parameter key="train_samples_per_iteration" value="-2"/>
                    <parameter key="adaptive_rate" value="true"/>
                    <parameter key="epsilon" value="1.0E-8"/>
                    <parameter key="rho" value="0.99"/>
                    <parameter key="learning_rate" value="0.005"/>
                    <parameter key="learning_rate_annealing" value="1.0E-6"/>
                    <parameter key="learning_rate_decay" value="1.0"/>
                    <parameter key="momentum_start" value="0.0"/>
                    <parameter key="momentum_ramp" value="1000000.0"/>
                    <parameter key="momentum_stable" value="0.0"/>
                    <parameter key="nesterov_accelerated_gradient" value="true"/>
                    <parameter key="standardize" value="true"/>
                    <parameter key="L1" value="1.0E-5"/>
                    <parameter key="L2" value="0.0"/>
                    <parameter key="max_w2" value="10.0"/>
                    <parameter key="loss_function" value="Automatic"/>
                    <parameter key="distribution_function" value="AUTO"/>
                    <parameter key="early_stopping" value="false"/>
                    <parameter key="stopping_rounds" value="1"/>
                    <parameter key="stopping_metric" value="AUTO"/>
                    <parameter key="stopping_tolerance" value="0.001"/>
                    <parameter key="missing_values_handling" value="MeanImputation"/>
                    <parameter key="max_runtime_seconds" value="0"/>
                    <list key="expert_parameters"/>
                    <list key="expert_parameters_"/>
                  </operator>
                  <connect from_port="stacking examples" to_op="Deep Learning (2)" to_port="training set"/>
                  <connect from_op="Deep Learning (2)" from_port="model" to_port="stacking model"/>
                  <portSpacing port="source_stacking examples" spacing="0"/>
                  <portSpacing port="sink_stacking model" spacing="0"/>
                </process>
              </operator>
              <operator activated="false" class="remember" compatibility="9.2.000" expanded="true" height="68" name="Remember" width="90" x="380" y="187">
                <parameter key="name" value="S test set"/>
                <parameter key="io_object" value="ExampleSet"/>
                <parameter key="store_which" value="1"/>
                <parameter key="remove_from_process" value="true"/>
              </operator>
              <connect from_port="training set" to_op="SMOTE Upsampling" to_port="exa"/>
              <connect from_op="Split Data" from_port="partition 1" to_op="Remember" to_port="store"/>
              <connect from_op="SMOTE Upsampling" from_port="ups" to_op="Stacking" to_port="training set"/>
              <connect from_op="Stacking" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="false" class="recall" compatibility="9.2.000" expanded="true" height="68" name="Recall" width="90" x="45" y="187">
                <parameter key="name" value="S test set"/>
                <parameter key="io_object" value="ExampleSet"/>
                <parameter key="remove_from_store" value="false"/>
              </operator>
              <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model (3)" width="90" x="45" y="34">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="9.2.000" expanded="true" height="82" name="Performance Straws" width="90" x="179" y="34">
                <parameter key="main_criterion" value="classification_error"/>
                <parameter key="accuracy" value="true"/>
                <parameter key="classification_error" value="true"/>
                <parameter key="kappa" value="true"/>
                <parameter key="weighted_mean_recall" value="false"/>
                <parameter key="weighted_mean_precision" value="false"/>
                <parameter key="spearman_rho" value="false"/>
                <parameter key="kendall_tau" value="false"/>
                <parameter key="absolute_error" value="false"/>
                <parameter key="relative_error" value="false"/>
                <parameter key="relative_error_lenient" value="false"/>
                <parameter key="relative_error_strict" value="false"/>
                <parameter key="normalized_absolute_error" value="false"/>
                <parameter key="root_mean_squared_error" value="false"/>
                <parameter key="root_relative_squared_error" value="false"/>
                <parameter key="squared_error" value="false"/>
                <parameter key="correlation" value="false"/>
                <parameter key="squared_correlation" value="false"/>
                <parameter key="cross-entropy" value="false"/>
                <parameter key="margin" value="false"/>
                <parameter key="soft_margin_loss" value="false"/>
                <parameter key="logistic_loss" value="false"/>
                <parameter key="skip_undefined_labels" value="true"/>
                <parameter key="use_example_weights" value="true"/>
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model (3)" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model (3)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance Straws" to_port="labelled data"/>
              <connect from_op="Performance Straws" from_port="performance" to_port="performance 1"/>
              <connect from_op="Performance Straws" from_port="example set" to_port="test set results"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Prepper" to_port="in 1"/>
          <connect from_op="Prepper" from_port="out 1" to_op="S (2)" to_port="example set"/>
          <connect from_op="S (2)" from_port="model" to_port="result 1"/>
          <connect from_op="S (2)" from_port="test result set" to_port="result 2"/>
          <connect from_op="S (2)" from_port="performance 1" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
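
For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch using only the standard library. It is not the Operator Toolbox SMOTE operator and not the process above: the function names `smote` and `folds_with_upsampling` are made up for illustration, the interpolation is a simplified SMOTE for numeric attributes only, two classes are assumed, and the stratified fold split is an assumption of this sketch. The only point it tries to show is the ordering: split into folds first, then upsample the training part of each fold and leave the test part untouched.

```python
import random


def smote(minority, n_new, k=5, rng=None):
    """Return n_new synthetic points, each interpolated between a random
    minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    if len(minority) < 2:
        return []  # nothing to interpolate between
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort the remaining minority points by squared distance to base.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def folds_with_upsampling(rows, labels, minority_label, n_folds=5, seed=2001):
    """Yield (train_rows, train_labels, test_rows, test_labels) for each fold.
    Only the training part is upsampled; the test part stays as observed."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)
    # Stable sort by label, then take every n_folds-th index: this spreads
    # each class evenly over the folds (a simple stratified split).
    order.sort(key=lambda i: labels[i])
    for fold in range(n_folds):
        test_idx = set(order[fold::n_folds])
        train_rows = [rows[i] for i in order if i not in test_idx]
        train_labels = [labels[i] for i in order if i not in test_idx]
        minority = [r for r, y in zip(train_rows, train_labels) if y == minority_label]
        # Two classes assumed: majority count minus minority count.
        n_missing = len(train_rows) - 2 * len(minority)
        new_rows = smote(minority, n_missing, rng=rng)
        yield (train_rows + new_rows,
               train_labels + [minority_label] * len(new_rows),
               [rows[i] for i in sorted(test_idx)],
               [labels[i] for i in sorted(test_idx)])


# Mock data: 40 majority points (label 0) and 10 minority points (label 1).
data_rng = random.Random(1)
rows = ([(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(40)]
        + [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(10)])
labels = [0] * 40 + [1] * 10
for train_x, train_y, test_x, test_y in folds_with_upsampling(rows, labels, 1):
    # A learner would be fitted on (train_x, train_y) and scored on (test_x, test_y).
    print(len(train_x), train_y.count(1), len(test_x), test_y.count(1))
```

In this demo every training fold ends up balanced, while every test fold keeps the original class ratio and contains no synthetic rows. Averaging the performance over the folds then avoids depending on a single holdout draw.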
    



    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts

Answers

  • Oprick Member Posts: 35 Contributor II
    @Telcontar120 many thanks for your reply. Indeed it makes much more sense than my approach.

    Regards,
    Pedro