
Parameter Optimization - Why do you optimize on test data

SebastianB12 Member Posts: 2 Contributor I
edited November 2018 in Help

Hi all,

I'm wondering why, in the tutorial process and also in other processes posted here in the forum, parameters are optimized on the test data and not on the training data. If I understand the Optimize Parameters (Grid) operator correctly, it optimizes the performance vector connected to its sink. To me, optimizing on out-of-sample test data is a kind of cheating. Sometimes this even yields better test performance than training performance (not in this case), which is rather strange ;).

Could you clarify that for me?

I changed the tutorial process a bit to show you how I would connect everything. But perhaps I just misunderstand the optimization operator.

<?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.2.002" expanded="true" height="68" name="Weighting" width="90" x="112" y="30">
<parameter key="repository_entry" value="//Samples/data/Weighting"/>
</operator>
<operator activated="true" class="optimize_parameters_grid" compatibility="6.0.003" expanded="true" height="124" name="Optimize Parameters (Grid)" width="90" x="313" y="30">
<list key="parameters">
<parameter key="SVM.C" value="[0.001;100000;10;logarithmic]"/>
<parameter key="SVM.gamma" value="[0.001;1.5;10;logarithmic]"/>
</list>
<parameter key="error_handling" value="fail on error"/>
<process expanded="true">
<operator activated="true" class="split_data" compatibility="7.2.002" expanded="true" height="103" name="Split Data" width="90" x="45" y="30">
<enumeration key="partitions">
<parameter key="ratio" value="0.5"/>
<parameter key="ratio" value="0.5"/>
</enumeration>
<parameter key="sampling_type" value="shuffled sampling"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<operator activated="true" class="support_vector_machine_libsvm" compatibility="7.2.002" expanded="true" height="82" name="SVM" width="90" x="179" y="255">
<parameter key="svm_type" value="C-SVC"/>
<parameter key="kernel_type" value="rbf"/>
<parameter key="degree" value="3"/>
<parameter key="gamma" value="1.5"/>
<parameter key="coef0" value="0.0"/>
<parameter key="C" value="100000.0"/>
<parameter key="nu" value="0.5"/>
<parameter key="cache_size" value="80"/>
<parameter key="epsilon" value="0.001"/>
<parameter key="p" value="0.1"/>
<list key="class_weights"/>
<parameter key="shrinking" value="true"/>
<parameter key="calculate_confidences" value="false"/>
<parameter key="confidence_for_multiclass" value="true"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.2.002" expanded="true" height="103" name="Multiply" width="90" x="313" y="120"/>
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="447" y="30">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="447" y="255">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.2.002" expanded="true" height="82" name="Performance In Sample" width="90" x="581" y="238">
<parameter key="main_criterion" value="first"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="false"/>
<parameter key="kappa" value="false"/>
<parameter key="weighted_mean_recall" value="false"/>
<parameter key="weighted_mean_precision" value="false"/>
<parameter key="spearman_rho" value="false"/>
<parameter key="kendall_tau" value="false"/>
<parameter key="absolute_error" value="false"/>
<parameter key="relative_error" value="false"/>
<parameter key="relative_error_lenient" value="false"/>
<parameter key="relative_error_strict" value="false"/>
<parameter key="normalized_absolute_error" value="false"/>
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="root_relative_squared_error" value="false"/>
<parameter key="squared_error" value="false"/>
<parameter key="correlation" value="false"/>
<parameter key="squared_correlation" value="false"/>
<parameter key="cross-entropy" value="false"/>
<parameter key="margin" value="false"/>
<parameter key="soft_margin_loss" value="false"/>
<parameter key="logistic_loss" value="false"/>
<parameter key="skip_undefined_labels" value="true"/>
<parameter key="use_example_weights" value="true"/>
<list key="class_weights"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.2.002" expanded="true" height="82" name="Performance Out of Sample" width="90" x="581" y="30">
<parameter key="main_criterion" value="first"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="false"/>
<parameter key="kappa" value="false"/>
<parameter key="weighted_mean_recall" value="false"/>
<parameter key="weighted_mean_precision" value="false"/>
<parameter key="spearman_rho" value="false"/>
<parameter key="kendall_tau" value="false"/>
<parameter key="absolute_error" value="false"/>
<parameter key="relative_error" value="false"/>
<parameter key="relative_error_lenient" value="false"/>
<parameter key="relative_error_strict" value="false"/>
<parameter key="normalized_absolute_error" value="false"/>
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="root_relative_squared_error" value="false"/>
<parameter key="squared_error" value="false"/>
<parameter key="correlation" value="false"/>
<parameter key="squared_correlation" value="false"/>
<parameter key="cross-entropy" value="false"/>
<parameter key="margin" value="false"/>
<parameter key="soft_margin_loss" value="false"/>
<parameter key="logistic_loss" value="false"/>
<parameter key="skip_undefined_labels" value="true"/>
<parameter key="use_example_weights" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="input 1" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Split Data" from_port="partition 2" to_op="SVM" to_port="training set"/>
<connect from_op="SVM" from_port="model" to_op="Multiply" to_port="input"/>
<connect from_op="SVM" from_port="exampleSet" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Multiply" from_port="output 1" to_op="Apply Model" to_port="model"/>
<connect from_op="Multiply" from_port="output 2" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance Out of Sample" to_port="labelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance In Sample" to_port="labelled data"/>
<connect from_op="Performance In Sample" from_port="performance" to_port="performance"/>
<connect from_op="Performance Out of Sample" from_port="performance" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
<operator activated="false" class="legacy:write_parameters" compatibility="7.2.002" expanded="true" height="68" name="Write Parameters" width="90" x="514" y="30">
<parameter key="parameter_file" value="D:\parameters.txt"/>
<parameter key="encoding" value="SYSTEM"/>
</operator>
<connect from_op="Weighting" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
<connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
<connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 2"/>
<connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
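To make the concern concrete, here is a small Python sketch (toy data, purely illustrative): when many candidate parameter settings are scored on the same test set and the best one is picked, the winning test score is optimistically biased even if every candidate is pure noise.

```python
import random

random.seed(2001)

# 200 test examples with random binary labels: no model can truly beat 50 %.
labels = [random.randint(0, 1) for _ in range(200)]

def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

# Score 100 "parameter settings", each of which is just a random predictor.
scores = [accuracy([random.randint(0, 1) for _ in labels], labels)
          for _ in range(100)]

mean = sum(scores) / len(scores)
best = max(scores)
print(f"mean accuracy: {mean:.3f}")  # close to chance level
print(f"best accuracy: {best:.3f}")  # clearly above chance, yet meaningless
```

Picking the configuration with the best score on the shared test set rewards luck, which is exactly why the selected score should not double as the final performance estimate.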

Thanks a lot!

Cheers

Sebastian

Best Answer

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist
    Solution Accepted

    Hi Sebastian,

     

    Sorry @BalazsBarany, I need to disagree. Taking the performance of an optimization is biased: you can easily build yourself an overtrained model this way. If you think of the optimization as a kind of model fitting in itself, you can clearly see that there needs to be an outer validation around it.

     

     

    That's why, in my opinion, the best way to do this is the one Sebastian points out.

     

    What you can do is use an outer X-Validation around the parameter optimization instead of the hold-out set, but that is usually not feasible from a runtime perspective.

     

    Talking about runtime: it might also be necessary to use a split validation inside the optimization instead of an X-Validation, for the same reason.
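    An outer X-Validation around the optimization can be sketched in plain Python (toy data and a toy threshold "model", both made up for illustration; the parameter fully determines this model, so no real training step is needed):

```python
import random

random.seed(1992)

# Toy data: label is 1 when x exceeds 0.6, flipped with 10 % label noise.
data = [(x, int(x > 0.6) ^ (random.random() < 0.1))
        for x in (random.random() for _ in range(300))]

GRID = [0.2, 0.4, 0.6, 0.8]  # candidate parameter values

def accuracy(threshold, examples):
    return sum((x > threshold) == bool(y) for x, y in examples) / len(examples)

def folds(examples, k):
    return [examples[i::k] for i in range(k)]

def grid_search(train, k=5):
    """Inner cross-validation: pick the threshold with the best averaged
    fold accuracy. Only the training data is ever used in here."""
    parts = folds(train, k)
    def cv_score(t):
        return sum(accuracy(t, part) for part in parts) / k
    return max(GRID, key=cv_score)

# Outer cross-validation *around* the optimization: each outer test fold
# is untouched by the parameter selection it evaluates.
outer = folds(data, 5)
outer_scores = []
for i in range(5):
    test = outer[i]
    train = [ex for j, part in enumerate(outer) if j != i for ex in part]
    best_t = grid_search(train)
    outer_scores.append(accuracy(best_t, test))

print(f"unbiased accuracy estimate: {sum(outer_scores) / 5:.3f}")
```

    Each outer test fold only ever scores a parameter value that was selected without seeing it, which is what makes the averaged outer estimate unbiased.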

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany

Answers

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    Hi!

     

    What your process does is a Split Validation. This is appropriate if you have a lot of data.

     

    However, in most scenarios, a 10-fold cross validation (X-Validation in RapidMiner) is better as it 1. works on a larger percentage of the data and 2. tests all examples of the input set. (The drawback is the longer runtime, as you're building N+1 models.)

     

    If you optimize on the performance of a cross-validation, you're not cheating: the cross-validation makes sure that the test data never went into the models.

     

    Regards,

     

    Balázs

  • SebastianB12 Member Posts: 2 Contributor I

    Hi!

    Thank you for your answer!

    Just to double-check that I understood correctly: the "best/correct" way to train a model and optimize its hyperparameters in RapidMiner is:

    1. Split up the dataset in training and test data (e.g. stratified sampling)
    2. Take the training set and perform a parameter optimization (e.g. grid) with a nested X-Validation
    3. Take the optimized model and test it on the test data to get an unbiased estimate of its performance.

    Is that correct? That is how I understood your answer and scenario 2 from this FAQ: http://sebastianraschka.com/faq/docs/evaluate-a-model.html
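    Those three steps can be sketched in plain Python (toy data and a toy threshold "model", both invented for illustration):

```python
import random

random.seed(42)

# Toy data: label is 1 when x exceeds 0.6, flipped with 10 % label noise.
data = [(x, int(x > 0.6) ^ (random.random() < 0.1))
        for x in (random.random() for _ in range(400))]

def accuracy(threshold, examples):
    return sum((x > threshold) == bool(y) for x, y in examples) / len(examples)

# Step 1: split off a holdout test set that the optimization never sees.
random.shuffle(data)
train, test = data[:300], data[300:]

# Step 2: grid search driven by cross-validation on the training set only.
def cv_score(threshold, k=10):
    parts = [train[i::k] for i in range(k)]
    return sum(accuracy(threshold, part) for part in parts) / k

best_t = max([0.2, 0.4, 0.6, 0.8], key=cv_score)

# Step 3: a single final evaluation on the untouched holdout set.
print(f"chosen parameter: {best_t}")
print(f"holdout accuracy: {accuracy(best_t, test):.3f}")
```

    The holdout in step 3 is touched exactly once, after the grid search has committed to a parameter, so its accuracy is an unbiased estimate.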


    The other way would be to just perform step 2 on the whole data set and take the estimated performance of the X-Validation as the holdout performance. But if I understand the X-Validation and parameter optimization operators correctly, those results would be biased.

     

    Thanks again for your help!!

     

    Cheers

    Sebastian

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    Hi Sebastian,

     

    the results of a cross-validation aren't biased. The cross-validation process makes sure never to test on data that were used for building the model being tested.

     

    So it's a valid approach to send your entire dataset into Optimize Parameters (Grid) and do a 10-fold X-Validation inside it.

     

    If you have a lot of data (or can get new data by, e.g., waiting a day), you can of course evaluate the optimized model on entirely new data.

     
