Cross-Validation and Grid-Search Optimization

lsevel Member Posts: 18 Contributor II
edited December 2018 in Help

Hi all,

 

I was wondering if I could get some clarification on the correct nesting and setting of parameters to use grid-search optimization within n-fold cross-validation. I assume the optimization operator is nested within cross-validation to select parameters as described in this article: https://rapidminer.com/learn-right-way-validate-models-part-4-accidental-contamination/

 

How should the Set Parameters operator be used to correctly set the optimized parameter on the model that is applied to the data after optimization has been performed?

 

Any clarification on these processes would be helpful,

 

Thank you

Answers

  • lsevel Member Posts: 18 Contributor II

    Thank you for this video. 

     

    Just to confirm, if the cross-validation process is nested within a parameter optimization process, will the parameters be optimized for each iteration of cross-validation? My concern is that placing cross-validation inside of the optimize process will optimize on the entire data set rather than separately for each fold, resulting in contamination of the training and testing data. 

     

     

  • FBT Member Posts: 106 Unicorn

    RapidMiner executes everything that is nested within the "Optimize Parameters" operator once for each possible combination of parameters (as you define them). Hence, if you have a list of e.g. 121 possible combinations of parameters, RM will run 121 k-fold cross-validations, one for each parameter combination. It is therefore important not to try too many combinations at once; otherwise your process can run for a very long time, depending on the size and structure of your data.
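
To make the arithmetic concrete, here is a minimal Python sketch (not RapidMiner code) that counts the work a grid search triggers. The two parameter names, their 11 candidate values each, and the choice of 10 folds are assumptions made up for illustration; only the figure of 121 combinations comes from the post above.

```python
from itertools import product

# Hypothetical grid: two parameters with 11 candidate values each.
grid = {
    "param_a": [i / 10 for i in range(11)],  # 0.0, 0.1, ..., 1.0
    "param_b": list(range(11)),              # 0, 1, ..., 10
}
k_folds = 10  # assumed number of folds in the cross-validation inside the optimizer

combinations = list(product(*grid.values()))
n_cross_validations = len(combinations)       # one full CV per combination
n_fold_fits = n_cross_validations * k_folds   # one model fit per fold per combination

print(n_cross_validations)  # 121
print(n_fold_fits)          # 1210
```

The count covers only the per-fold fits; any final model trained after the folds would add to it. Wrapping the whole thing in an outer cross-validation multiplies the total again by the number of outer folds.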

     

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    You are right about that: if you optimize the parameters on the same dataset that you later use to train the model, the resulting performance estimate will be biased.

    Edit: To estimate the model error correctly, you actually need an independent validation dataset. It is fine to optimize parameters on the training set.

     

    Here is an article on the matter:

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1397873/

     

    Best,

    Sebastian

  • lsevel Member Posts: 18 Contributor II

    Thank you all for these responses.

    To clarify, is it correct that to optimize parameters without bias, the optimization process must be nested within the cross-validation, rather than the cross-validation nested within the optimization?

     

    Thank you all

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    The less biased estimate would come from the following:

    an outer CV with Optimize Parameters on its training side

    +

    a CV inside the Optimize Parameters operator

    That could take quite a while, depending on the model. Sometimes you can do without the outer CV, because the absolute performance of a model is rarely useful (it is more important to compare different models, or to use problem-specific measures such as cost savings).

    You can also speed the process up by using fewer folds and/or the Optimize Parameters (Evolutionary) operator.
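
The nesting described above can be sketched in plain Python. This is only an illustration of the control flow, not RapidMiner code: the toy one-dimensional ridge regression, the synthetic data, the grid of `lam` values, and all function names are assumptions chosen to keep the example self-contained.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed):
    """Shuffle indices 0..n-1 with a fixed seed and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_ridge(xs, ys, lam):
    """Toy 1-D ridge regression without intercept: returns the slope w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    return mean([(y - w * x) ** 2 for x, y in zip(xs, ys)])


def split(xs, ys, fold):
    """Return (train_x, train_y, test_x, test_y) for one held-out fold."""
    held_out = set(fold)
    train = [i for i in range(len(xs)) if i not in held_out]
    return ([xs[i] for i in train], [ys[i] for i in train],
            [xs[i] for i in fold], [ys[i] for i in fold])


def cross_validate(xs, ys, lam, k, seed):
    """Inner CV: mean test MSE of one parameter value over k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k, seed):
        tx, ty, vx, vy = split(xs, ys, fold)
        scores.append(mse(fit_ridge(tx, ty, lam), vx, vy))
    return mean(scores)


def optimize_parameters(xs, ys, grid, k, seed):
    """Grid search: one inner CV per candidate, then refit on all data given."""
    best_lam = min(grid, key=lambda lam: cross_validate(xs, ys, lam, k, seed))
    return best_lam, fit_ridge(xs, ys, best_lam)


def nested_cv(xs, ys, grid, outer_k, inner_k, seed):
    """Outer CV whose training side runs the whole grid search."""
    outer_scores, chosen = [], []
    for fold in k_fold_indices(len(xs), outer_k, seed):
        tx, ty, vx, vy = split(xs, ys, fold)
        lam, w = optimize_parameters(tx, ty, grid, inner_k, seed + 1)
        chosen.append(lam)
        # The outer test fold was never seen by the grid search.
        outer_scores.append(mse(w, vx, vy))
    return mean(outer_scores), chosen


rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]  # synthetic data, true slope 2
score, chosen = nested_cv(xs, ys, grid=[0.0, 0.1, 1.0, 10.0],
                          outer_k=5, inner_k=3, seed=42)
print(f"nested-CV MSE: {score:.3f}; parameter chosen per outer fold: {chosen}")
```

Note that each outer fold may select a different parameter value. The outer score estimates the performance of the whole procedure (grid search plus training), not of one fixed parameter setting.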

  • lsevel Member Posts: 18 Contributor II

    Thank you for this clarification, SGolbert. In using this approach, is it possible to feed the model parameters identified by the optimization process directly into the testing portion of CV? 

     

    Thank you

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Hi lsevel,

    I have looked into your question a bit and noticed that it is not very well described in the documentation. In short, you have two ways of doing it:

    1. Inside the Optimize Parameters operator, deliver the model from the mod port of the learner (let's say SVM) to the res port. Then, outside Optimize Parameters (but inside the outer CV), deliver that res port to the testing side.

    2. Use the Set Parameters operator. The operator help provides sufficient guidance on this one; basically, you need an extra model operator to pass the parameter set to.

    I personally find the first solution much simpler, but it is somewhat counterintuitive at first, because the documentation says nothing about what you get out of the result ports of Optimize Parameters. After some testing, I found that you get whatever results the best model delivered.

     

    Best,

    Sebastian
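
The behaviour reported for option 1 (the result ports deliver whatever the best iteration produced) can be mimicked in a few lines of Python. This is a sketch of that observation only, not of RapidMiner internals. The function names, the dictionary standing in for a trained model, and the made-up score are all hypothetical.

```python
def optimize_parameters(grid, inner_process):
    """Run inner_process once per candidate and keep the best run's outputs.

    inner_process(value) must return (performance, results), where results
    stands in for whatever is wired to the optimizer's result ports.
    """
    best = None
    for value in grid:
        performance, results = inner_process(value)
        if best is None or performance > best[0]:
            best = (performance, value, results)
    return best


def toy_inner_process(t):
    """Hypothetical stand-in for 'train a model with parameter t and score it'."""
    model = {"t": t}                  # placeholder for a trained model
    performance = -((t - 1.2) ** 2)   # made-up score that peaks at t = 1.2
    return performance, {"model": model}


best_performance, best_t, results = optimize_parameters(
    [0.5, 1.0, 1.5, 2.0], toy_inner_process
)
# What the testing side of the outer CV would receive under option 1:
model_for_testing = results["model"]
print(best_t, model_for_testing)
```

Under this reading, no separate Set Parameters step is needed, because the model that reaches the testing side was already built with the winning parameter value.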

  • lsevel Member Posts: 18 Contributor II
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve" width="90" x="45" y="85"/>
    <operator activated="true" class="x_validation" compatibility="7.6.001" expanded="true" height="124" name="SR_Xval" width="90" x="380" y="136">
    <parameter key="use_local_random_seed" value="true"/>
    <process expanded="true">
    <operator activated="true" class="normalize" compatibility="7.5.003" expanded="true" height="103" name="Normalize (5)" width="90" x="45" y="34"/>
    <operator activated="true" class="optimize_parameters_grid" compatibility="7.6.001" expanded="true" height="166" name="Optimize Parameters (5)" width="90" x="313" y="34">
    <list key="parameters">
    <parameter key="SR_LASSO.t" value="[0.001;2;200;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="x_validation" compatibility="7.6.001" expanded="true" height="166" name="SR_CV" width="90" x="313" y="85">
    <parameter key="use_local_random_seed" value="true"/>
    <process expanded="true">
    <operator activated="true" class="featselext:lars" compatibility="1.1.004" expanded="true" height="103" name="SR_LASSO" width="90" x="246" y="85">
    <parameter key="t" value="1.170415"/>
    </operator>
    <connect from_port="training" to_op="SR_LASSO" to_port="training set"/>
    <connect from_op="SR_LASSO" from_port="model" to_port="model"/>
    <portSpacing port="source_training" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    <portSpacing port="sink_through 2" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model (12)" width="90" x="45" y="30">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance" compatibility="7.6.001" expanded="true" height="82" name="Imaging_Performance (9)" width="90" x="179" y="30"/>
    <connect from_port="model" to_op="Apply Model (12)" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model (12)" to_port="unlabelled data"/>
    <connect from_port="through 1" to_port="averagable 2"/>
    <connect from_op="Apply Model (12)" from_port="labelled data" to_op="Imaging_Performance (9)" to_port="labelled data"/>
    <connect from_op="Imaging_Performance (9)" from_port="performance" to_port="averagable 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="source_through 2" spacing="0"/>
    <portSpacing port="sink_averagable 1" spacing="0"/>
    <portSpacing port="sink_averagable 2" spacing="0"/>
    <portSpacing port="sink_averagable 3" spacing="0"/>
    <portSpacing port="sink_averagable 4" spacing="0"/>
    </process>
    </operator>
    <connect from_port="input 1" to_op="SR_CV" to_port="training"/>
    <connect from_op="SR_CV" from_port="model" to_port="result 1"/>
    <connect from_op="SR_CV" from_port="averagable 1" to_port="performance"/>
    <connect from_op="SR_CV" from_port="averagable 2" to_port="result 2"/>
    <connect from_op="SR_CV" from_port="averagable 3" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    <connect from_port="training" to_op="Normalize (5)" to_port="example set input"/>
    <connect from_op="Normalize (5)" from_port="example set output" to_op="Optimize Parameters (5)" to_port="input 1"/>
    <connect from_op="Optimize Parameters (5)" from_port="result 1" to_port="model"/>
    <portSpacing port="source_training" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model (15)" width="90" x="45" y="30">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance" compatibility="7.6.001" expanded="true" height="82" name="Imaging_Performance (10)" width="90" x="179" y="30"/>
    <connect from_port="model" to_op="Apply Model (15)" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model (15)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (15)" from_port="labelled data" to_op="Imaging_Performance (10)" to_port="labelled data"/>
    <connect from_op="Imaging_Performance (10)" from_port="performance" to_port="averagable 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_averagable 1" spacing="0"/>
    <portSpacing port="sink_averagable 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve" from_port="output" to_op="SR_Xval" to_port="training"/>
    <connect from_op="SR_Xval" from_port="averagable 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Hi Sebastian,

     

    Thank you for this reply. I've pasted code based on the directions you gave (Option 1). Could you confirm that this is the correct organization?

     

    Additionally, regarding the unbiased optimization estimate, I am wondering about the optimization of parameters for each training set. If there is an inner cross-validation within the Optimize Parameters operator, wouldn't the parameters be selected on subsets of the training data rather than being optimized for the whole training set within each fold?

     

    Thank you

  • lsevel Member Posts: 18 Contributor II

    (Small bump)

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    It's probably better to tag @SGolbert than to bump. :)

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Hi lsevel,

    The process seems correct. Regarding the second part: the Cross Validation operator returns a model trained on all the data it receives (in this case, one of the outer training folds). The optimized parameters are therefore selected using the whole training fold; more precisely, by averaging the performance over different subsets of that fold (i.e. in the inner CV).

     

     

    Best,

    Sebastian
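
The distinction drawn above, performance averaged over folds versus a returned model trained on everything the operator received, can be shown with a toy Python sketch. The "model" here is simply the mean of the training labels, and the data and function name are made up for illustration. It is not a description of RapidMiner's implementation.

```python
from statistics import mean


def cross_validation(ys, k):
    """Return (performance averaged over k folds, model trained on ALL rows).

    The 'model' is just the mean of the training labels, and performance is
    the mean squared error on each held-out fold.
    """
    n = len(ys)
    fold_errors = []
    for f in range(k):
        test_idx = list(range(f, n, k))          # every k-th row, no shuffling
        train = [ys[i] for i in range(n) if i % k != f]
        fold_model = mean(train)                 # trained on k-1 folds only
        fold_errors.append(mean([(ys[i] - fold_model) ** 2 for i in test_idx]))
    final_model = mean(ys)                       # trained on everything received
    return mean(fold_errors), final_model


performance, model = cross_validation([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], k=3)
print(performance, model)  # 3.75 3.5
```

The fold models are used only for scoring and are then discarded. The model handed onward has seen every row passed to the operator.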

  • darkphoenix_isa Member Posts: 4 Contributor I
    Hi there, I recently started using the Optimize Parameters operator, with a CV inside it. Just to clarify: so basically we have to use Optimize Parameters with a CV inside it and wrap both in an outer CV? Is that why I get a different result when I apply the parameters found by the first approach in a standalone CV?
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Yes, that would be considered best practice.  You can also check out this white paper on correct validation here:
    This could be one reason, but another could simply be that the two cross-validations use different splits, which depend on the random seed. To avoid this, you can give both the same local random seed. There are potentially more reasons, depending on how exactly you set up the process...
    Hope this helps,
    Ingo
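
The effect of the random seed on the splits can be illustrated with a short Python sketch. The splitting function and the seed values are assumptions for illustration only and do not reproduce RapidMiner's sampling.

```python
import random


def k_fold_split(n_rows, k, seed):
    """Shuffle row indices with a fixed seed and deal them into k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[i::k]) for i in range(k)]


same_a = k_fold_split(20, 5, seed=1992)
same_b = k_fold_split(20, 5, seed=1992)
other = k_fold_split(20, 5, seed=7)

print(same_a == same_b)  # True: the same seed reproduces the same folds
print(same_a == other)   # a different seed is expected to give different folds
```

When two cross-validations are compared, fixing the seed in both removes the split as a source of difference, so any remaining gap comes from the process itself.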