Nested Crossvalidation - out of sample data/ model problem
Hello,
I have a problem with nested cross-validation. I have already read the earlier discussion at the following link and implemented the code from it:
http://rapid-i.com/rapidforum/index.php?topic=615.0
But this does not really help me.
I have a binary classification problem with input data of the type "real".
I first split the data into a training set and a test set. The test set remains completely untouched until the end. For training I use cross-validation. To optimize the SVM parameters (C, gamma), I use the Optimize Parameters operator. To make the parameter settings available for training the new, optimized SVM, I use the Set Parameters operator (ParameterSetter in the process), which should deliver the optimal C and gamma to the new SVM. This SVM is then trained again on the whole training set with the optimal C and gamma. After that I want the performance of the new SVM on both the training set and the test set, so I apply the model once to the test set and once to the training set.
The problem is that the Set Parameters operator does not seem to deliver the optimal parameter settings to the new SVM. I can see this because the C and gamma of the new SVM have not changed after the process has run. Another indicator is that the kernel model of the test model varies in the number of support vectors it uses. Moreover, if I put a random name into the name map of the Set Parameters operator in the "set operator name" field (instead of SVM_train), there is no error message and the result is the same.
Can you please help me? My XML code is attached below.
Many thanks in advance
Daniel
Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Root">
<description><p> Often the different operators have many parameters and it is not clear which parameter values are best for the learning task at hand. The parameter optimization operator helps to find an optimal parameter set for the used operators. </p> <p> The inner crossvalidation estimates the performance for each parameter set. In this process two parameters of the SVM are tuned. The result can be plotted in 3D (using gnuplot) or in color mode. </p> <p> Try the following: <ul> <li>Start the process. The result is the best parameter set and the performance which was achieved with this parameter set.</li> <li>Edit the parameter list of the ParameterOptimization operator to find another parameter set.</li> </ul> </p> </description>
<parameter key="parallelize_main_process" value="true"/>
<process expanded="true" height="449" width="815">
<operator activated="true" class="read_excel" compatibility="5.3.000" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
<parameter key="excel_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Feature_Matrix_nonlin_test.xls"/>
<parameter key="imported_cell_range" value="A1:IV262"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="label.true.binominal.label"/>
<parameter key="1" value="a1.true.real.attribute"/>
<parameter key="253" value="a253.true.real.attribute"/>
<parameter key="254" value="a254.true.real.attribute"/>
<parameter key="255" value="a255.true.real.attribute"/>
</list>
</operator>
<operator activated="true" class="split_data" compatibility="5.3.000" expanded="true" height="94" name="Split Data" width="90" x="45" y="120">
<enumeration key="partitions">
<parameter key="ratio" value="0.9"/>
<parameter key="ratio" value="0.1"/>
</enumeration>
<parameter key="sampling_type" value="linear sampling"/>
</operator>
<operator activated="true" class="multiply" compatibility="5.3.000" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/>
<operator activated="true" class="optimize_parameters_grid" compatibility="5.3.000" expanded="true" height="130" name="loopThroughLocalParams" width="90" x="313" y="30">
<list key="parameters">
<parameter key="SVM_train.C" value="[1;100;10;quadratic]"/>
<parameter key="SVM_train.gamma" value="[0.0;100;10;quadratic]"/>
</list>
<parameter key="parallelize_optimization_process" value="true"/>
<process expanded="true" height="316" width="699">
<operator activated="true" class="parallel:x_validation_parallel" compatibility="5.1.002" expanded="true" height="112" name="Validation" width="90" x="313" y="30">
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true" height="334" width="360">
<operator activated="true" class="support_vector_machine_libsvm" compatibility="5.3.000" expanded="true" height="76" name="SVM_train" width="90" x="112" y="30">
<parameter key="gamma" value="100.0"/>
<parameter key="C" value="100.0"/>
<list key="class_weights"/>
</operator>
<connect from_port="training" to_op="SVM_train" to_port="training set"/>
<connect from_op="SVM_train" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="334" width="360">
<operator activated="true" class="apply_model" compatibility="5.3.000" expanded="true" height="76" name="Test_train" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.3.000" expanded="true" height="76" name="ClassificationPerformance_train_train" width="90" x="179" y="30">
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="weighted_mean_precision" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="root_mean_squared_error" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Test_train" to_port="model"/>
<connect from_port="test set" to_op="Test_train" to_port="unlabelled data"/>
<connect from_op="Test_train" from_port="labelled data" to_op="ClassificationPerformance_train_train" to_port="labelled data"/>
<connect from_op="ClassificationPerformance_train_train" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="model" to_port="result 2"/>
<connect from_op="Validation" from_port="training" to_port="result 1"/>
<connect from_op="Validation" from_port="averagable 1" to_port="performance"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_parameters" compatibility="5.3.000" expanded="true" height="94" name="ParameterSetter" width="90" x="514" y="30">
<list key="name_map">
<parameter key="SVM_train" value="SVM_test"/>
<parameter key="SVM_train" value="applyModel"/>
<parameter key="SVM_train" value="applyModel (2)"/>
</list>
</operator>
<operator activated="true" class="support_vector_machine_libsvm" compatibility="5.3.000" expanded="true" height="76" name="SVM_test" width="90" x="179" y="165">
<list key="class_weights"/>
<parameter key="calculate_confidences" value="true"/>
</operator>
<operator activated="true" class="multiply" compatibility="5.3.000" expanded="true" height="94" name="Multiply (2)" width="90" x="313" y="165"/>
<operator activated="true" class="apply_model" compatibility="5.3.000" expanded="true" height="76" name="applyModel" width="90" x="447" y="165">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.3.000" expanded="true" height="76" name="Performance_testset" width="90" x="581" y="165">
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="weighted_mean_precision" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="root_mean_squared_error" value="true"/>
<list key="class_weights"/>
</operator>
<operator activated="true" class="apply_model" compatibility="5.3.000" expanded="true" height="76" name="applyModel (2)" width="90" x="447" y="255">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.3.000" expanded="true" height="76" name="Performance_train_new" width="90" x="581" y="255">
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="weighted_mean_precision" value="true"/>
<parameter key="absolute_error" value="true"/>
<parameter key="root_mean_squared_error" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_op="Multiply" to_port="input"/>
<connect from_op="Split Data" from_port="partition 2" to_op="applyModel" to_port="unlabelled data"/>
<connect from_op="Multiply" from_port="output 1" to_op="loopThroughLocalParams" to_port="input 1"/>
<connect from_op="Multiply" from_port="output 2" to_op="applyModel (2)" to_port="unlabelled data"/>
<connect from_op="loopThroughLocalParams" from_port="performance" to_op="ParameterSetter" to_port="through 1"/>
<connect from_op="loopThroughLocalParams" from_port="parameter" to_op="ParameterSetter" to_port="parameter set"/>
<connect from_op="loopThroughLocalParams" from_port="result 1" to_op="SVM_test" to_port="training set"/>
<connect from_op="loopThroughLocalParams" from_port="result 2" to_port="result 6"/>
<connect from_op="ParameterSetter" from_port="parameter set" to_port="result 1"/>
<connect from_op="ParameterSetter" from_port="through 1" to_port="result 3"/>
<connect from_op="SVM_test" from_port="model" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_op="applyModel" to_port="model"/>
<connect from_op="Multiply (2)" from_port="output 2" to_op="applyModel (2)" to_port="model"/>
<connect from_op="applyModel" from_port="labelled data" to_op="Performance_testset" to_port="labelled data"/>
<connect from_op="applyModel" from_port="model" to_port="result 5"/>
<connect from_op="Performance_testset" from_port="performance" to_port="result 2"/>
<connect from_op="applyModel (2)" from_port="labelled data" to_op="Performance_train_new" to_port="labelled data"/>
<connect from_op="Performance_train_new" from_port="performance" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
<portSpacing port="sink_result 7" spacing="0"/>
</process>
</operator>
</process>
You must *not* apply the parameters to the Apply Model operator, but only to the SVM_train operator. Specifying Apply Model in Set Parameters results in an error in that operator, and it does not do anything. I admit it is confusing that it does not print any warning or error, and I'll create an internal ticket requesting a fix. However, removing the respective lines and keeping only SVM_train -> SVM_test will get your process running correctly.
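As a rough sketch of that change, assuming everything else in the posted process stays as it is, the ParameterSetter operator would keep only the first entry of its name map. The attributes are copied from the posted process; only the two Apply Model mappings are dropped:

```xml
<operator activated="true" class="set_parameters" compatibility="5.3.000" expanded="true" height="94" name="ParameterSetter" width="90" x="514" y="30">
<list key="name_map">
<!-- Only the learner is mapped: settings found for SVM_train are applied to SVM_test. -->
<parameter key="SVM_train" value="SVM_test"/>
</list>
</operator>
```

The two removed entries mapped SVM_train to applyModel and applyModel (2), which have no C or gamma parameters to receive.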
Best regards,
Marius
Thanks a lot for the reply; yes, it was quite confusing. The problem now is that I have combined the nested cross-validation with a forward selection to get the best features together with the best parameter setting, and the new SVM does not receive the best parameter setting. To check this I added a Log operator after the cross-validation operator, which shows me the best parameter setting of "SVM_train" and the best features. But the new SVM ("SVM_test") does not receive that setting from the Set Parameters operator. I verified this by inspecting the output of the Set Parameters operator: the setting it reports as optimal is the one that ends up in the new SVM, but the Log operator of the cross-validation shows that a different parameter set is optimal. That different set is also the one that remains in SVM_train after the cross-validation has finished.
Here is the code: Many thanks in advance for the help.
Kind regards,
Daniel
Here is what you can do: the Optimize Parameters operator has a results output. That outputs the results generated by the inner process with the best parameter combination. So you have to:
- change the order of Optimize Parameters/Forward Selection such that Forward Selection is inside Optimize Parameters. The result should be untouched by this change.
- connect the weights output of the Forward Selection to the results output.
- Outside of the optimization operators, place the Parameter Setter and the Select by Weights operator.
Best regards,
Marius
Thanks a lot, that explains it. Is there any other possibility? I want to run a wide grid search to optimize the SVM parameters, so there are lots of combinations. With your process, a forward selection would be run for every combination, which costs a lot of time. For me it would be sufficient if, in every generation, the best feature were added and the best parameter setting chosen. I think this would be less computationally expensive, right?
There should not be a big difference when it comes to the runtime. If you test R parameter combinations, and for each parameter combination you test S combinations of attributes, you end up with R*S executions of the cross-validation. If you test S attribute combinations, and for each attribute combination you search for the best parameters, you end up with the same number of validations.
Please correct me if I did not catch a part of your descriptions.
To reduce the runtime, I suggest performing a stepwise parameter search: first search a wide range with only a few steps in each parameter, then iteratively narrow the search space by adjusting min and max for each parameter to the most promising range.
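A minimal sketch of such a two-pass search, written as the parameter list of the grid optimization operator from the process above. It assumes the range syntax reads as [min;max;steps;scale], as the posted values suggest. The step counts and the narrowed bounds are made-up illustrations, not results, and the two lists are alternatives for successive runs rather than one combined configuration:

```xml
<!-- Pass 1 (coarse): the full posted range, but only a few steps per parameter. -->
<list key="parameters">
<parameter key="SVM_train.C" value="[1;100;3;quadratic]"/>
<parameter key="SVM_train.gamma" value="[0.0;100;3;quadratic]"/>
</list>

<!-- Pass 2 (narrowed): hypothetical bounds around the best region found in pass 1. -->
<list key="parameters">
<parameter key="SVM_train.C" value="[1;30;5;quadratic]"/>
<parameter key="SVM_train.gamma" value="[0.0;10;5;quadratic]"/>
</list>
```

Each pass costs far fewer cross-validations than one fine grid over the whole range, and the narrowing can be repeated until the best performance stops improving.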
Best regards,
Marius
Yes, you are right in the case where the optimal feature set is fixed. But a priori you cannot know how many features (iterations) will be added for each parameter combination, and it makes no sense to limit this at the beginning: you cannot know whether the optimal feature set comprises 5 or 20 features, so the maximum number of iterations is unknown. With a grid search inside the FS it is a little different. You do not know the number of optimal features either, and you may have fewer feature selection runs, but you are certainly faster, because the grid search is run for each level of the FS, it is faster the fewer features there are, and the grid of parameter combinations is fixed.
I am also sure that you have fewer iterations in total, since it will converge faster to an optimum.
Do you get my point?
In any case you are right about the stepwise parameter optimization. It is probably the fastest way to do this, but it would still be helpful to be able to run the grid search within the FS.
Best regards,
Marius
Like actually using one operation to do both feature selection and parameter optimization at once.
Best regards,
Marius
There are many interesting papers on simultaneous feature selection plus parameter optimization.
Most approaches involve genetic algorithms (see the figure below).
The authors of [1] show that doing both at once is intrinsically advantageous compared to an iterating setup.
This is especially so when trying to optimize numerical parameters (e.g. C and gamma in an SVM).
[1]
http://nlg.csie.ntu.edu.tw/~cjwang/publications/A%20GA-based%20feature%20selection%20and%20parameters%20optimizationfor%20support%20vector%20machines.pdf
@ Danyo83
If it is of great importance for you, I might be able to implement a GA based simultaneous approach (as outlined in the figure).
Best regards,
Wessel
That sounds great. It would also be a lot faster than combining a vast parameter grid search with a GA or another FS approach. Do you mean via RM?
Marius, maybe this would also be a good internal project for guys from RM?!
Having an operator like that would certainly be a plus for RapidMiner, but we probably won't implement it in the near future because we have other projects with higher priority going on. However, if you, Wessel, implement that operator, you can make it available to the public via our Marketplace. There you can offer it for free or for a fee, as you choose.
Concerning the GA optimization, you may want to add a penalty on long runtimes during the evaluation of individuals. Otherwise your GA may generate a parameter combination that results in a very high runtime (e.g. a large C for the SVM), and this individual may survive for several generations and slow down the complete process considerably, even though there are better (in terms of performance) and faster combinations. That is a problem we sometimes run into with our genetic parameter optimization, so if you implement it from scratch, that is an itch that could be removed.
Best regards,
Marius