[SOLVED] Optimize selection, how to get the resulting best model?

juanchin Member Posts: 2 Contributor I
edited November 2018 in Help
Hello,

First of all, I would like to congratulate you on this magnificent piece of software. It's really productive, and it is easy to build all the steps one could think of.

Now, my question is that I have an Optimize Selection (Evolutionary) operator with a Log to see the population fitness of the evolution process. My problem is that I don't know how to get the best resulting model, and the only thing that I can do is to re-train another model with the resulting attribute weights.

Is this the correct way to do it?

Again, congratulations to Rapid-I and the developers of this software.  :o

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    thanks for your kind words! We really appreciate positive comments about RapidMiner. (Of course we also appreciate negative comments but we actually like the positive ones much more  ;D )

    "Now, my question is that I have an Optimize Selection (Evolutionary) operator with a Log to see the population fitness of the evolution process. My problem is that I don't know how to get the best resulting model, and the only thing that I can do is to re-train another model with the resulting attribute weights."
    Yes, you have to train the model on the complete data set, but only on the attributes that have been selected, to get the final prediction model. There is a good reason for that: there is no "best" resulting model. I assume that you have used an inner cross validation, let's say with 10 folds. That means that there are 10 different models for each attribute selection. Which one is the best? The one with the best performance on its test set? Well, that would be overfitting to the test set. My answer is: there is no best model coming out of cross validation. Cross validation is for performance estimation only, not for model selection. Model selection has to be done independently in order not to introduce a new form of test-set overfitting.

    Getting the weights and the data outside of the cross validation also allows for some nice tricks: you could now train the model on the complete data, apply the weights to an independent test set which has not been used for the attribute selection, calculate a performance, and put all of this into another, outer cross validation. This way you can even measure the overfitting effect of the attribute selection itself (which will definitely be there!).
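
    A rough sketch of such an outer cross validation might look like the process below. It assumes the same 5.2 operator classes and port names used elsewhere in this thread. The operator names ("Outer Validation", "Inner Validation", "Train Final Model" and so on) are made up for illustration, and the layout attributes are left out on the assumption that RapidMiner arranges the operators itself on import. The weights found on each training fold are handed to the test side through the "through 1" port, so the test fold is never seen by the attribute selection.

    ```xml
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.000" expanded="true" name="Root">
        <process expanded="true">
          <operator activated="true" class="generate_direct_mailing_data" compatibility="5.2.000" expanded="true" name="Create Data">
            <parameter key="number_examples" value="500"/>
          </operator>
          <!-- Outer validation: every fold repeats the complete attribute selection on its training part only -->
          <operator activated="true" class="x_validation" compatibility="5.2.000" expanded="true" name="Outer Validation">
            <parameter key="use_local_random_seed" value="true"/>
            <process expanded="true">
              <operator activated="true" class="optimize_selection_evolutionary" compatibility="5.2.000" expanded="true" name="Optimize Selection (Evolutionary)">
                <process expanded="true">
                  <!-- Inner validation: estimates the fitness of each candidate attribute set -->
                  <operator activated="true" class="x_validation" compatibility="5.2.000" expanded="true" name="Inner Validation">
                    <parameter key="use_local_random_seed" value="true"/>
                    <process expanded="true">
                      <operator activated="true" class="naive_bayes" compatibility="5.2.000" expanded="true" name="Naive Bayes"/>
                      <connect from_port="training" to_op="Naive Bayes" to_port="training set"/>
                      <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
                    </process>
                    <process expanded="true">
                      <operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" name="Apply Model">
                        <list key="application_parameters"/>
                      </operator>
                      <operator activated="true" class="performance" compatibility="5.2.000" expanded="true" name="Performance"/>
                      <connect from_port="model" to_op="Apply Model" to_port="model"/>
                      <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
                      <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
                      <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
                    </process>
                  </operator>
                  <connect from_port="example set" to_op="Inner Validation" to_port="training"/>
                  <connect from_op="Inner Validation" from_port="averagable 1" to_port="performance"/>
                </process>
              </operator>
              <!-- Train on the whole outer training fold, restricted to the selected attributes -->
              <operator activated="true" class="naive_bayes" compatibility="5.2.000" expanded="true" name="Train Final Model"/>
              <connect from_port="training" to_op="Optimize Selection (Evolutionary)" to_port="example set in"/>
              <connect from_op="Optimize Selection (Evolutionary)" from_port="example set out" to_op="Train Final Model" to_port="training set"/>
              <connect from_op="Optimize Selection (Evolutionary)" from_port="weights" to_port="through 1"/>
              <connect from_op="Train Final Model" from_port="model" to_port="model"/>
            </process>
            <process expanded="true">
              <!-- Apply the same attribute selection to the untouched outer test fold -->
              <operator activated="true" class="select_by_weights" compatibility="5.2.000" expanded="true" name="Use Same Attributes for Test Fold"/>
              <operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" name="Apply Final Model">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" compatibility="5.2.000" expanded="true" name="Outer Performance"/>
              <connect from_port="test set" to_op="Use Same Attributes for Test Fold" to_port="example set input"/>
              <connect from_port="through 1" to_op="Use Same Attributes for Test Fold" to_port="weights"/>
              <connect from_port="model" to_op="Apply Final Model" to_port="model"/>
              <connect from_op="Use Same Attributes for Test Fold" from_port="example set output" to_op="Apply Final Model" to_port="unlabelled data"/>
              <connect from_op="Apply Final Model" from_port="labelled data" to_op="Outer Performance" to_port="labelled data"/>
              <connect from_op="Outer Performance" from_port="performance" to_port="averagable 1"/>
            </process>
          </operator>
          <connect from_op="Create Data" from_port="output" to_op="Outer Validation" to_port="training"/>
          <connect from_op="Outer Validation" from_port="averagable 1" to_port="result 1"/>
        </process>
      </operator>
    </process>
    ```

    The performance delivered by the outer validation should then be compared with the best performance reported inside the optimization: the gap between the two gives a rough idea of how much the attribute selection itself has overfitted.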

    Below you can find a process which trains the final model and applies it to another data set without the label (scoring). As you can see, it is important to select the same attributes on the other data set as well, which can be done with the "Select by Weights" operator. This process, however, does not show an outer cross validation...

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.000" expanded="true" name="Root">
        <process expanded="true" height="486" width="550">
          <operator activated="true" class="generate_direct_mailing_data" compatibility="5.2.000" expanded="true" height="60" name="Create Training Data" width="90" x="45" y="120">
            <parameter key="number_examples" value="500"/>
          </operator>
          <operator activated="true" class="optimize_selection_evolutionary" compatibility="5.2.000" expanded="true" height="94" name="Optimize Selection (Evolutionary)" width="90" x="179" y="120">
            <process expanded="true" height="504" width="840">
              <operator activated="true" class="x_validation" compatibility="5.2.000" expanded="true" height="112" name="Validation" width="90" x="45" y="30">
                <parameter key="use_local_random_seed" value="true"/>
                <process expanded="true" height="504" width="395">
                  <operator activated="true" class="naive_bayes" compatibility="5.2.000" expanded="true" height="76" name="Naive Bayes" width="90" x="45" y="30"/>
                  <connect from_port="training" to_op="Naive Bayes" to_port="training set"/>
                  <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
                  <portSpacing port="source_training" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                  <portSpacing port="sink_through 1" spacing="0"/>
                </process>
                <process expanded="true" height="504" width="395">
                  <operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                    <list key="application_parameters"/>
                  </operator>
                  <operator activated="true" class="performance" compatibility="5.2.000" expanded="true" height="76" name="Performance" width="90" x="179" y="30"/>
                  <connect from_port="model" to_op="Apply Model" to_port="model"/>
                  <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
                  <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
                  <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
                  <portSpacing port="source_model" spacing="0"/>
                  <portSpacing port="source_test set" spacing="0"/>
                  <portSpacing port="source_through 1" spacing="0"/>
                  <portSpacing port="sink_averagable 1" spacing="0"/>
                  <portSpacing port="sink_averagable 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="log" compatibility="5.2.000" expanded="true" height="76" name="Log" width="90" x="179" y="30">
                <list key="log">
                  <parameter key="generation" value="operator.Optimize Selection (Evolutionary).value.generation"/>
                  <parameter key="best performance" value="operator.Optimize Selection (Evolutionary).value.best"/>
                </list>
              </operator>
              <connect from_port="example set" to_op="Validation" to_port="training"/>
              <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
              <connect from_op="Log" from_port="through 1" to_port="performance"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="naive_bayes" compatibility="5.2.000" expanded="true" height="76" name="Naive Bayes (2)" width="90" x="313" y="30"/>
          <operator activated="true" class="generate_direct_mailing_data" compatibility="5.2.000" expanded="true" height="60" name="Create Test Data" width="90" x="45" y="255">
            <parameter key="number_examples" value="500"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="5.2.000" expanded="true" height="76" name="Remove Label" width="90" x="179" y="255">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="label"/>
            <parameter key="invert_selection" value="true"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="select_by_weights" compatibility="5.2.000" expanded="true" height="94" name="Use Same Attributes for Test Data" width="90" x="313" y="255"/>
          <operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model (2)" width="90" x="447" y="30">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Create Training Data" from_port="output" to_op="Optimize Selection (Evolutionary)" to_port="example set in"/>
          <connect from_op="Optimize Selection (Evolutionary)" from_port="example set out" to_op="Naive Bayes (2)" to_port="training set"/>
          <connect from_op="Optimize Selection (Evolutionary)" from_port="weights" to_op="Use Same Attributes for Test Data" to_port="weights"/>
          <connect from_op="Naive Bayes (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="Create Test Data" from_port="output" to_op="Remove Label" to_port="example set input"/>
          <connect from_op="Remove Label" from_port="example set output" to_op="Use Same Attributes for Test Data" to_port="example set input"/>
          <connect from_op="Use Same Attributes for Test Data" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Cheers,
    Ingo
  • juanchin Member Posts: 2 Contributor I
    Thank you Ingo for your response,

    I think that I understand what you say, but in this case what I was actually doing was splitting the data inside the Optimize Selection operator and evaluating the model on one third of the original data to get the performance, so that the process would run a little faster than with an entire X-Validation.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.001" expanded="true" name="Root">
        <process expanded="true" height="486" width="681">
          <operator activated="true" class="generate_direct_mailing_data" compatibility="5.2.001" expanded="true" height="60" name="Create Training Data" width="90" x="45" y="120">
            <parameter key="number_examples" value="500"/>
          </operator>
          <operator activated="true" class="optimize_selection_evolutionary" compatibility="5.2.001" expanded="true" height="94" name="Optimize Selection (Evolutionary)" width="90" x="179" y="120">
            <process expanded="true" height="605" width="794">
              <operator activated="true" class="split_data" compatibility="5.2.001" expanded="true" height="94" name="Split Data" width="90" x="112" y="120">
                <enumeration key="partitions">
                  <parameter key="ratio" value="0.7"/>
                  <parameter key="ratio" value="0.3"/>
                </enumeration>
              </operator>
              <operator activated="true" class="naive_bayes" compatibility="5.2.001" expanded="true" height="76" name="Naive Bayes" width="90" x="246" y="30"/>
              <operator activated="true" class="apply_model" compatibility="5.2.001" expanded="true" height="76" name="Apply Model (3)" width="90" x="380" y="120">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="5.2.001" expanded="true" height="76" name="Performance (2)" width="90" x="514" y="30">
                <list key="class_weights"/>
              </operator>
              <operator activated="true" class="log" compatibility="5.2.001" expanded="true" height="76" name="Log" width="90" x="648" y="120">
                <list key="log">
                  <parameter key="generation" value="operator.Optimize Selection (Evolutionary).value.generation"/>
                  <parameter key="best performance" value="operator.Optimize Selection (Evolutionary).value.best"/>
                </list>
              </operator>
              <connect from_port="example set" to_op="Split Data" to_port="example set"/>
              <connect from_op="Split Data" from_port="partition 1" to_op="Naive Bayes" to_port="training set"/>
              <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model (3)" to_port="unlabelled data"/>
              <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (3)" to_port="model"/>
              <connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
              <connect from_op="Performance (2)" from_port="performance" to_op="Log" to_port="through 1"/>
              <connect from_op="Log" from_port="through 1" to_port="performance"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="naive_bayes" compatibility="5.2.001" expanded="true" height="76" name="Naive Bayes (2)" width="90" x="447" y="30"/>
          <operator activated="true" class="generate_direct_mailing_data" compatibility="5.2.001" expanded="true" height="60" name="Create Test Data" width="90" x="45" y="255">
            <parameter key="number_examples" value="500"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="5.2.001" expanded="true" height="76" name="Remove Label" width="90" x="179" y="255">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="label"/>
            <parameter key="invert_selection" value="true"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="select_by_weights" compatibility="5.2.001" expanded="true" height="94" name="Use Same Attributes for Test Data" width="90" x="380" y="255"/>
          <operator activated="true" class="apply_model" compatibility="5.2.001" expanded="true" height="76" name="Apply Model (2)" width="90" x="581" y="75">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Create Training Data" from_port="output" to_op="Optimize Selection (Evolutionary)" to_port="example set in"/>
          <connect from_op="Optimize Selection (Evolutionary)" from_port="example set out" to_op="Naive Bayes (2)" to_port="training set"/>
          <connect from_op="Optimize Selection (Evolutionary)" from_port="weights" to_op="Use Same Attributes for Test Data" to_port="weights"/>
          <connect from_op="Naive Bayes (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="Create Test Data" from_port="output" to_op="Remove Label" to_port="example set input"/>
          <connect from_op="Remove Label" from_port="example set output" to_op="Use Same Attributes for Test Data" to_port="example set input"/>
          <connect from_op="Use Same Attributes for Test Data" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Nevertheless, I hadn't noticed that what I was trying to do made no sense since, as you say, it will always be better to have all the data to train the final model.

    Thank you,

    Juan.
  • ffeher Member Posts: 1 Contributor I

    I found this question while trying to learn more about optimizing attribute selection. I loaded the process, and some of the operators (X-Validation) looked outdated. Given the amazing job RapidMiner does of keeping things updated, I was wondering whether there is a more recent version of this process that might perform better. If I were more skilled with RapidMiner, I might be able to answer this myself, but as I am not, I thought some of the awesome people here might be able to help me. Or maybe it works just fine and requires no changes.
