Write best model to disk

chris_ml Member Posts: 17 Maven
edited November 2018 in Help
Hi,

I would like to compare the performance of two different learners with the
T-Test and Anova operators and finally write the better model to disk in order
to use it later.

This is the process I use for the performance evaluation:

T-Test and Anova

<operator name="Root" class="Process" expanded="yes">
    <description text="#ylt#p#ygt#Many RapidMiner operators can be used to estimate the performance of a learner, a preprocessing step, or a feature space on one or several data sets. The result of these validation operators is a performance vector collecting the values of a set of performance criteria. For each criterion, the mean value and standard deviation are given. #ylt#/p#ygt#  #ylt#p#ygt#The question is how these performance vectors can be compared? Statistical significance tests like ANOVA or pairwise t-tests can be used to calculate the probability that the actual mean values are different. #ylt#/p#ygt# #ylt#p#ygt# We assume that you have achieved several performance vectors and want to compare them. In this experiment we use the same data set for both cross validations (hence the IOMultiplier) and estimate the performance of a linear learning scheme and a RBF based SVM. #ylt#/p#ygt# #ylt#p#ygt# Run the experiment and compare the results: the probabilities for a significant difference are equal since only two performance vectors were created. In this case the SVM is probably better suited for the data set at hand since the actual mean values are probably different.#ylt#/p#ygt##ylt#p#ygt#Please note that performance vectors like all other objects which can be passed between RapidMiner operators can be written into and loaded from a file.#ylt#/p#ygt#"/>
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="attributes_lower_bound" value="-40.0"/>
        <parameter key="attributes_upper_bound" value="30.0"/>
        <parameter key="number_examples" value="80"/>
        <parameter key="number_of_attributes" value="1"/>
        <parameter key="target_function" value="one variable non linear"/>
    </operator>
    <operator name="IOMultiplier" class="IOMultiplier">
        <parameter key="io_object" value="ExampleSet"/>
    </operator>
    <operator name="XValidation" class="XValidation" expanded="yes">
        <parameter key="sampling_type" value="shuffled sampling"/>
        <operator name="LibSVMLearner" class="LibSVMLearner">
            <parameter key="C" value="10000.0"/>
            <list key="class_weights">
            </list>
            <parameter key="svm_type" value="nu-SVR"/>
        </operator>
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="RegressionPerformance" class="RegressionPerformance">
                <parameter key="absolute_error" value="true"/>
            </operator>
        </operator>
    </operator>
    <operator name="XValidation (2)" class="XValidation" expanded="yes">
        <parameter key="sampling_type" value="shuffled sampling"/>
        <operator name="LinearRegression" class="LinearRegression">
        </operator>
        <operator name="OperatorChain (2)" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier (2)" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="RegressionPerformance (2)" class="RegressionPerformance">
                <parameter key="absolute_error" value="true"/>
            </operator>
        </operator>
    </operator>
    <operator name="T-Test" class="T-Test">
    </operator>
    <operator name="Anova" class="Anova">
    </operator>
</operator>
I have no idea how to retrieve the better learner after the evaluation with T-Test and
Anova. I assume that both models must be stored temporarily before the T-Test
operator and that the better learner is then written to disk depending on the
PerformanceVector of the Anova, right? But I don't know how to do that.
Any ideas?

Regards,
Chris

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Chris,
    as far as I know, ANOVA only calculates the probability that the models are not the same. How would you select the better model from that?
    Instead, you could use the original performance vectors to choose the better model. If you want to compare different learners, try something like this:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="attributes_lower_bound" value="-40.0"/>
            <parameter key="attributes_upper_bound" value="30.0"/>
            <parameter key="number_examples" value="80"/>
            <parameter key="number_of_attributes" value="1"/>
            <parameter key="target_function" value="one variable non linear"/>
        </operator>
        <operator name="GridParameterOptimization" class="GridParameterOptimization" expanded="yes">
            <list key="parameters">
              <parameter key="OperatorSelector_train.select_which" value="[1.0;2.0;10;linear]"/>
            </list>
            <operator name="XValidation" class="XValidation" expanded="yes">
                <parameter key="average_performances_only" value="false"/>
                <parameter key="keep_example_set" value="true"/>
                <parameter key="sampling_type" value="shuffled sampling"/>
                <operator name="OperatorSelector_train" class="OperatorSelector" expanded="yes">
                    <parameter key="select_which" value="2"/>
                    <operator name="LibSVMLearner" class="LibSVMLearner">
                        <parameter key="C" value="10000.0"/>
                        <list key="class_weights">
                        </list>
                        <parameter key="svm_type" value="nu-SVR"/>
                    </operator>
                    <operator name="LinearRegression" class="LinearRegression">
                    </operator>
                </operator>
                <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                    <operator name="ModelApplier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="RegressionPerformance" class="RegressionPerformance">
                        <parameter key="absolute_error" value="true"/>
                    </operator>
                </operator>
            </operator>
        </operator>
        <operator name="ParameterSetter" class="ParameterSetter">
            <list key="name_map">
              <parameter key="OperatorSelector_train" value="OperatorSelector_apply"/>
            </list>
        </operator>
        <operator name="OperatorSelector_apply" class="OperatorSelector" expanded="yes">
            <operator name="LibSVMLearner (2)" class="LibSVMLearner">
                <parameter key="C" value="10000.0"/>
                <list key="class_weights">
                </list>
                <parameter key="svm_type" value="nu-SVR"/>
            </operator>
            <operator name="LinearRegression (2)" class="LinearRegression">
            </operator>
        </operator>
    </operator>
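    Since your original goal was to write the better model to disk: if I am not mistaken, you can simply append a ModelWriter operator after OperatorSelector_apply (which trains the selected learner on the complete example set) to store the resulting model in a file. A minimal sketch (the file name is of course just an example):

    <operator name="ModelWriter" class="ModelWriter">
        <!-- example path, adjust to your needs -->
        <parameter key="model_file" value="C:\best_model.mod"/>
    </operator>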
    Greetings,
      Sebastian
  • chris_ml Member Posts: 17 Maven
    Hey Sebastian,

    the model you proposed is what I was looking for. :-)

    However, I tried to extend it but could not find a working solution.
    What I want is to replace the simple learners with their default parameters
    inside OperatorSelector_train by a GridParameterOptimization, i.e. for two different
    learners I want to optimize their parameters and finally return the most
    accurate learner so that it can be used later.

    The first problem is that I was not able to add a GridParameterOptimization
    to your process:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="attributes_lower_bound" value="-40.0"/>
            <parameter key="attributes_upper_bound" value="30.0"/>
            <parameter key="number_examples" value="80"/>
            <parameter key="number_of_attributes" value="1"/>
            <parameter key="target_function" value="one variable non linear"/>
        </operator>
        <operator name="GridParameterOptimization" class="GridParameterOptimization" expanded="yes">
            <list key="parameters">
              <parameter key="OperatorSelector_train.select_which" value="[1.0;2.0;10;linear]"/>
            </list>
            <operator name="XValidation" class="XValidation" expanded="yes">
                <parameter key="average_performances_only" value="false"/>
                <parameter key="keep_example_set" value="true"/>
                <parameter key="sampling_type" value="shuffled sampling"/>
                <operator name="OperatorSelector_train" class="OperatorSelector" expanded="yes">
                    <parameter key="select_which" value="2"/>
                    <operator name="GridParameterOptimization (2)" class="GridParameterOptimization" expanded="yes">
                        <list key="parameters">
                          <parameter key="LibSVMLearner.svm_type" value="C-SVC,nu-SVC,one-class,epsilon-SVR,nu-SVR"/>
                          <parameter key="LibSVMLearner.degree" value="[1.0;1000.0;10;linear]"/>
                        </list>
                        <operator name="LibSVMLearner" class="LibSVMLearner">
                            <parameter key="C" value="10000.0"/>
                            <list key="class_weights">
                            </list>
                            <parameter key="svm_type" value="nu-SVR"/>
                        </operator>
                    </operator>
                    <operator name="GridParameterOptimization (3)" class="GridParameterOptimization" expanded="yes">
                        <list key="parameters">
                          <parameter key="LinearRegression.feature_selection" value="none,M5 prime,greedy"/>
                          <parameter key="LinearRegression.ridge" value="[0.0;1000.0;10;linear]"/>
                        </list>
                        <operator name="LinearRegression" class="LinearRegression">
                        </operator>
                    </operator>
                </operator>
                <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                    <operator name="ModelApplier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="RegressionPerformance" class="RegressionPerformance">
                        <parameter key="absolute_error" value="true"/>
                    </operator>
                </operator>
            </operator>
        </operator>
        <operator name="ParameterSetter" class="ParameterSetter">
            <list key="name_map">
              <parameter key="OperatorSelector_train" value="OperatorSelector_apply"/>
            </list>
        </operator>
        <operator name="OperatorSelector_apply" class="OperatorSelector" expanded="yes">
            <operator name="LibSVMLearner (2)" class="LibSVMLearner">
                <parameter key="C" value="10000.0"/>
                <list key="class_weights">
                </list>
                <parameter key="svm_type" value="nu-SVR"/>
            </operator>
            <operator name="LinearRegression (2)" class="LinearRegression">
            </operator>
        </operator>
    </operator>
    The second question that comes to my mind is how to propagate the best
    parameters. In your current process, you just pass one of the two learners
    to OperatorSelector_apply. But with my parameter optimization, two pieces of
    information must be passed: 1) the best learner (as currently done) and
    2) the corresponding parameter set for that learner. How can this be achieved
    when choosing among different learners?

    I would appreciate again your help. :-)

    Regards,
    Chris

  • chris_ml Member Posts: 17 Maven
    Hi guys,

    I'm still stuck with this problem and can't continue my evaluations.  :-\
    Please help me out.

    Thanks a lot.

    Chris
  • steffen Member Posts: 347 Maven
    Hello Chris

    Your posted setup does not work because the requirements of the corresponding operators are not met. If you select an operator and press "F1", a window opens where you can find the required input, the delivered output, and the requirements for the inner operators (if the selected operator is some kind of OperatorChain).

    GridParameterOptimization requires that its child operators deliver a performance vector. To produce a performance vector you must build a standard train-the-model-and-apply-it scheme. Something like this:
    <operator name="Root" class="Process" expanded="yes">
        <operator name="GridParameterOptimization" class="GridParameterOptimization" expanded="no">
            <list key="parameters">
              <parameter key="OperatorSelector_train.select_which" value="[1.0;2.0;10;linear]"/>
            </list>
            <operator name="XValidation" class="XValidation" expanded="no">
                <parameter key="average_performances_only" value="false"/>
                <parameter key="keep_example_set" value="true"/>
                <parameter key="sampling_type" value="shuffled sampling"/>
                <operator name="train" class="OperatorChain" expanded="yes">
                    <operator name="LinearRegression" class="LinearRegression">
                    </operator>
                </operator>
                <operator name="apply" class="OperatorChain" expanded="yes">
                    <operator name="ModelApplier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="RegressionPerformance" class="RegressionPerformance">
                        <parameter key="absolute_error" value="true"/>
                    </operator>
                </operator>
            </operator>
        </operator>
    </operator>
    Got it? Fine, now let's deal with your problems in detail:

    I suggest a process like this:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="attributes_lower_bound" value="-40.0"/>
            <parameter key="attributes_upper_bound" value="30.0"/>
            <parameter key="number_examples" value="80"/>
            <parameter key="number_of_attributes" value="1"/>
            <parameter key="target_function" value="one variable non linear"/>
        </operator>
        <operator name="opt_linreg" class="GridParameterOptimization" expanded="no">
            <list key="parameters">
              <parameter key="LinearRegression.feature_selection" value="none,M5 prime,greedy"/>
              <parameter key="LinearRegression.ridge" value="[0.0;1000.0;10;linear]"/>
            </list>
            <operator name="XValidation" class="XValidation" expanded="no">
                <parameter key="average_performances_only" value="false"/>
                <parameter key="keep_example_set" value="true"/>
                <parameter key="sampling_type" value="shuffled sampling"/>
                <operator name="LinearRegression" class="LinearRegression">
                    <parameter key="feature_selection" value="greedy"/>
                    <parameter key="ridge" value="1000.0"/>
                </operator>
                <operator name="OperatorChain" class="OperatorChain" expanded="no">
                    <operator name="ModelApplier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="RegressionPerformance" class="RegressionPerformance">
                        <parameter key="absolute_error" value="true"/>
                    </operator>
                </operator>
            </operator>
        </operator>
        <operator name="PerformanceWriter" class="PerformanceWriter">
            <parameter key="performance_file" value="C:\linreg.per"/>
        </operator>
        <operator name="opt_libsvm" class="GridParameterOptimization" expanded="no">
            <list key="parameters">
              <parameter key="LibSVMLearner.svm_type" value="epsilon-SVR,nu-SVR"/>
              <parameter key="LibSVMLearner.degree" value="[1.0;1000.0;10;linear]"/>
            </list>
            <operator name="XValidation (2)" class="XValidation" expanded="yes">
                <parameter key="average_performances_only" value="false"/>
                <parameter key="keep_example_set" value="true"/>
                <parameter key="sampling_type" value="shuffled sampling"/>
                <operator name="LibSVMLearner" class="LibSVMLearner">
                    <parameter key="C" value="10000.0"/>
                    <list key="class_weights">
                    </list>
                    <parameter key="degree" value="301"/>
                    <parameter key="svm_type" value="nu-SVR"/>
                </operator>
                <operator name="OperatorChain (2)" class="OperatorChain" expanded="yes">
                    <operator name="ModelApplier (2)" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="RegressionPerformance (2)" class="RegressionPerformance">
                        <parameter key="absolute_error" value="true"/>
                    </operator>
                </operator>
            </operator>
        </operator>
        <operator name="PerformanceWriter (2)" class="PerformanceWriter">
            <parameter key="performance_file" value="C:\libsvm.per"/>
        </operator>
        <operator name="kill_all_performance_measures" class="IOConsumer">
            <parameter key="io_object" value="PerformanceVector"/>
        </operator>
        <operator name="select_model" class="GridParameterOptimization" expanded="no">
            <list key="parameters">
              <parameter key="OperatorSelector_train.select_which" value="[1.0;2.0;1;linear]"/>
            </list>
            <operator name="OperatorSelector_train" class="OperatorSelector" expanded="yes">
                <parameter key="select_which" value="2"/>
                <operator name="PerformanceLoader" class="PerformanceLoader">
                    <parameter key="performance_file" value="C:\linreg.per"/>
                </operator>
                <operator name="PerformanceLoader (2)" class="PerformanceLoader">
                    <parameter key="performance_file" value="C:\libsvm.per"/>
                </operator>
            </operator>
        </operator>
        <operator name="ParameterSetter" class="ParameterSetter">
            <list key="name_map">
              <parameter key="OperatorSelector_train" value="OperatorSelector_apply"/>
            </list>
        </operator>
        <operator name="OperatorSelector_apply" class="OperatorSelector" expanded="no">
            <parameter key="select_which" value="2"/>
            <operator name="OperatorChain (3)" class="OperatorChain" expanded="no">
                <operator name="linregset" class="ParameterSetLoader">
                    <parameter key="parameter_file" value="C:\linreg.par"/>
                </operator>
                <operator name="ParameterSetter (2)" class="ParameterSetter">
                    <list key="name_map">
                      <parameter key="opt_linreg" value="apply_linreg"/>
                    </list>
                </operator>
                <operator name="apply_linreg" class="LinearRegression">
                </operator>
            </operator>
            <operator name="OperatorChain (4)" class="OperatorChain" expanded="no">
                <operator name="ParameterSetter (3)" class="ParameterSetter">
                    <list key="name_map">
                      <parameter key="opt_libsvm" value="apply_libsvm"/>
                    </list>
                </operator>
                <operator name="apply_libsvm" class="LibSVMLearner">
                    <parameter key="C" value="10000.0"/>
                    <list key="class_weights">
                    </list>
                    <parameter key="svm_type" value="nu-SVR"/>
                </operator>
            </operator>
        </operator>
        <operator name="IOConsumer" class="IOConsumer">
            <parameter key="io_object" value="ParameterSet"/>
        </operator>
        <operator name="IOConsumer (2)" class="IOConsumer">
            <parameter key="io_object" value="PerformanceVector"/>
        </operator>
    </operator>

    The drawback of this setup is that the performance vectors have to be saved to disk, but I hope this is not too severe. Two reasons:
    • As far as I can see, it is currently not possible to move the parameter set beyond the model-selection optimization operator (within the process).
    • More importantly: you should take a look at the difference between the performance measures. Why? Imagine two kinds of models, one simple and fast, the other slow and complicated. The automatic selection determines that the second one is better, but looking at the performance values you notice that it is only 10^-4 better than the first one. Does this difference justify a much more complicated model?
    One may ask: why not nest the optimizations, as you tried in your last post? The main reason is that you would not be able to get a usable parameter set. But:
    If your data set is big enough, you can use a nested optimization to show that the grid optimization in combination with the selected learning algorithm is not going to hurt the generalization power of the model.
    The suggested process above can also do this, but in my opinion it is less powerful in judging the generalization power. On the other hand, it is capable of producing a parameter set for the given data set. So let's use it.
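    One more remark: the ParameterSetLoader inside OperatorChain (3) expects the file C:\linreg.par to exist. I assume you therefore also need a ParameterSetWriter directly after opt_linreg (and analogously for the SVM branch, if you want to load its parameters from disk as well) that writes the optimal parameter set found by the optimization. A sketch of what I mean:

    <operator name="ParameterSetWriter" class="ParameterSetWriter">
        <!-- writes the parameter set delivered by opt_linreg; same file the loader reads -->
        <parameter key="parameter_file" value="C:\linreg.par"/>
    </operator>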

    greetings

    Steffen

    PS: @RapidMiner guys: if you remove the last two consumer operators, some strange result objects pop out of nowhere (or their names are somehow twisted). Setting a breakpoint after OperatorChain (4) shows that everything is all right. Since I consider OperatorSelector a simple extension of OperatorChain, I wonder where the "other" result objects are "generated".