Getting reliable results. Which model to choose?

cliftonarms Member Posts: 32 Contributor II
Advice kindly sought from any "seasoned" data predictors / miners out there.

I have created an experiment within Rapidminer to iterate through different inputs and modelling configurations, attempting to find the "best prediction fit" for my data.

The data consists of 3100 rows of learning data and 300 rows of unseen testing data.

Each dot on the graph below represents an individual model, plotted at its learning performance vs its testing performance (the scale is not relevant).

[image: scatter plot of learning performance vs testing performance for each model]

My question is : which model should I choose to produce the most reliable and robust prediction of new "unseen" data?
  • Choose a model from the ORANGE area, where the training performance was very good but the testing performance was poor.
  • Choose a model from the BLUE area, where both the training and the testing performance were good.
  • Choose a model from the GREEN area, where the training performance was poor but the testing performance was very good.
Ask any questions, and thank you in advance for your help.

Answers

  • Skirzynski Member Posts: 164 Maven
    What do you mean by "learning performance"? The error on the training set? Could you post the process from your experiment?
  • cliftonarms Member Posts: 32 Contributor II
    The process is very large and complicated, so it would not be appropriate to post.


    Learning performance is the prediction profit averaged over ALL 3100 rows of training data.
    Testing performance is the prediction profit averaged over ALL 300 rows of testing ( unseen ) data.

    The higher the profit the better the performance of the prediction system.

    My problem is that the trained models do not perform at the same prediction rate on the unseen data (obviously), so it's a question of how to choose the best model to go live with.
  • Skirzynski Member Posts: 164 Maven
    Evaluating models on the whole training data is never a good idea and leads to overfitting, i.e. models that perform well on the training data can fail completely on test data, so that performance is not very meaningful. For instance, a kNN classifier will predict the training set with 100% accuracy for the parameter k=1, but will not yield good results for unseen data. Therefore you shouldn't measure performance on the same data you used to learn the model, i.e. the learning performance is not useful at all.

    It seems that you have a label for your test set (which you need to measure performance). My suggestion would be to join the training and test data and apply a parameter optimization. Within this optimization operator you should use cross-validation (X-Validation) to calculate the performance for a specific parameter set. Basically, a cross-validation will partition the input into k subsets. The model is learned on k-1 subsets and tested on the k-th subset. This is repeated until every subset has been used as the test set exactly once. In the end an average performance is returned. This gives a more reliable performance measure for selecting the best parameters for a learner.

    After you have found the best parameter set, use this to learn a model on the whole data set and use this model to predict completely unseen data.
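    For reference, the resampling scheme described above can be sketched in plain Python (a minimal illustration of the k-fold logic only, not the X-Validation operator itself; the function names are hypothetical):

    ```python
    import random

    def k_fold_cross_validation(data, k, train_fn, eval_fn, seed=0):
        """Estimate performance with k-fold cross-validation.

        data     : list of labelled examples
        train_fn : builds a model from a list of examples
        eval_fn  : scores a model on a held-out list of examples
        """
        shuffled = data[:]
        random.Random(seed).shuffle(shuffled)
        # Partition the shuffled data into k near-equal folds.
        folds = [shuffled[i::k] for i in range(k)]
        scores = []
        for i in range(k):
            held_out = folds[i]
            training = [ex for j in range(k) if j != i for ex in folds[j]]
            model = train_fn(training)               # learn on k-1 folds
            scores.append(eval_fn(model, held_out))  # test on the k-th fold
        # The average over all k held-out folds is the reported performance.
        return sum(scores) / k
    ```

    The point is that every example is scored exactly once as unseen data, which is what makes the averaged figure a more honest estimate than training-set performance.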
  • cliftonarms Member Posts: 32 Contributor II
    Marcin wrote:

    Evaluating models on the whole training data is never a good idea and leads to overfitting, i.e. models that perform well on the training data can fail completely on test data, so that performance is not very meaningful. For instance, a kNN classifier will predict the training set with 100% accuracy for the parameter k=1, but will not yield good results for unseen data. Therefore you shouldn't measure performance on the same data you used to learn the model, i.e. the learning performance is not useful at all.
    I did wonder why some of my models had 100% predictions - I thought it was a bug.
    Marcin wrote:
    It seems that you have a label for your test set (which you need to measure the performance). My suggestion would be to join the training- and test data and apply a parameter optimization. Within this optimization operator you should use the cross validation (X-Validation) to calculate the performance for a specific parameter set. Basically, a cross validation will partition the input into k subsets. The model will be learned on k-1 subsets and tested on the k-th subset. This will be repeated until every subset was used as test set exactly once. In the end an average performance will be returned. This gives us a more reliable performance measure to select the best parameters for a learner.
    This is exactly how I am producing the models shown above. The 3100 rows of data are used in a 10-fold x-validation process, with the input parameters being varied (by attribute number and selection method) to produce different optimised models and the learning performance. The unseen data is then applied to these different models to give the testing performance measure. So the graph above is seen vs unseen performance for each model.

    It's a question of which model to go with - should I just go with the best unseen-data performance? i.e. the models in the green circle on the graph.
  • Skirzynski Member Posts: 164 Maven
    cliftonarms wrote:

    This is exactly how I am producing the models shown above. The 3100 rows of data is used in a 10 fold x-validation process, with the input parameters being varied ( by attribute number and selection method ) to produce different optimised models. The unseen data is then applied to these different models to give a performance measure. So the graph above is seen vs unseen performance for each model.
    If I understand you correctly, you are doing the parameter optimization manually. I will attach a small process with k-NN which shows how to do it automatically. The output of the "Optimal Learner" is a model which you should use to classify new unseen data. The input is Iris in my case and should be your whole set of labelled data in your case. Please note that I am only optimizing the parameter k; this can differ depending on the learner you are using.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.005">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.005" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="5.3.005" expanded="true" height="60" name="Retrieve Iris" width="90" x="45" y="30">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="5.3.005" expanded="true" height="94" name="Multiply" width="90" x="179" y="165"/>
          <operator activated="true" class="optimize_parameters_grid" compatibility="5.3.005" expanded="true" height="94" name="Optimize Parameters (Grid)" width="90" x="313" y="30">
            <list key="parameters">
              <parameter key="Learner.k" value="[2.0;100.0;100;linear]"/>
            </list>
            <process expanded="true">
              <operator activated="true" class="x_validation" compatibility="5.3.005" expanded="true" height="112" name="Validation" width="90" x="112" y="30">
                <process expanded="true">
                  <operator activated="true" class="k_nn" compatibility="5.3.005" expanded="true" height="76" name="Learner" width="90" x="112" y="30">
                    <parameter key="k" value="100"/>
                  </operator>
                  <connect from_port="training" to_op="Learner" to_port="training set"/>
                  <connect from_op="Learner" from_port="model" to_port="model"/>
                  <portSpacing port="source_training" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                  <portSpacing port="sink_through 1" spacing="0"/>
                </process>
                <process expanded="true">
                  <operator activated="true" class="apply_model" compatibility="5.3.005" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                    <list key="application_parameters"/>
                  </operator>
                  <operator activated="true" class="performance" compatibility="5.3.005" expanded="true" height="76" name="Performance" width="90" x="246" y="30"/>
                  <connect from_port="model" to_op="Apply Model" to_port="model"/>
                  <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
                  <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
                  <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
                  <portSpacing port="source_model" spacing="0"/>
                  <portSpacing port="source_test set" spacing="0"/>
                  <portSpacing port="source_through 1" spacing="0"/>
                  <portSpacing port="sink_averagable 1" spacing="0"/>
                  <portSpacing port="sink_averagable 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="input 1" to_op="Validation" to_port="training"/>
              <connect from_op="Validation" from_port="averagable 1" to_port="performance"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_result 1" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="set_parameters" compatibility="5.3.005" expanded="true" height="94" name="Set Parameters" width="90" x="447" y="165">
            <list key="name_map">
              <parameter key="Learner" value="Optimal Learner"/>
            </list>
          </operator>
          <operator activated="true" class="k_nn" compatibility="5.3.005" expanded="true" height="76" name="Optimal Learner" width="90" x="581" y="165">
            <parameter key="k" value="19"/>
          </operator>
          <connect from_op="Retrieve Iris" from_port="output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Set Parameters" to_port="through 1"/>
          <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_op="Set Parameters" to_port="parameter set"/>
          <connect from_op="Set Parameters" from_port="through 1" to_op="Optimal Learner" to_port="training set"/>
          <connect from_op="Optimal Learner" from_port="model" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • cliftonarms Member Posts: 32 Contributor II
    Thanks - my method is slightly different and more complex - I wonder whether the added complexity is gaining me anything?

    With your example (and I understand it is just an example) you have no control over the model generated and its performance on unseen data - a sort of "open loop" optimisation.

    I automatically vary the attribute selection via weighting (17 methods), the number of attributes (5-57), the classification kernel (SVM) and the model parameters (C / gamma / nu). Each combination is run through an x-validation on the 3100 learning rows, so all the legwork is done automatically.

    So I am varying everything to give me a large set of possible models.

    I then apply these models to the unseen data to check the performance of each model.

    I don't think it's the model creation I have a problem with. It's after the model is validated: do I just pick the model that performs best on "unseen data"? Is it that simple? e.g. the model with a testing (unseen) performance score of 16.5 in the green circle on the graph above.
  • wessel Member Posts: 537 Maven
    Please make the same figure, but now replace "training performance" with "cross-validation on training data performance".

    Best regards,

    Wessel
  • cliftonarms Member Posts: 32 Contributor II
    The "training performance" is the 3100 rows of training data applied directly to the model generated by the x-validation process.

    The only reason it is not a % figure is that % prediction correctness is not a useful measure of performance for this system. The performance number is calculated from the prediction results after the data is applied to the model.



  • wessel Member Posts: 537 Maven
    cliftonarms wrote:

    The "training performance" is the 3100 rows of training data applied directly to the model generated by the x-validation process.
    What? This statement is seriously confusing.

    Out of the cross-validation operator comes a model, yes.
    This is the model on entire training data, yes.
    But this is NOT cross-validation performance.

    You should make the same figure where you use "cross-validation performance" on 1 axis.

    Best regards,

    Wessel
  • cliftonarms Member Posts: 32 Contributor II
    OK - I missed two words out.

    The "training performance" is the 3100 rows of training data applied directly to the model that results from the x-validation process.

    The actual average x-validation performance is not captured as it does not represent a reliable performance measure in this scenario.

  • wessel Member Posts: 537 Maven
    Why would x-validation performance be a bad measure (worse than full training set performance)?
    If that is the case, you might as well not have done any x-validation.

    Just generate the figure?
    Then you have full-training set performance, x-validation performance, and hold out set performance.
    I would like to see the  x-validation performance and hold out set performance figure.
    As far as I'm aware this is 5 minutes work right?
  • cliftonarms Member Posts: 32 Contributor II
    Thank you - you have given me the Eureka moment I needed: if the x-validation performance measure is no good, then don't use x-validation. I will just validate using my own performance measure.

    The problem I have is even though my problem space is fundamentally a binomial classification task, each individual prediction carries a different cost.

    For example (although this is not my actual problem): blood pressure classification = blood pressure too high or blood pressure too low.

    However, misclassifying someone whose blood pressure is SLIGHTLY too high/low is far less serious than misclassifying someone whose blood pressure is VERY high/low. Hence the % validation performance is useless; I add a unique cost (the actual blood pressure variance around normal) to each classification prediction and average over all predicted examples to find the total system performance.
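    A cost-weighted average like the one described can be sketched in a few lines of Python (a minimal illustration under the blood-pressure analogy; the function name and data layout are hypothetical, not the actual process):

    ```python
    def cost_weighted_performance(predictions):
        """Average a per-example, cost-weighted score over all predictions.

        predictions: list of (correct, cost) pairs, where `cost` reflects
        the severity of the example (e.g. how far the blood pressure
        deviates from normal).  A correct prediction earns +cost and a
        misclassification loses -cost, so severe cases dominate the score.
        """
        total = sum(cost if correct else -cost for correct, cost in predictions)
        return total / len(predictions)
    ```

    Unlike plain % accuracy, a wrong call on a severe case pulls this score down far more than a wrong call on a borderline one.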
  • wessel Member Posts: 537 Maven
    "Hence the % validation performance is useless"
    This statement is untrue.

    First of all, it is written cross-validation or x-validation, not % validation.
    Secondly, x-validation is a sampling process and has nothing to do with classification cost.
    You should simply change your "measure" of performance to reflect this cost.
    A trick that is sometimes used to reflect different costs while using "standard accuracy" as a performance measure is copying the instances with high cost.
    This can help your learner pick up on the correct patterns.
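    That copying trick can be sketched as follows (an illustrative example only; the function name and the (features, label, cost) layout are assumptions, not part of the actual process):

    ```python
    def oversample_by_cost(examples, base_cost=1.0):
        """Replicate each example in proportion to its cost.

        examples: list of (features, label, cost) triples.  Returns a flat
        list of (features, label) pairs in which an example of cost c
        appears roughly c / base_cost times, so a learner optimising plain
        accuracy implicitly weights high-cost cases more heavily.
        """
        expanded = []
        for features, label, cost in examples:
            copies = max(1, round(cost / base_cost))
            expanded.extend([(features, label)] * copies)
        return expanded
    ```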

    Also, check out the "Performance (Costs)" operator!

    Best regards,

    Wessel
  • cbarragan Member Posts: 3 Contributor I
    If this were all the information available, some model in the blue region would seem the best bet - not because of what can be told of their performance, but because the other two areas have bigger issues.

    The models in the orange area are guaranteed to be bad at generalizing and therefore any prediction can be expected to be bad.

    The models in the green area are the more challenging ones. I think they are a result of how the experiment is being conducted: models with really poor performance on the "learning" set and very good performance on the "testing" set seem to be the result of randomly chosen parameters that happen, by coincidence, to do well on that particular test set - not the result of a good model. I think it is fair to say that these models are not approximating the overall error surface well, so even though they approximate the test cases well, you can't rely on them.

    Hope this point of view helps.