# Relative Overfitting Rate

Member Posts: 19 Maven
Hi there,

I have a question regarding the calculation of the "relative overfitting rate". The background is a comparison of different parameter settings and their respective overfitting behavior.

The relative overfitting rate was proposed in:
Efron, B.; Tibshirani, R.: Improvements on Cross-Validation: The .632+ Bootstrap Method. Journal of the American Statistical Association 92 (1997), pp. 548–560.

In this paper the .632 bootstrap method is enhanced by a weighting mechanism, which is irrelevant for this post. The relevant question concerns the formula for the relative overfitting rate, defined in formula 28 (see below): R is the relative overfitting rate, Êrr1 the bootstrap leave-one-out error, and err the "empirical error" (formula 7). Formula 27 shows the calculation of gamma for a binary classifier. Now here is the question:

Can anyone please explain to me how I can adapt this concept to a regression problem? I have a dataset of 30 attributes and about 300 examples, for which I predict a label in the range 0.01 to 0.1. I have trouble understanding the mathematics behind it, and the notation. I can retrieve Êrr1 from the Bootstrap operator of RM, but how do I calculate the rest?

Any help is greatly appreciated.
Best regards

• Member Posts: 537 Guru
Dear Tek,

First you should measure the variability in your cross validation estimates.

Best regards,

Wessel
• Member Posts: 19 Maven
Hi there,

Thanks for the reply. How do I measure the variability?

Besides, I found some new information. I now understand that err is the resubstitution error (test and training set are identical). And that, to quantify overfitting, I need some sort of "number" describing the maximal overfitting (here that is gamma, the "no-information error"). Gamma is described as the loss averaged over all possible combinations of labels with all possible predictions.

But still, two questions remain:

1. How the hell do I actually calculate gamma?
2. For regression, do I have to replace the error function with the RMS error? (Which would actually make sense in a certain way.)
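On question 1, at least for squared loss the "all labels with all predictions" description translates directly: gamma averages the loss over every pairing of an observed label y_i with a prediction f(x_j), so the predictions carry no information about the labels. A minimal sketch with illustrative numbers:

```python
def gamma_no_info(labels, preds):
    """No-information error for squared loss: average of (y_i - f(x_j))^2
    over ALL n*m label/prediction pairs (squared-error analogue of formula 27)."""
    return sum((y - p) ** 2 for y in labels for p in preds) \
           / (len(labels) * len(preds))

# Illustrative values in the label range from the thread (0.01 to 0.1):
y     = [0.02, 0.05, 0.09]   # observed labels
preds = [0.03, 0.04, 0.08]   # model predictions on the same examples
print(round(gamma_no_info(y, preds), 6))   # 0.0013
```

For RMS rather than mean squared error you would take the square root at the end; the paper works with the mean loss, so MSE is the more direct analogue.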

Best regards
• Member Posts: 537 Guru
Hey,

This is part of my own unpublished research, so I don't want to give every detail.

One way to estimate the variance of the error estimate is as follows:
1. Split the data into two parts.
2. Run cross-validation on the first part to obtain a performance estimate.
3. Apply the model (trained on the first part of the data) to the second part and obtain the real performance.
4. Calculate the difference (error) between the real performance and the performance estimate.
5. Repeat this procedure for many different data splits.
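The steps above can be sketched outside RapidMiner as well; a self-contained Python version with synthetic data and a one-variable least-squares model (the data and model are illustrative, not the setup from the thread):

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def mse(model, xs, ys):
    a, b = model
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cv_estimate(xs, ys, k=10):
    """Plain k-fold cross-validation estimate of the MSE."""
    errs = []
    for fold in range(k):
        test = [i for i in range(len(xs)) if i % k == fold]
        train = [i for i in range(len(xs)) if i % k != fold]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errs.append(mse(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(errs) / len(errs)

random.seed(0)
diffs = []
for _ in range(50):                                   # step 5: many splits
    xs = [random.uniform(0, 10) for _ in range(100)]
    ys = [2 * x + random.gauss(0, 1) for x in xs]
    xs1, ys1 = xs[:90], ys[:90]                       # step 1: split 90/10
    xs2, ys2 = xs[90:], ys[90:]
    esti = cv_estimate(xs1, ys1)                      # step 2: CV estimate
    real = mse(fit_line(xs1, ys1), xs2, ys2)          # step 3: real performance
    diffs.append(esti - real)                         # step 4: the difference

mean_diff = sum(diffs) / len(diffs)
var_diff = sum(d * d for d in diffs) / len(diffs)
print("mean(esti - real) =", mean_diff)
print("var(esti - real)  =", var_diff)
```

With an unbiased validation procedure, mean(esti - real) should be close to zero, while var(esti - real) tells you how much a single estimate can be trusted.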

So basically this is a way to validate the validation procedure.
Theory states that cross-validation is an unbiased estimator,
so you expect the attribute "esti-real" to have a mean close to zero.
For the synthetic data set I used, it was 4.0, so rather close to zero.
But once you compute the variance, "(esti-real)^2" = 1080.5, you realize the variance is rather high.
So theory says there is room for improvement.

One way to reduce the variance is to use a different value of k in k-fold cross-validation.
But recent developments suggest that it's better to use multiple values of k, or to combine cross-validation with bootstrap validation.

real              avg = 79.3372   +/- 19.6535   [33.9304 ; 129.2291]
prediction(real)  avg = 79.3372   +/- 5.4302    [66.9098 ; 88.8779]
esti              avg = 83.3311   +/- 21.3748   [45.7766 ; 132.2486]
esti-real         avg = 3.9939    +/- 32.7915   [-66.4296 ; 93.7430]
(esti-real)^2     avg = 1080.4833 +/- 1364.4548 [0.0276 ; 8787.7420]
(see figure below)
• Member Posts: 537 Guru
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
<process expanded="true" height="391" width="701">
<operator activated="true" class="retrieve" compatibility="5.1.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Polynomial"/>
</operator>
<operator activated="true" class="loop" compatibility="5.1.008" expanded="true" height="94" name="Loop Procedure" width="90" x="179" y="30">
<parameter key="iterations" value="100"/>
<process expanded="true" height="409" width="705">
<operator activated="true" class="split_data" compatibility="5.1.008" expanded="true" height="94" name="Split Outer" width="90" x="45" y="165">
<enumeration key="partitions">
<parameter key="ratio" value="0.9"/>
<parameter key="ratio" value="0.1"/>
</enumeration>
<parameter key="sampling_type" value="stratified sampling"/>
</operator>
<operator activated="true" class="x_validation" compatibility="5.1.008" expanded="true" height="112" name="Validation" width="90" x="180" y="30">
<parameter key="sampling_type" value="shuffled sampling"/>
<process expanded="true" height="409" width="299">
<operator activated="true" class="linear_regression" compatibility="5.1.008" expanded="true" height="94" name="Linear Regression (2)" width="90" x="112" y="30"/>
<connect from_port="training" to_op="Linear Regression (2)" to_port="training set"/>
<connect from_op="Linear Regression (2)" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="409" width="346">
<operator activated="true" class="apply_model" compatibility="5.1.008" expanded="true" height="76" name="Apply Inner" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.1.008" expanded="true" height="76" name="Perf Inner" width="90" x="246" y="30"/>
<connect from_port="model" to_op="Apply Inner" to_port="model"/>
<connect from_port="test set" to_op="Apply Inner" to_port="unlabelled data"/>
<connect from_op="Apply Inner" from_port="labelled data" to_op="Perf Inner" to_port="labelled data"/>
<connect from_op="Perf Inner" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="apply_model" compatibility="5.1.008" expanded="true" height="76" name="Apply Outer" width="90" x="313" y="165">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.1.008" expanded="true" height="76" name="Perf Outer" width="90" x="447" y="120"/>
<operator activated="true" class="log" compatibility="5.1.008" expanded="true" height="94" name="Log" width="90" x="585" y="30">
<list key="log">
<parameter key="estimate" value="operator.Perf Inner.value.performance"/>
<parameter key="real" value="operator.Perf Outer.value.performance"/>
<parameter key="iteration" value="operator.Loop Procedure.value.iteration"/>
</list>
</operator>
<connect from_port="input 1" to_op="Split Outer" to_port="example set"/>
<connect from_op="Split Outer" from_port="partition 1" to_op="Validation" to_port="training"/>
<connect from_op="Split Outer" from_port="partition 2" to_op="Apply Outer" to_port="unlabelled data"/>
<connect from_op="Validation" from_port="model" to_op="Apply Outer" to_port="model"/>
<connect from_op="Apply Outer" from_port="labelled data" to_op="Perf Outer" to_port="labelled data"/>
<connect from_op="Perf Outer" from_port="performance" to_op="Log" to_port="through 1"/>
<connect from_op="Log" from_port="through 2" to_port="output 2"/>
<portSpacing port="source_input 1" spacing="90"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
</operator>
<operator activated="true" class="log_to_data" compatibility="5.1.008" expanded="true" height="94" name="Log to Data" width="90" x="313" y="30">
<parameter key="log_name" value="Log"/>
</operator>
<operator activated="true" class="store" compatibility="5.1.008" expanded="true" height="60" name="Store" width="90" x="447" y="30">
<parameter key="repository_entry" value="X"/>
</operator>
<operator activated="true" class="retrieve" compatibility="5.1.008" expanded="true" height="60" name="Result" width="90" x="45" y="165">
<parameter key="repository_entry" value="X"/>
</operator>
<operator activated="true" class="set_role" compatibility="5.1.008" expanded="true" height="76" name="label: real" width="90" x="179" y="165">
<parameter key="name" value="real"/>
<parameter key="target_role" value="label"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="5.1.008" expanded="true" height="76" name="input: estimate" width="90" x="313" y="165">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="estimate"/>
</operator>
<operator activated="true" class="linear_regression" compatibility="5.1.008" expanded="true" height="94" name="Linear Regression" width="90" x="447" y="165"/>
<operator activated="true" class="apply_model" compatibility="5.1.008" expanded="true" height="76" name="Apply Model" width="90" x="581" y="165">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="5.1.008" expanded="true" height="76" name="Generate Attributes" width="90" x="581" y="255">
<list key="function_descriptions">
<parameter key="estimate-real" value="estimate-real"/>
<parameter key="abs(estimate-real)" value="abs(estimate-real)"/>
</list>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Loop Procedure" to_port="input 1"/>
<connect from_op="Loop Procedure" from_port="output 1" to_op="Log to Data" to_port="through 1"/>
<connect from_op="Log to Data" from_port="exampleSet" to_op="Store" to_port="input"/>
<connect from_op="Store" from_port="through" to_port="result 1"/>
<connect from_op="Result" from_port="output" to_op="label: real" to_port="example set input"/>
<connect from_op="label: real" from_port="example set output" to_op="input: estimate" to_port="example set input"/>
<connect from_op="input: estimate" from_port="example set output" to_op="Linear Regression" to_port="training set"/>
<connect from_op="Linear Regression" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Linear Regression" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<connect from_op="Generate Attributes" from_port="example set output" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="126"/>
<portSpacing port="sink_result 3" spacing="54"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>

• Member Posts: 19 Maven
Hi,

that's an interesting process. I did some further research on the overfitting subject, though. Regarding your process: is it possible to feed the ANOVA (performance) operator the real true values of the dataset? That is:

Input 1 is the performance vector (or rather the predictions) of the trained model
Input 2 are the real (true) label values of the original data set

If that's not possible, I would have to find a way to output the predictions and calculate the ANOVA (or R², respectively) in Excel.

Thanks again!
• Member Posts: 537 Guru
Hey,

I'm not sure I understand the question.
R squared is a measure used when training and testing on the same data.
There is also the corrected (adjusted) R squared, which corrects for the number of parameters.

You can calculate the R squared using the Linear Regression operator (I think),
and also using the T-Test or ANOVA operator (I think).
But I never used it, because the measure is ill-suited: for a lot of learners the number of parameters is an ill-defined concept.
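For reference, the corrected (adjusted) R squared is the standard textbook formula; a quick sketch (n examples, p predictors, intercept not counted):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).
    Penalizes R^2 for the number of predictors p (n = number of examples)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# With the thread's 300 examples and 30 attributes the penalty is mild:
print(adjusted_r2(0.90, n=300, p=30))
```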

Best regards,

Wessel
• Member Posts: 19 Maven
Hey,

let me try to clarify that:

R² is defined as the fraction of "variance explained through regression" / "total variance", or alternatively: 1 - "variance not explained by regression" / "total variance". In the ANOVA chart this would be equivalent to "in between" / "total", or 1 - "residual" / "total" respectively.
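That decomposition can be written out directly; a minimal sketch computing R² in the 1 - "residual"/"total" form:

```python
def r_squared(y_true, y_pred):
    """R^2 via the ANOVA decomposition: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_total = sum((y - mean_y) ** 2 for y in y_true)             # "total"
    ss_resid = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # "residual"
    return 1.0 - ss_resid / ss_total

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]   # a close fit
print(r_squared(y_true, y_pred))   # ~0.98
```

For least-squares regression with an intercept, the "in between"/"total" form gives the same number; on unseen data only the 1 - residual/total form remains well defined (and can even go negative).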

Furthermore, in regression, R² is an indicator of how well a function fits its underlying data. Thus, an R² close to 1 CAN already be a first indicator of overfitting, because all of the variance is explained by the regression model (though certainly this is not the holy grail: it might very well be that the trained model simply fits the data well). In the next step, one can compare the change of R² between the training phase and the test phase (by "test phase" I mean the holdout method, i.e. new unseen data), and the change of the error, too.

Now, my idea is this: if R² shrinks from the training to the testing phase (again, by "testing phase" I don't mean the X-Validation testing phase, but the actual test on unseen data), one can assume overfitting on the training data. A second indicator would be that the error increases from training to testing.

This follows from the definition of R²: if overfitting occurred, then unseen data is predicted incorrectly, so the "variance explained through regression" shrinks while the "total variance" might not change at all, causing R² to shrink. On the other hand, the error on unseen data should increase relative to the error on the training data (which, being overfitted, is very small).
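That argument can be demonstrated numerically: take the extreme overfit, an interpolating polynomial through noisy training points, and compare R² on the training data with R² on fresh data from the same source (everything below is synthetic and purely illustrative):

```python
import math
import random

def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_resid = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_total = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_resid / ss_total

def interpolate(xs, ys, x):
    """Lagrange interpolation: the degree n-1 polynomial through all points,
    i.e. the extreme overfit (training residuals are exactly zero)."""
    total = 0.0
    for i in range(len(xs)):
        term = ys[i]
        for j in range(len(xs)):
            if j != i:
                term *= (x - xs[j]) / (xs[i] - xs[j])
        total += term
    return total

def truth(x):
    return math.sin(2 * math.pi * x)

random.seed(1)
xs_train = [i / 9 for i in range(10)]
ys_train = [truth(x) + random.gauss(0, 0.2) for x in xs_train]

# Training R^2 is exactly 1: the interpolant hits every training point.
r2_train = r_squared(ys_train, [interpolate(xs_train, ys_train, x)
                                for x in xs_train])

# Fresh data from the same source: R^2 collapses.
xs_test = [(i + 0.5) / 9 for i in range(9)]
ys_test = [truth(x) + random.gauss(0, 0.2) for x in xs_test]
r2_test = r_squared(ys_test, [interpolate(xs_train, ys_train, x)
                              for x in xs_test])

print("train R^2 =", r2_train)
print("test  R^2 =", r2_test)
```

On unseen data the R² can even turn negative, since the "variance explained" reading of the decomposition breaks down there.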

Combining these two, maybe one can draw conclusions about overfitting?

Thanks for further help! Maybe I am completely wrong with my assumptions here. ; )

PS: You mentioned that R² is only used if training and test data are the same. Wouldn't it be more correct to say that R² can only be used if the means of the training and testing data are the same?

PPS: Another idea: if you compare the residuals (say, in a histogram) from the training to the testing phase, shouldn't the histogram change its shape from a pyramid-like form to a more U-shaped form? (Rephrased: the overfitted model cannot predict the unseen data correctly, so the number of larger residuals will increase.)