Confused by the numerical XValidation output

Here is my question this time: why does the RMS error printed by XValidation decrease with the number of validations?

Here is a simple example:

Data set:

0, 0.18224201
1, 2.002307783
2, 4.187028114
...
49, 98.21944595

(this is simply Y = 2*X + rand() - 0.5)
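The data set can be reproduced with a short script (a sketch: the original apparently used a spreadsheet-style rand(); here Python's random.random() plays the same role, giving uniform noise in [-0.5, 0.5)):

```python
import random

# 50 points following Y = 2*X + rand() - 0.5
rows = [(x, 2 * x + random.random() - 0.5) for x in range(50)]

for x, y in rows[:3]:
    print(f"{x}, {y}")
```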

Standard XVal experiment:

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="H:\tmp\lin.aml"/>
    </operator>
    <operator name="XValidation" class="XValidation" expanded="yes">
        <parameter key="create_complete_model" value="true"/>
        <parameter key="keep_example_set" value="true"/>
        <parameter key="number_of_validations" value="60"/>
        <parameter key="sampling_type" value="shuffled sampling"/>
        <operator name="LinearRegression" class="LinearRegression">
            <parameter key="feature_selection" value="none"/>
            <parameter key="keep_example_set" value="true"/>
        </operator>
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
                <list key="application_parameters"/>
            </operator>
            <operator name="Performance" class="Performance"/>
        </operator>
    </operator>
</operator>

When I increase number_of_validations, here is what happens:

no_of_val    rms_error

10                0.271 +- 0.040
20                0.258 +- 0.087
30                0.248 +- 0.117
40                0.252 +- 0.122
50                0.239 +- 0.140

I would have expected the error to stay about the same as the number of validations grows (since it is determined by the rand() noise), while its uncertainty decreases. Why is this not the case?

Re: Confused by the numerical XValidation output

Increasing the number of validations means that you divide your data set into more parts. Since only one part is used for testing and the regression is learned on all the other parts, each fold trains on more examples: with 50 examples, 10-fold validation trains on 45 of them, while 50-fold validation trains on 49. Because the dependency was constructed linearly, the linear regression captures it more accurately with more training data, and the error on the test set shrinks.
(This relies on a property of linear regression: on data with a linear dependency plus only random error, the fitted model converges to that dependency as the number of examples grows.)
But the test sets now shrink as well, so more of them happen to consist of particularly hard examples (giving a high error) and more of particularly easy ones (giving a low error). The fold-to-fold variance of the error therefore increases, which is why the standard deviation of the RMS grows.
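This effect can be reproduced with a small simulation (a sketch using a plain-Python least-squares fit in place of RapidMiner's LinearRegression operator; the exact numbers depend on the random seed, but the trend matches the table above: mean RMS tends to drop with k while its fold-to-fold spread grows):

```python
import random
import statistics

def fit(train):
    # Ordinary least squares for a single feature: y = a*x + b
    xs = [x for x, _ in train]
    mx = statistics.mean(xs)
    my = statistics.mean(y for _, y in train)
    a = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def kfold_rmse(data, k, rng):
    # Shuffled k-fold cross-validation; returns the per-fold RMS errors
    data = data[:]
    rng.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        a, b = fit(train)
        mse = statistics.mean((y - (a * x + b)) ** 2 for x, y in test)
        errors.append(mse ** 0.5)
    return errors

rng = random.Random(0)
data = [(x, 2 * x + rng.random() - 0.5) for x in range(50)]
for k in (10, 20, 50):
    e = kfold_rmse(data, k, rng)
    print(f"k={k:2d}  rms = {statistics.mean(e):.3f} +- {statistics.pstdev(e):.3f}")
```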
