Confused by the numerical XValidation output

Legacy User (Member)
edited November 2019 in Help
Hi,

Here is my question this time: why does the RMS error printed by XValidation decrease with the number of validations?

Here is a simple example:

Data set:

X,Y
0, 0.18224201
1, 2.002307783
2, 4.187028114
...
49, 98.21944595

(this is simply Y = 2*X + rand() - 0.5)
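
For reference, the dataset can be regenerated with a few lines of Python/NumPy. This is a sketch: the seed is arbitrary, and rand() - 0.5 is assumed to be uniform noise on [-0.5, 0.5), as produced by a spreadsheet rand().

import numpy as np

rng = np.random.default_rng(seed=0)          # seed is arbitrary
x = np.arange(50)
y = 2 * x + rng.uniform(-0.5, 0.5, size=50)  # Y = 2*X + rand() - 0.5

print("X,Y")
for xi, yi in zip(x, y):
    print(f"{xi},{yi}")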


Standard XVal experiment:

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="H:\tmp\lin.aml"/>
    </operator>
    <operator name="XValidation" class="XValidation" expanded="yes">
        <parameter key="create_complete_model" value="true"/>
        <parameter key="keep_example_set" value="true"/>
        <parameter key="number_of_validations" value="60"/>
        <parameter key="sampling_type" value="shuffled sampling"/>
        <operator name="LinearRegression" class="LinearRegression">
            <parameter key="feature_selection" value="none"/>
            <parameter key="keep_example_set" value="true"/>
        </operator>
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="Performance" class="Performance">
            </operator>
        </operator>
    </operator>
</operator>
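
For readers without RapidMiner, here is a rough scikit-learn equivalent of the setup above (a sketch: KFold with shuffle=True stands in for shuffled sampling, and the mean and standard deviation of the per-fold RMS values mimic what XValidation reports; exact numbers will differ from the table below):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def xval_rms(X, y, n_folds, seed=0):
    """Mean and standard deviation of the per-fold RMS errors."""
    rms = []
    for train, test in KFold(n_splits=n_folds, shuffle=True,
                             random_state=seed).split(X):
        model = LinearRegression().fit(X[train], y[train])
        pred = model.predict(X[test])
        rms.append(np.sqrt(mean_squared_error(y[test], pred)))
    return np.mean(rms), np.std(rms)

# Rebuild the dataset and sweep the number of folds.
rng = np.random.default_rng(seed=0)
X = np.arange(50, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + rng.uniform(-0.5, 0.5, size=50)

for k in (10, 20, 30, 40, 50):
    mean, std = xval_rms(X, y, k)
    print(f"{k:2d} folds: {mean:.3f} +- {std:.3f}")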

When I increase number_of_validations, here is what happens:

no_of_val    rms_error
10           0.271 +- 0.040
20           0.258 +- 0.087
30           0.248 +- 0.117
40           0.252 +- 0.122
50           0.239 +- 0.140

I would expect that, as the number of validations increases, the error would stay about the same (because it is determined by the rand() noise) while its uncertainty decreases. Why does the opposite happen?

Thanks!

Answers

  • land (RapidMiner Certified Analyst, RapidMiner Certified Expert)
    Hi,
    increasing the number of validations means that you divide your dataset into more parts. Since only one part is used for testing and the regression is learned on all the others, you increase the number of training examples per fold. And because the dependency is linear by construction, the linear regression captures it more accurately, so the error on the test set shrinks.
    (This relies on the property of linear regression that, on data constructed with a linear dependency plus purely random noise, it converges to that dependency as the number of examples grows.)
    But the test sets now shrink in size, so there are more folds consisting of particularly hard examples, causing a high error, and more folds of particularly easy examples with a low error. The variance of the per-fold error therefore increases, which is why the standard deviation of the RMS goes up. (A small simulation sketch of the first point follows this reply.)


    Greetings,
      Sebastian
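
A small simulation sketch of the first point above, in Python/NumPy (assumptions: the line is fit by ordinary least squares via np.polyfit, the x values are drawn uniformly from [0, 50) rather than taken from the fixed grid, and the training sizes correspond roughly to 2, 10, and 50 folds on 50 rows):

import numpy as np

rng = np.random.default_rng(seed=0)

def heldout_rms(n_train, n_test=10_000, reps=500):
    """Average held-out RMS of a least-squares line fit on n_train points."""
    errs = []
    for _ in range(reps):
        x_tr = rng.uniform(0, 50, n_train)
        y_tr = 2 * x_tr + rng.uniform(-0.5, 0.5, n_train)
        slope, intercept = np.polyfit(x_tr, y_tr, deg=1)  # least-squares fit
        x_te = rng.uniform(0, 50, n_test)
        y_te = 2 * x_te + rng.uniform(-0.5, 0.5, n_test)
        errs.append(np.sqrt(np.mean((slope * x_te + intercept - y_te) ** 2)))
    return np.mean(errs)

for n in (25, 45, 49):  # training sizes for 2, 10, and 50 folds on 50 rows
    print(f"n_train={n}  held-out RMS={heldout_rms(n):.4f}")

The rising standard deviation is the flip side: with 50 folds on 50 rows, each test set holds a single example, so each fold's "RMS" is just one absolute residual, and those vary a lot from fold to fold.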