Help with workaround for Tools.handleAverages

figfig Member Posts: 4 Contributor I
edited November 2018 in Help
Hi,

It seems that IteratingPerformanceAverage does not handle nested averages properly, as demonstrated by the following process (which is a toy example of 2x2 Cross Validation):

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="number_of_attributes" value="20"/>
        <parameter key="target_function" value="random"/>
    </operator>
    <operator name="IteratingPerformanceAverage" class="IteratingPerformanceAverage" expanded="yes">
        <parameter key="iterations" value="2"/>
        <operator name="XValidation" class="XValidation" expanded="yes">
            <parameter key="number_of_validations" value="2"/>
            <parameter key="sampling_type" value="shuffled sampling"/>
            <operator name="LinearRegression" class="LinearRegression">
            </operator>
            <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier" class="ModelApplier">
                </operator>
                <operator name="RegressionPerformance" class="RegressionPerformance">
                    <parameter key="absolute_error" value="true"/>
                    <parameter key="main_criterion" value="absolute_error"/>
                </operator>
                <operator name="ProcessLog" class="ProcessLog">
                    <list key="log">
                      <parameter key="run" value="operator.XValidation.value.applycount"/>
                      <parameter key="fold" value="operator.XValidation.value.iteration"/>
                      <parameter key="error" value="operator.RegressionPerformance.value.absolute_error"/>
                    </list>
                </operator>
            </operator>
        </operator>
    </operator>
</operator>
After running the experiment the process log shows:
run  fold  error
1    0     0.249
1    1     0.278
2    0     0.359
2    1     0.278

The average of the first run (first two folds) is 0.264, the average of the second run (last two folds) is 0.319, and the overall average is 0.291. However, the performance vector returned by IteratingPerformanceAverage reports a value of 0.282.

This happens in the outer call to Tools.handleAverages (from IteratingPerformanceAverage.apply). The first average vector is the average of the first run, with a value of 0.264 and an average count of 2. When the second average vector (from the second run, with value 0.319) is folded in by the call to Averagable.buildAverage, it is treated as having an average count of only 1, whereas it should carry the same weight as the first average vector. The result is the weighted average (2*0.264 + 1*0.319)/3 = 0.282, which is the incorrect value reported.
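
To make the arithmetic easy to check, here is a small stand-alone Python sketch that reproduces both numbers from the fold errors in the log. It is not RapidMiner code: the weighting in `reported` is only my model of the behaviour described above (first run counted twice, second run counted once), and all variable names are made up for illustration.

```python
# Fold errors copied from the process log, grouped by run.
runs = [[0.249, 0.278], [0.359, 0.278]]


def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


# Per-run averages, as produced by the inner XValidation.
run_means = [mean(r) for r in runs]

# What the outer average should be: every run has the same weight.
expected = mean(run_means)

# Assumed model of the faulty step: the first run's average keeps its
# average count (2 folds), while the second run is folded in with count 1.
first_count = len(runs[0])
reported = (first_count * run_means[0] + 1 * run_means[1]) / (first_count + 1)

print(f"expected: {expected:.3f}")
print(f"reported: {reported:.3f}")
```

Under these assumptions the sketch gives roughly 0.291 for the equally weighted average and 0.282 for the skewed one, matching the values above.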

Can anyone suggest how to work around this?

I am thinking that in Tools.handleAverages, when the first average vector is inserted, its average count should be set to 1.
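
A rough Python sketch of what I mean, using a toy accumulator rather than the real classes. `RunningAverage`, `fold_in` and `outer_average` are hypothetical names, and the update rule is only my assumption about how the incoming vector is weighted; it is not taken from the RapidMiner source.

```python
class RunningAverage:
    """Toy stand-in for an averaged performance value (not RapidMiner's class)."""

    def __init__(self, value, count=1):
        self.value = value
        self.count = count

    def fold_in(self, other_value):
        # Assumption: the incoming value is weighted as a single observation.
        self.value = (self.count * self.value + other_value) / (self.count + 1)
        self.count += 1


def outer_average(run_means, first_count):
    """Average the per-run means, starting from the first one with the given count."""
    acc = RunningAverage(run_means[0], first_count)
    for m in run_means[1:]:
        acc.fold_in(m)
    return acc.value


# Per-run means computed from the fold errors in the log.
run_means = [(0.249 + 0.278) / 2, (0.359 + 0.278) / 2]

# Current behaviour as I understand it: the first vector keeps its count of 2.
skewed = outer_average(run_means, first_count=2)

# Proposed workaround: reset the first vector's count to 1 on insertion.
equal = outer_average(run_means, first_count=1)

print(f"skewed: {skewed:.3f}, equal: {equal:.3f}")
```

In this toy model, starting from a count of 1 makes the accumulator a plain mean over runs, which is what the outer average should be when every run has the same number of folds.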

Any help will be greatly appreciated.

Answers

  • figfig Member Posts: 4 Contributor I
    Forgot to mention...
    This is a follow-up to an earlier post: http://rapid-i.com/rapidforum/index.php/topic,554.0.html.

    Cheers,
    A
  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    thanks for your hint. Because of your detailed description, I was able to find the bug relatively quickly. If you check out a version from CVS, it is already fixed.

    Greetings,
      Sebastian
  • figfig Member Posts: 4 Contributor I
    Yes, it works now.  Thank you so much!