RapidMiner

Help with workaround for Tools.handleAverages

fig
Contributor II

Hi,

It seems that IteratingPerformanceAverage does not handle nested averages properly, as demonstrated by the following process (which is a toy example of 2x2 Cross Validation):

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="number_of_attributes" value="20"/>
        <parameter key="target_function" value="random"/>
    </operator>
    <operator name="IteratingPerformanceAverage" class="IteratingPerformanceAverage" expanded="yes">
        <parameter key="iterations" value="2"/>
        <operator name="XValidation" class="XValidation" expanded="yes">
            <parameter key="number_of_validations" value="2"/>
            <parameter key="sampling_type" value="shuffled sampling"/>
            <operator name="LinearRegression" class="LinearRegression">
            </operator>
            <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier" class="ModelApplier">
                </operator>
                <operator name="RegressionPerformance" class="RegressionPerformance">
                    <parameter key="absolute_error" value="true"/>
                    <parameter key="main_criterion" value="absolute_error"/>
                </operator>
                <operator name="ProcessLog" class="ProcessLog">
                    <list key="log">
                      <parameter key="run" value="operator.XValidation.value.applycount"/>
                      <parameter key="fold" value="operator.XValidation.value.iteration"/>
                      <parameter key="error" value="operator.RegressionPerformance.value.absolute_error"/>
                    </list>
                </operator>
            </operator>
        </operator>
    </operator>
</operator>


After running the experiment the process log shows:
[tt]
run  fold  error
1    0     0.249
1    1     0.278
2    0     0.359
2    1     0.278
[/tt]

The average of the first run (first two folds) is 0.264, that of the second run (last two folds) is 0.319, and the overall average is 0.291. However, the performance vector returned from IteratingPerformanceAverage shows the value as 0.282.

This happens in Tools.handleAverages (the outer call, from IteratingPerformanceAverage.apply). The first average vector is the average from the first run, with a value of 0.264 and an average count of 2. However, when the second average vector (from the second run, with value 0.319) is folded in by the call to Averagable.buildAverage, it is treated as having an average count of only 1, whereas it should carry the same weight as the first average vector. The resulting weighted average, (2*0.264 + 1*0.319)/3, gives the incorrect reported value of 0.282.
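To make the arithmetic easy to check, here is a small standalone Python sketch. It is not RapidMiner code; it only recomputes the numbers from the process log, and the variable names are my own.

```python
# Standalone illustration of the arithmetic only; this is not RapidMiner code.
# Fold errors copied from the process log (run 1, then run 2).
runs = [[0.249, 0.278], [0.359, 0.278]]

# Per-run averages, as produced by each cross-validation run.
run_means = [sum(folds) / len(folds) for folds in runs]

# Expected: every run contributes equally to the outer average.
expected = sum(run_means) / len(run_means)

# Observed: the first run's average keeps its count of 2,
# while the second run's average is folded in with a count of 1.
counts = [len(runs[0]), 1]
observed = sum(c * m for c, m in zip(counts, run_means)) / sum(counts)

print(f"expected {expected:.3f}, observed {observed:.3f}")
```

The two values differ only in the weights, which is what points at the average count of the first vector.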

Can anyone suggest how to work around this?

I am thinking that in Tools.handleAverages, when the first average vector is inserted, its average count should be set to 1.
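A rough Python model of that idea, to show the effect on the numbers. Everything here is hypothetical: `Avg`, `outer_average`, and `reset_first` are names I made up, and the model only follows the behaviour described in this post, not the actual RapidMiner source.

```python
from dataclasses import dataclass


@dataclass
class Avg:
    """Hypothetical stand-in for an averaged value: a running total plus a count."""
    total: float
    count: int

    @property
    def mean(self) -> float:
        return self.total / self.count


def outer_average(inner, reset_first):
    """Average the inner averages, modelling the behaviour described in this post.

    reset_first=False: the first inner average keeps its own count.
    reset_first=True:  the first inner average is inserted with a count of 1
                       (the proposed workaround).
    """
    first = inner[0]
    acc = Avg(first.mean, 1) if reset_first else Avg(first.total, first.count)
    for later in inner[1:]:
        # Assumption: each later inner average is folded in with a weight of 1.
        acc = Avg(acc.total + later.mean, acc.count + 1)
    return acc.mean


# Inner averages built from the fold errors in the process log.
runs = [[0.249, 0.278], [0.359, 0.278]]
inner = [Avg(sum(folds), len(folds)) for folds in runs]

print(f"current behaviour: {outer_average(inner, reset_first=False):.3f}")
print(f"with count reset:  {outer_average(inner, reset_first=True):.3f}")
```

In this model, resetting the count makes both runs weigh the same, which gives the overall average I would expect.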

Any help will be greatly appreciated.
3 REPLIES
fig
Contributor II

Re: Help with workaround for Tools.handleAverages

Forgot to mention...
This is a follow-up to an earlier post: http://rapid-i.com/rapidforum/index.php/topic,554.0.html

Cheers,
A
Elite

Re: Help with workaround for Tools.handleAverages

Hi,
Thanks for your hint. Because of your detailed description, I was able to find the bug relatively quickly. If you check out a version from CVS, it is already fixed.

Greetings,
  Sebastian

fig
Contributor II

Re: Help with workaround for Tools.handleAverages

Yes, it works now.  Thank you so much!