RM 9.1 feedback : Auto-Model /Calculation of the Std-dev of the performance

lionelderkrikor · December 2018

Hi,

There is an inconsistency between the standard deviation of the performance delivered by the Performance Average (Robust) operator:

and the standard deviation of the logged performance (inside the CV) calculated via the Aggregate operator:

Image: https://us.v-cdn.net/6030995/uploads/editor/lk/enm1ohh0p6wh.png

We can see that the average of the performance is the same in both cases.

How can this difference in results be explained?

Regards,
Lionel

NB: The process is in the attached file.

IngoRM · December 2018

Hi Lionel,

Sorry, you are right, I was a bit on the wrong track here :-) And thanks for your persistence, since this indeed looks like the std dev calculation in the new average operator is broken. After all, it does not seem too robust then :-) We will have a look into this asap.

Best,

Ingo

IngoRM · December 2018

Hi Lionel,

Ok, we checked a bit deeper here. The difference in the numbers (0.044 vs. 0.05) is a result of the Bessel correction which is performed as part of the std dev calculation in Aggregation but not when the std dev is calculated on the performance vectors. Here is more on the Bessel correction in case you are interested in the details: https://en.wikipedia.org/wiki/Bessel's_correction

If you are familiar with Excel, the two functions for this are "STDEV.P" vs. "STDEV.S". It is important to notice that there is no real "right" or "wrong" here, although I typically would apply the correction (or, in Excel-speak, use the function STDEV.S).

The reason why the Aggregate calculation performs the correction (i.e. using N-1 as the denominator instead of N) is that the Aggregate function works on a data table, which is typically a sample of the complete population. Therefore, following this logic, the correction should be applied.
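
To make the N vs. N-1 difference concrete, here is a minimal Python sketch using the standard library. The performance values are made up for illustration and are not the ones from the attached process. statistics.pstdev divides by N (like Excel's STDEV.P), and statistics.stdev divides by N-1 (like STDEV.S).

```python
import math
import statistics

# Hypothetical per-run performances; NOT the values from the attached process.
performances = [0.80, 0.85, 0.90, 0.75, 0.95]
n = len(performances)

# Denominator N: no Bessel correction (comparable to Excel's STDEV.P).
population_sd = statistics.pstdev(performances)

# Denominator N - 1: Bessel correction applied (comparable to Excel's STDEV.S).
sample_sd = statistics.stdev(performances)

# The two results always differ by the factor sqrt(N / (N - 1)).
correction_factor = math.sqrt(n / (n - 1))
assert math.isclose(sample_sd, population_sd * correction_factor)

print(f"without correction (N):   {population_sd:.4f}")
print(f"with correction (N - 1):  {sample_sd:.4f}")
```

Because the factor sqrt(N / (N - 1)) shrinks toward 1 as N grows, the gap is most visible when only a handful of performance values are averaged.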

You could argue that the same holds true for the averaging of the performance vectors (which I could easily agree with). However, the original implementation assumed that the values which are averaged are not a sample but the complete known population, which I can also follow to some degree.

By the way, this whole phenomenon can also be observed if you run a cross validation and compare the std dev there to the one calculated by yourself or by Excel. Depending on which function you use, i.e. if you apply the correction or not, you would either get the result from the cross-validation or from the aggregate operator.

This is a tough one to be honest. I see arguments for both sides and I am somewhat inclined to change the calculation of the cross validation to a version where the Bessel correction is applied. But as I said, I can also see the argument for the other side where it should not.

At the end I would like to add that the cross validation operator (and the other validation loop operators) have been around for about 15 years now and nobody ever wanted us to apply the Bessel correction so far. This could be a pointer that either (a) nobody cared or (b) some people did care but agreed that the correction may not need to be applied here. In any case the differences are typically relatively small anyway.

So here you have the reason but where to go from here? What do you think we should do? And others? I would like to understand your views a bit better before I would push this into the product management process. After all, the validation operators are pretty central and changing their behavior should not be easily done...

Best,
Ingo

IngoRM · December 2018

Hi Lionel,

This is really just the result of the different methods we are using inside and outside of the optimization. Inside, we use a full 3-fold cross validation. Outside we use a multiple hold-out set approach with a robust average calculation. So this would be a bit of an apples to oranges comparison. What is more important is that you can compare the same validation methods across the different model types with each other, e.g. the outer multiple hold-out set validation from model A with model B.

Hope this helps,
Ingo

lionelderkrikor · December 2018

Hi @IngoRM,

Thanks for your answer... but...
Allow me to insist, and to be clearer this time.
In the process I shared, I compare the results of the same validation method (the multiple hold-out set validation) obtained via 2 different methods of calculation:
- the first is the result provided by the Performance Average (Robust) operator
- the second is the average of the logged performance(s) inside the Performance for Hold-Out Sets (Loop) operator, calculated via the Aggregate operator after removing the 2 outliers via the Filter Example Range operator.
From my point of view, the results of these 2 different methods of calculation must be strictly equal (this is the case for the average of the performance, but not for its standard deviation).
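
The comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the scores are made up, the helper name robust_average is hypothetical, and the 2 outliers are assumed to be the lowest and the highest value (the thread does not say how they are chosen). It is not RapidMiner's implementation.

```python
import statistics

def robust_average(values, trim=1):
    """Drop the `trim` lowest and `trim` highest values, then summarise the rest.

    Hypothetical helper: assumes the outliers are the extremes of the list.
    """
    if len(values) - 2 * trim < 2:
        raise ValueError("need at least 2 values left after trimming")
    kept = sorted(values)[trim:len(values) - trim]
    return {
        "mean": statistics.mean(kept),
        "sd_population": statistics.pstdev(kept),  # denominator N
        "sd_sample": statistics.stdev(kept),       # denominator N - 1
    }

# Hypothetical hold-out performances; NOT the values from the attached process.
scores = [0.70, 0.82, 0.84, 0.86, 0.88, 0.90, 0.99]
result = robust_average(scores)

for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Both standard deviations are computed on the same trimmed values and share the same mean, so any remaining difference between them comes only from the choice of denominator.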

I sincerely hope you will take some time to look at this process and the associated results, because I think
there is something weird in the standard deviation of the performance associated with the multiple hold-out set validation.

Regards,

Lionel

NB: I must admit that I did not express myself correctly in my first post. Indeed, I conflated the "multiple hold-out set validation"
with the "cross-validation": I think that misled you...

lionelderkrikor · December 2018

Hi,

You're welcome, Ingo. Thank you for spending time to take a look at this.

Regards,

Lionel

lionelderkrikor · December 2018

Hi Ingo,

Interesting topic, it reminds me of the statistics courses at my engineering school...

I agree with you, it's a difficult choice: there are relevant arguments on both sides.
Personally, the first argument that came to my mind is that the continuity of the method of calculation of the performance must be ensured.
Indeed, for example, in order to compare the past performances of a model (created many years ago) to future version(s) of this same model,
the method of calculation of the performance must be the same in order to compare "apples with apples".
Since RapidMiner has always calculated the std dev of the performance associated with cross-validation without the Bessel correction, I would be tempted to keep this method...

I hope I have contributed a little to the debate,

Regards,

Lionel
