RM 9.1 feedback: AutoModel / Calculation of the Stddev of the performance
lionelderkrikor
Moderator, RapidMiner Certified Analyst, Member Posts: 731 Unicorn
Hi,
There is an inconsistency between the standard deviation of the performance delivered by the Performance Average (Robust) operator:
and the standard deviation of the logged performances (inside the CV) calculated via the Aggregate operator:
We can see that the average of the performance is the same in both cases.
How can this difference in results be explained?
Regards,
Lionel
NB: The process is in the attached file.
Best Answers

IngoRM Posts: 1,619 RM Founder

Hi Lionel,

Sorry, you are right, I was a bit on the wrong track here :) And thanks for your persistence, since this indeed looks like the std dev calculation in the new average operator is broken. After all, it does not seem too robust then :) We will have a look into this asap.

Best,
Ingo

IngoRM Posts: 1,619 RM Founder

Hi Lionel,

Ok, we checked a bit deeper here. The difference in the numbers (0.044 vs. 0.05) is a result of the Bessel correction, which is performed as part of the std dev calculation in Aggregate but not when the std dev is calculated on the performance vectors. Here is more on the Bessel correction in case you are interested in the details: https://en.wikipedia.org/wiki/Bessel's_correction

If you are familiar with Excel, the two functions for this are "STDEV.P" vs. "STDEV.S". It is important to notice that there is no real "right" or "wrong" here, although I typically would apply the correction (or, in Excel-speak, use the function STDEV.S).

The reason why the Aggregate calculation performs the correction (i.e. using N-1 as the denominator instead of N) is that the Aggregate function works on a data table which is typically a sample of the complete population. Following this logic, the correction should be applied.

You could argue that the same could hold true for the averaging of the performance vectors (which I could easily agree with). However, the original implementation assumed that the values being averaged are not a sample but the complete known population. Which I can also follow to some degree.

By the way, this whole phenomenon can also be observed if you run a cross validation and compare the std dev there to the one calculated by yourself or by Excel. Depending on whether you apply the correction or not, you will get either the result from the cross validation or the one from the Aggregate operator.

This is a tough one, to be honest. I see arguments for both sides, and I am somewhat inclined to change the calculation of the cross validation to a version where the Bessel correction is applied. But, as I said, I can also see the argument for the other side where it should not be.

At the end I would like to add that the cross validation operator (and the other validation loop operators) have been around for about 15 years now, and nobody has ever asked us to apply the Bessel correction so far. This could be a pointer that either (a) nobody cared or (b) some people did care but agreed that the correction may not need to be applied here. In any case, the differences are typically relatively small anyway.

So here you have the reason, but where do we go from here? What do you think we should do? And others? I would like to understand your views a bit better before I push this into the product management process. After all, the validation operators are pretty central, and changing their behavior should not be done lightly...

Best,
Ingo
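To make the difference concrete, here is a small sketch comparing the two denominators Ingo describes. The fold accuracies below are made-up illustrative values, not numbers from Lionel's process:

```python
import statistics

# Hypothetical per-fold performance values, as logged inside a validation loop.
fold_accuracies = [0.82, 0.79, 0.85, 0.81, 0.88]

n = len(fold_accuracies)
mean = sum(fold_accuracies) / n

# Population std dev (denominator N) -- the uncorrected calculation used on
# the performance vectors, per the explanation above (Excel: STDEV.P).
stdev_p = (sum((x - mean) ** 2 for x in fold_accuracies) / n) ** 0.5

# Sample std dev with Bessel correction (denominator N-1) -- what the
# Aggregate operator computes (Excel: STDEV.S).
stdev_s = (sum((x - mean) ** 2 for x in fold_accuracies) / (n - 1)) ** 0.5

# The statistics module exposes both conventions directly.
assert abs(statistics.pstdev(fold_accuracies) - stdev_p) < 1e-12
assert abs(statistics.stdev(fold_accuracies) - stdev_s) < 1e-12

print(f"mean={mean:.4f}  stdev (N)={stdev_p:.4f}  stdev (N-1)={stdev_s:.4f}")
```

The mean is identical under both conventions; only the standard deviations differ, which mirrors the 0.044 vs. 0.05 discrepancy in the thread. The gap shrinks as the number of folds grows, since the two denominators differ by a factor of sqrt(N/(N-1)).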
Answers
Ingo,
Thanks for answering me... but...
Allow me to insist and to be clearer.
In the process I shared, I compare the results of the same validation method (the multiple hold-out set validation) via two different methods of calculation:
- the first is the result provided by the Performance Average (Robust) operator;
- the second is the result of calculating the average (after removing the 2 outliers via the Filter Example Range operator) with the Aggregate operator on the performances logged inside the Performance for HoldOut Sets (Loop) operator.
From my point of view, the results of these two different methods of calculation must be strictly equal (this is the case for the average of the performance, but not for its standard deviation).
I sincerely hope you take some time to look at this process and the associated results, because I think
there is something weird in the standard deviation of the performance associated with the multiple hold-out set validation.
Regards,
Lionel
NB: I must admit that I did not express myself correctly in my first post. Indeed, I assimilated the "multiple hold-out set validation"
to the "cross validation": I think that misled you...
You're welcome, Ingo. Thank you for taking the time to look at this.
Regards,
Lionel
Interesting topic, it reminds me of the statistics courses at my engineering school...
I agree with you, it's a difficult choice: there are relevant arguments on both sides.
Personally, the first argument that came to my mind is that the continuity of the method of calculating the performance must be ensured.
Indeed, in order to compare the past performance of a model (created many years ago) to future versions of that same model, for example,
the method of calculating the performance must stay the same, so that we compare "apples with apples".
Since RapidMiner has historically calculated the stddev of the performance associated with cross validation without the Bessel correction, I would be tempted to keep this method...
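For what it's worth, comparability would not be lost even if the convention changed: the two conventions differ only by a deterministic factor of sqrt(N/(N-1)), so an old uncorrected value can always be converted. A minimal sketch (the function name is my own, not a RapidMiner operator):

```python
import statistics

def pstdev_to_stdev(pstdev: float, n: int) -> float:
    """Apply the Bessel correction to a population std dev
    computed over n values (denominator n -> denominator n-1)."""
    return pstdev * (n / (n - 1)) ** 0.5

# Sanity check: converting the uncorrected std dev recovers the corrected one.
vals = [1.0, 2.0, 3.0, 4.0]
corrected = pstdev_to_stdev(statistics.pstdev(vals), len(vals))
assert abs(corrected - statistics.stdev(vals)) < 1e-12
```

So the choice between the two is less about losing old benchmarks and more about which convention is statistically appropriate for performance estimates.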
I hope I've added a little to the debate,
Regards,
Lionel