T-Test for performance comparison
I have a question about the standard T-Test operator (not the one from the Statistics extension).
How exactly does it compare performance vectors? Does the result depend on the main performance criterion only, or does it compare all available performance metrics at once (that is, whole vectors rather than a single value)?
I am asking because I cannot get anything but 1.000 in the significance matrix for different algorithms and settings, given that they are all evaluated on the same dataset. I've tried different models (GLM, tree models, deep learning, etc.), and the result is always the same. Does that mean there is actually no statistically significant difference between them?
Another concern: can I use the T-Test to compare performance across the folds of a cross-validation, or does that not make sense at all? I am doing 10-fold validation and storing the performance of each fold for later analysis and comparison. Here's what I get:
Significance matrix (shows accuracy values by default)
Performance metrics for each fold performance
Same metrics on graph
I can change the settings of a learner and get a much worse F-score with higher variance, yet the significance matrix stays the same in that case too.
It's hard to tell visually whether the results are actually close enough or whether there is some difference (for example, the F-score deviates within a visible interval). But all the 1.000 values confuse me a bit. At what point should a difference start to show up as significant? Or maybe I am doing something fundamentally wrong here?
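To make clear what I would expect such a test to do, here is a minimal sketch in Python (not RapidMiner) of a paired t-test over per-fold accuracies. The fold values are made up for illustration, `paired_t_statistic` is my own hypothetical helper, and I am assuming a two-sided test at the 5% level with 10 folds (9 degrees of freedom, critical value 2.262). I do not know whether the T-Test operator computes it this way.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired t-test on two samples."""
    if len(a) != len(b):
        raise ValueError("both samples must have the same number of folds")
    if len(a) < 2:
        raise ValueError("need at least two folds")
    # Per-fold differences: fold i of model A minus fold i of model B.
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical per-fold accuracies from the same 10 folds (not my real results).
model_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.81, 0.79]
model_b = [0.76, 0.77, 0.79, 0.75, 0.78, 0.74, 0.80, 0.77, 0.76, 0.75]

# Two-sided 5% critical value of Student's t with 9 degrees of freedom.
T_CRIT_DF9 = 2.262

t, df = paired_t_statistic(model_a, model_b)
significant = abs(t) > T_CRIT_DF9
print(f"t = {t:.3f}, df = {df}, significant at 5%: {significant}")
```

With numbers like these, where one model is a few points better on every fold, I would expect a clearly significant result. I am aware that folds share training data, so a test like this may be optimistic, but it is the kind of comparison I have in mind.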