# PerformanceVector

Hi,

I have some questions about the values stored in a PerformanceVector, since I couldn't find any info in the documentation. This one is taken from the sample 11_LeaveOneOut:

```
PerformanceVector [
-----accuracy: 77.50% +/- 14.58% (mikro: 77.50%)
ConfusionMatrix:
True:   bad   good
bad:    13    8
good:   1     18
-----precision: 96.00% +/- 8.00% (mikro: 94.74%) (positive class: good)
ConfusionMatrix:
True:   bad   good
bad:    13    8
good:   1     18
-----recall: 70.00% +/- 21.91% (mikro: 69.23%) (positive class: good)
ConfusionMatrix:
True:   bad   good
bad:    13    8
good:   1     18
-----AUC: 0.760 +/- 0.200 (mikro: 0.760) (positive class: good)
]
```

The confusion matrix by itself is clear. Also the computation of the "class recall" and "class precision" (not shown above, but in RapidMiner) is obvious. My questions are:

1) What do the +/- values behind accuracy, precision and recall mean? Like the 14.58% in

   accuracy: 77.50% +/- 14.58%

   Is this a confidence interval, i.e. the interval in which the real value can lie with respect to the estimated value? If so, what confidence level is used here? Does this have something to do with the average and standard deviation?

2) How do you calculate the total precision (96%) and recall (70%)? I didn't get it.

3) What is the meaning of the mikro values for the two parameters from 2)?

Cheers,

benjamin

## Answers

1.) The value after the +/- is the standard deviation of the performance values. In LOO validation, those values are often higher than, for example, for a cross-validation estimate.

2.) It's not a "total" precision but the precision for the class defined to be positive (stated in parentheses, here: "good"). The precision for the class good in this example is calculated as 18 / (18 + 1). The recall is calculated as 18 / (18 + 8).
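Working through these ratios on the pooled confusion matrix from the sample output reproduces the mikro values exactly; the 96% and 70% headline numbers are instead averages over the validation runs. A small sketch in Python, with the counts copied from the matrix in the original post:

```python
# Counts from the sample confusion matrix (positive class: "good"):
#
#        True: bad  good
#   pred bad:   13    8
#   pred good:   1   18

tp = 18  # predicted good, actually good
fp = 1   # predicted good, actually bad
fn = 8   # predicted bad, actually good

precision = tp / (tp + fp)  # 18 / 19
recall = tp / (tp + fn)     # 18 / 26

print(f"precision: {precision:.2%}")  # 94.74% -> the "mikro" precision
print(f"recall: {recall:.2%}")        # 69.23% -> the "mikro" recall
```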

3.) That's a bit harder to explain (but you will find out if you search for "micro and macro average" on Google). Just a short note: the macro average is the average of the performance values of the k runs of a cross validation; the micro average is calculated once from the pooled predictions of all runs.
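As a sketch of the difference, here is macro vs. micro averaging of precision over a few folds. The fold counts below are invented for illustration; only the averaging scheme mirrors what the PerformanceVector reports:

```python
# Hypothetical per-fold counts for the positive class:
# (true positives, false positives) in each cross-validation fold.
folds = [
    (5, 0),
    (4, 1),
    (9, 0),
]

# Macro average: compute precision per fold, then average the values.
per_fold = [tp / (tp + fp) for tp, fp in folds]
macro = sum(per_fold) / len(per_fold)

# Micro average: pool the counts of all folds, then compute once.
tp_total = sum(tp for tp, _ in folds)
fp_total = sum(fp for _, fp in folds)
micro = tp_total / (tp_total + fp_total)

print(f"macro: {macro:.2%}")  # average of 100%, 80%, 100% -> 93.33%
print(f"micro: {micro:.2%}")  # 18 / 19 -> 94.74%
```

The two numbers differ because the macro average weights every fold equally, while the micro average weights every single prediction equally.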

Cheers,

Ingo

I'm sorry, but I can't understand your 3rd answer. Can you explain a little bit more?

I mean, I have this problem: I tried a FeatureSelection template for a NaiveBayes and other learners. I checked the ClassificationPerformance operator showing the accuracy of the best attribute subset, as I did for ProcessLog, plotting it for every generation. So I filtered my original example set with an AttributeWeightSelection and passed it to an XValidation with the same options I used in the FeatureSelection template to obtain the model (create_complete_mode checked). But the two accuracies (FeatureSelection template and XValidation alone) are different! How is that possible?

Thank you for your time and help!

Fosgene

The XValidation uses random numbers to divide the example set into k bins. If you don't use the same local random seed, it will result in different sets and thus in different learning results.

This effect is especially noticeable if you have only small training sets.

Greetings,

Sebastian

Thank you again.

Fosgene

A seed of -1 enables the use of the global random generator. Every random generator generates a deterministic sequence of random numbers. If both operators use the same generator, they will consume different parts of this sequence. If each of them uses its own random generator, initialised with the same random seed, then they consume the same part of the same sequence.
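A minimal sketch of this behaviour, using Python's `random` module rather than RapidMiner, just to illustrate the principle:

```python
import random

# Two separate generators with the same seed produce the same
# sequence, so both "operators" would see identical splits.
a = random.Random(1982)
b = random.Random(1982)
print(a.sample(range(10), 5) == b.sample(range(10), 5))  # True

# One shared generator: the second call continues where the first
# left off, so the two consumers see different random numbers
# (typically different samples).
shared = random.Random(1982)
first = shared.sample(range(10), 5)
second = shared.sample(range(10), 5)
print(first, second)
```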

Oh dear. This is a difficult problem. Hmm. Let me think about it: You could use... 5. Or 10. Or 1518175. But I suggest 1982, because it's my year of birth. To be honest: you could use any number you can imagine, but use the same number twice.

And before it comes to your mind:

It's not good style to optimize your local random seed for better performance.

Greetings,

Sebastian

Thank thee for the good laugh,

Steffen

Hi all,

I would like to ask you something regarding the first post of benjamin, concretely about question 1) on the values behind the +/-. Ingo answered that these are the standard deviations of the performance values. I am really sorry, but I don't get how the computation of the standard deviation is done. I have tried it myself but I don't get the same values. Could you please explain in a little more detail how you compute the standard deviation (the +/- value) for the accuracy?

Thanks a lot in advance,

slv.

Just compute the standard deviation as usual (it can be found on Wikipedia, for example). If I remember correctly, we use one degree of freedom for the calculation, i.e. the sample standard deviation with an n - 1 denominator.
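As a sketch, assuming hypothetical per-fold accuracies (invented values whose mean happens to match the 77.50% from the sample output), the +/- value would be computed like this:

```python
import math

# Invented per-fold accuracies; only the formula mirrors the
# description above (sample standard deviation, n - 1 denominator).
fold_accuracies = [0.70, 0.85, 0.90, 0.65, 0.775]

n = len(fold_accuracies)
mean = sum(fold_accuracies) / n
variance = sum((x - mean) ** 2 for x in fold_accuracies) / (n - 1)
stdev = math.sqrt(variance)

print(f"{mean:.2%} +/- {stdev:.2%}")  # 77.50% +/- 10.31%
```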

Greetings,

Sebastian