# "balanced v unbalanced accuracy"

i have a balanced dataset (50/50) for classification

say i can achieve 80% accuracy on the classification

the real dataset is unbalanced (10/90). is there a way to determine the accuracy on the unbalanced dataset from the balanced one?

thanks

neil

## Answers

You can obtain only an estimate, not a precise value, of the accuracy you are looking for.

However, the accuracy achieved on the balanced test dataset is not enough to estimate the accuracy on your 10/90 unbalanced dataset. You need the recalls (sensitivities) of the two classes, obtained when testing the model on the balanced dataset (RM displays these recalls together with the confusion matrix and your 0.8 accuracy). Let's say r1 = 0.85 and r2 = 0.75 are the recalls for classes C1 and C2 respectively.

If the class ratio in the unbalanced dataset is 10/90, then a good estimate of the accuracy for this dataset is r1 * 0.1 + r2 * 0.9. That is, your estimate should be quite close to the recall r2 of the dominant class. Concretely, in this example the estimated accuracy is 0.85 * 0.1 + 0.75 * 0.9 = 0.76.
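As a rough illustration of the estimate above, the following Python sketch derives the per-class recalls from a confusion matrix and weights them by the real class proportions. The confusion-matrix counts (100 test examples per class) and the helper names `recalls_from_confusion` and `estimate_accuracy` are assumptions made for this example, chosen only so that the recalls come out to r1 = 0.85 and r2 = 0.75; they are not RM output.

```python
def recalls_from_confusion(cm):
    """Per-class recall, where cm[i][j] counts true class i predicted as class j."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


def estimate_accuracy(recalls, class_weights):
    """Weight each class recall by that class's proportion in the target data."""
    if abs(sum(class_weights) - 1.0) > 1e-9:
        raise ValueError("class proportions must sum to 1")
    return sum(r * w for r, w in zip(recalls, class_weights))


# Hypothetical balanced test set: 100 examples of C1 and 100 of C2.
cm = [
    [85, 15],  # true C1: 85 classified correctly, 15 misclassified
    [25, 75],  # true C2: 25 misclassified, 75 classified correctly
]

recalls = recalls_from_confusion(cm)  # r1 = 0.85, r2 = 0.75
estimate = estimate_accuracy(recalls, [0.1, 0.9])  # real class ratio 10/90

print(round(estimate, 4))  # 0.76
```

The same function accepts any class proportions, so the estimate can be recomputed if the real class ratio turns out to differ from 10/90.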

Some minimal, common-sense requirements should be met for this estimation to work: the examples of class C1 in the training dataset the model was built from should be representative of the examples of class C1 in the real dataset; the same should hold for class C2; and both requirements should also hold for the balanced test dataset, not only the training dataset. All of these can be achieved through appropriate sampling. The training and test datasets should also be large enough and ideally disjoint. These requirements are expected anyway when training and testing a model. Note that I did not simply say the training and test datasets should be representative of the real dataset, because balanced datasets change the class distribution compared with the real dataset.

The idea behind this estimation is that if you were to compute the two recalls x1 and x2 of classes C1 and C2 on various test datasets using the same model (assuming the requirements above are met and your model is appropriate), the values of x1 and x2 would vary little from their means, because their variances are small. So you can approximate x1 with the r1 = 0.85 and x2 with the r2 = 0.75 you already know. Then you apply the general formula A = x1 * w1 + x2 * w2, where A is the model accuracy on a generic test dataset, w1 and w2 are the proportions of classes C1 and C2 in that dataset, and x1 and x2 are the class recalls for that dataset and model. In your case w1 and w2 are known to be 0.1 and 0.9, and x1 and x2 are approximated as above.
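The general formula follows directly from the definition of accuracy. As a short derivation (the symbols N, n_i and c_i are introduced here only for illustration): let N be the number of test examples, n_i the number that belong to class C_i, and c_i the number of those that are classified correctly.

```latex
A = \frac{c_1 + c_2}{N}
  = \frac{n_1}{N}\cdot\frac{c_1}{n_1} + \frac{n_2}{N}\cdot\frac{c_2}{n_2}
  = w_1 x_1 + w_2 x_2,
\qquad w_i = \frac{n_i}{N}, \quad x_i = \frac{c_i}{n_i}.
```

Because each recall x_i is computed within a single class, changing the class mix changes only w1 and w2, provided the requirements above hold.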

Finally, note that if you apply the general formula above to your balanced test dataset (that is, with equal class proportions w1 = w2 = 0.5), you should get the accuracy for your balanced dataset: 0.85 * 0.5 + 0.75 * 0.5 = 0.8.
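This consistency check is easy to run. A minimal sketch, assuming the example recalls from this post and a hypothetical helper `weighted_accuracy` that is not part of RM:

```python
def weighted_accuracy(r1, r2, w1):
    """Accuracy A = r1*w1 + r2*w2 for two classes, with w2 = 1 - w1."""
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1]")
    return r1 * w1 + r2 * (1.0 - w1)


R1, R2 = 0.85, 0.75  # example recalls for C1 and C2 used in this post

balanced = weighted_accuracy(R1, R2, 0.5)    # 50/50 test dataset
unbalanced = weighted_accuracy(R1, R2, 0.1)  # 10/90 real dataset

print(f"balanced (50/50):   {balanced:.2f}")    # 0.80
print(f"unbalanced (10/90): {unbalanced:.2f}")  # 0.76
```

If the balanced figure did not reproduce the accuracy reported for the balanced test, that would suggest the recalls were read off incorrectly.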

Regards,

Dan

Also, I've noticed that you've taught a course on data mining for sociology. I've been invited to do something similar here in Britain, and I'm looking for ideas on tailored content for such a course.

Could you suggest some tailored content & specific applications you found interesting? If that's OK with you, perhaps we can discuss this via email. Thanks.

Dan

Actually, my course was on EDA and data mining for journalism, but some of the ideas might carry over to sociology.

Haven't done much on Sociology applications, but I'd be happy to brainstorm with you one weekend. Shoot me a private message, and we can go from there.

Regards,

Neil
