"Training classifier (SVM or Logit) with unbalanced training data"

noah977 · January 2009

Post removed.

This was answered previously, but I missed the answer.

The EquallabelWeighting operator solves the problem.
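For readers outside RapidMiner, here is a rough sketch of the idea behind equal label weighting. It assumes the operator gives each example a weight so that every class ends up with the same total weight. The function name `equal_label_weights` and the toy labels are made up for illustration; this is not RapidMiner's implementation.

```python
from collections import Counter


def equal_label_weights(labels):
    """Return one weight per example so that each class has the same total weight.

    Hypothetical helper: the overall weight equals the number of examples
    and is split evenly across the classes.
    """
    counts = Counter(labels)
    per_class_total = len(labels) / len(counts)
    return [per_class_total / counts[y] for y in labels]


# Made-up unbalanced label column: 2 "fail" cases and 8 "ok" cases.
labels = ["fail"] * 2 + ["ok"] * 8
weights = equal_label_weights(labels)

fail_total = sum(w for w, y in zip(weights, labels) if y == "fail")
ok_total = sum(w for w, y in zip(weights, labels) if y == "ok")
print(fail_total, ok_total)
```

A learner that supports example weights then sees both classes as equally important, even though the raw counts are skewed.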

steffen · January 2009

Hello Noah

noah977 wrote:

I've read that this makes sense because it is less error-prone (safer) for the classifier to simply mark everything negative with such unbalanced data.

The problem here is that you are interested in the confidence, not in the prediction. I guess you already know that the prediction is calculated by applying a threshold to the confidence (0.5 by default, but adjustable in RapidMiner; search for "threshold").
The confidence represents a scoring or ranking of your items. To estimate the quality of your classification, you can look at the ROC curve or the AUC value. The AUC, for instance, is not biased by class skew, whereas accuracy can easily be tricked this way (as you mentioned above).
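To make the difference between accuracy and AUC concrete, here is a small sketch in plain Python. The data and both "classifiers" are made up for illustration, and the helper names `accuracy` and `auc` are not RapidMiner functions.

```python
def accuracy(labels, confidences, threshold=0.5):
    """Fraction of items whose thresholded confidence matches the label."""
    predictions = [1 if c >= threshold else 0 for c in confidences]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(labels, confidences):
    """Chance that a random positive is ranked above a random negative.

    Ties count as half a win. This is the pairwise definition of AUC.
    """
    positives = [c for c, y in zip(confidences, labels) if y == 1]
    negatives = [c for c, y in zip(confidences, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up skewed data: 2 positives followed by 18 negatives.
labels = [1, 1] + [0] * 18

# Same low confidence for every item: nothing was learned.
constant = [0.1] * 20
# Both positives ranked on top, but no confidence reaches 0.5.
ranked = [0.4, 0.3] + [0.2] * 18

acc_constant = accuracy(labels, constant)
auc_constant = auc(labels, constant)
acc_ranked = accuracy(labels, ranked)
auc_ranked = auc(labels, ranked)
acc_ranked_low = accuracy(labels, ranked, threshold=0.25)

print(acc_constant, auc_constant)
print(acc_ranked, auc_ranked, acc_ranked_low)
```

With the default threshold both sets of confidences get the same accuracy, because everything is marked negative. Only the AUC shows that the second one ranks the items well, and lowering the threshold turns that ranking into correct predictions.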

To get the probabilities you mentioned: the calculated confidences are approximations of the probabilities. To improve these approximations, you can use calibration methods. As far as I know, the only method implemented is Platt scaling, which is the best calibration method for the output of SVM classifiers and a moderately good method for the output of other classifiers.
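As a rough illustration of what Platt scaling does, the sketch below fits a sigmoid p = 1 / (1 + exp(-(a * score + b))) to raw classifier scores by minimizing the log-loss. It is a simplified version under stated assumptions: plain gradient descent instead of the optimizer from Platt's paper, no smoothing of the targets, and made-up scores and labels. The names `fit_platt` and `sigmoid` are hypothetical, and this is not the RapidMiner operator's code.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log-loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of the log-loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Made-up raw scores (e.g. SVM decision values) and true labels, not separable.
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
calibrated = [sigmoid(a * s + b) for s in scores]

for s, p in zip(scores, calibrated):
    print(f"score {s:+.1f} -> calibrated confidence {p:.2f}")
```

The mapping is monotone when `a` is positive, so it changes the confidence values but not the ranking of the items. In practice the sigmoid should be fitted on data that was not used to train the classifier.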

Feel free to ask if something I explained is not clear.

Hope this was helpful.

kind regards,

Steffen

PS: Indeed, changing the true class distribution can hurt the performance.

noah977 · January 2009

Steffen,

Strangely, even with Platt scaling my results look incorrect. The system predicts EXACTLY THE SAME 62% confidence of failure for every case. That indicates that it didn't learn the model well.

steffen · January 2009

Hello Noah

As I said in a PM, Platt scaling can help. Maybe the classification algorithm used is simply not capable of learning the concept, or maybe the set is too small, so that Platt scaling over-adjusts the confidences. With the information I currently have about your situation, I cannot tell you more.

regards,

Steffen

PS: A quote that describes the situation of data miners perfectly:

A person who really understands data and analysis will understand all the pitfalls and limitations, and hence be constantly caveating what they say. Somebody who is simple, straightforward, and 100% certain usually has no idea what they are talking about.
