
"Training classifier (SVM or Logit) with unbalanced training data"

noah977 Member Posts: 32 Maven
edited May 2019 in Help
Post removed.

This was answered previously, but I missed the answer.

The EquallabelWeighting operator solves the problem.

Answers

  • steffen Member Posts: 347 Maven
    Hello Noah
    noah977 wrote:

    I've read that this makes sense because it is less error-prone (safer) for the classifier to simply mark everything negative with such unbalanced data.
    The problem here is that you are interested not in the prediction but in the confidence. As you probably know, the prediction is calculated by applying a threshold to the confidence (0.5 by default, but adjustable in RapidMiner; search for "threshold").
    The confidence represents a scoring or ranking of your items. To estimate the quality of your classification, you can look at the ROC curve or the AUC value. The AUC, for instance, is not biased by class skew, whereas accuracy can easily be tricked this way (as you mentioned above).

    To obtain the probabilities you mentioned: the calculated confidences are approximations of the probabilities. To improve these approximations, you can use calibration methods. As far as I know, the only method implemented is Platt scaling, which is the best calibration method for the output of SVM classifiers and a moderately good method for the output of other classifiers.

    Feel free to ask if anything I explained is unclear.

    Hope this was helpful.

    kind regards,

    Steffen

    PS: Indeed, changing the true class distribution can hurt performance.
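
The point about thresholds, accuracy and AUC can be illustrated with a small sketch. This is plain Python, not RapidMiner code; the toy data (5 positives, 95 negatives) and the helper names `predict`, `accuracy` and `roc_auc` are made up for illustration.

```python
# Toy illustration: accuracy is fooled by class skew, AUC is not.
# All data and function names here are hypothetical, not RapidMiner API.

def predict(confidences, threshold=0.5):
    """Turn confidences for the positive class into 0/1 predictions."""
    return [1 if c >= threshold else 0 for c in confidences]

def accuracy(labels, predictions):
    """Fraction of items whose prediction matches the label."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)

def roc_auc(labels, confidences):
    """Probability that a random positive is ranked above a random
    negative, counting ties as one half."""
    pos = [c for l, c in zip(labels, confidences) if l == 1]
    neg = [c for l, c in zip(labels, confidences) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Skewed toy data: 5 positives followed by 95 negatives.
labels = [1] * 5 + [0] * 95

# Model A gives every item the same low confidence (learned nothing).
constant = [0.1] * 100
# Model B ranks all positives above all negatives, but every
# confidence stays below the default threshold of 0.5.
ranked = [0.3] * 5 + [0.05] * 95

acc_constant = accuracy(labels, predict(constant))
acc_ranked = accuracy(labels, predict(ranked))
auc_constant = roc_auc(labels, constant)
auc_ranked = roc_auc(labels, ranked)
# Lowering the threshold lets model B actually find the positives.
acc_ranked_low = accuracy(labels, predict(ranked, threshold=0.2))

print(acc_constant, acc_ranked, auc_constant, auc_ranked, acc_ranked_low)
```

At the default threshold both toy models mark everything negative and get the same accuracy, while the AUC separates the model that ranks correctly from the one that learned nothing.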
  • noah977 Member Posts: 32 Maven
    Steffen,

    Strangely, even with the Platt scaling my results look incorrect. The system predicts EXACTLY THE SAME 62% confidence of failure for every case. That indicates that it didn't learn the model well.
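
For readers unfamiliar with the method, here is a minimal sketch of the idea behind Platt scaling: fit a sigmoid that maps raw classifier scores to probabilities. It is a simplified illustration, not RapidMiner's implementation: it uses plain gradient descent, omits the target smoothing from Platt's original method, and writes the sigmoid as `sigmoid(a * s + b)`. The function names and toy data are assumptions. The second half shows one way identical confidences can arise: if the underlying model emits the same score for every item, the fitted sigmoid can only return one value.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss.
    Simplified sketch: no target smoothing, fixed step size."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. z
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(scores, a, b):
    """Map raw scores to calibrated probabilities."""
    return [sigmoid(a * s + b) for s in scores]

# Case 1: informative (but not perfectly separating) toy scores.
scores = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
a, b = fit_platt(scores, labels)
probs = calibrate(scores, a, b)

# Case 2: the model emits the same score for every item.
flat_scores = [0.7] * 10
flat_labels = [1, 1] + [0] * 8
fa, fb = fit_platt(flat_scores, flat_labels)
flat_probs = calibrate(flat_scores, fa, fb)

print([round(p, 3) for p in probs])
print([round(p, 3) for p in flat_probs])
```

In the second case every calibrated value is identical and ends up close to the fraction of positives in the calibration set, so calibration cannot repair a model whose scores carry no ranking information.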
  • steffen Member Posts: 347 Maven
    Hello Noah

    As I said in a PM, Platt scaling can help. Maybe the classification algorithm used is simply not capable of learning the concept, or maybe the set is too small, so that Platt scaling over-adjusts the confidences. With the information I currently have about your situation, I cannot tell you more.

    regards,

    Steffen

    PS: A quote that describes the situation of data miners perfectly :)

    A person who really understands data and analysis will understand all the pitfalls and limitations, and hence be constantly caveating what they say. Somebody who is simple, straightforward, and 100% certain usually has no idea what they are talking about.