"Balanced sampling for network training?"

chaosbringer Member Posts: 21 Contributor II
edited May 2019 in Help
I have a very imbalanced sample set, e.g. 99% true and 1% false. Is it reasonable to select a balanced subset with a 50/50 distribution for neural network training? My reasoning is that training on the original dataset may bias the network towards the true samples.
Can you suggest some literature that covers this topic, especially for neural networks?

Thank you very much,


  • steffen Member Posts: 347 Maven
    Hi chaosbringer

    I recommend asking the question on http://stats.stackexchange.com/

    Although we try to answer general questions about data mining here, the number of experts with spare time is quite low. Nevertheless, it would be great if you posted the link to the question here (if you are going to ask there).



  • spitfire_ch Member Posts: 38 Maven
    I just stumbled over the answer to his question at said site:


    Dikran Marsupial:
    Yes, it is reasonable to select a balanced dataset; however, if you do, your model will probably over-predict the minority class in operation (or on the test set). This is easily overcome by using a threshold probability that is not 0.5. The best way to choose the new threshold is to optimise it on a validation sample that has the same class frequencies as encountered in operation (or in the test set).
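As a rough illustration of the threshold tuning described above, here is a minimal Python sketch. It is not part of the quoted answer. It assumes the minority class is encoded as 1, that `p_minority` holds the model's predicted minority-class probabilities on a validation sample with operational class frequencies, and that F1 on the minority class is the criterion to optimise (the answer does not name a criterion). The function names are hypothetical.

```python
import numpy as np

def f1_for_class(y_true, y_pred, positive=1):
    """F1 score of one class, computed from raw counts."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def choose_threshold(y_val, p_minority, thresholds=None):
    """Scan candidate thresholds and return (best_threshold, best_f1).

    y_val      : array of 0/1 labels, 1 = minority class (assumption)
    p_minority : predicted probability of the minority class
    """
    if thresholds is None:
        thresholds = np.linspace(0.01, 0.99, 99)
    scores = [
        f1_for_class(y_val, (p_minority >= t).astype(int), positive=1)
        for t in thresholds
    ]
    best = int(np.argmax(scores))  # first threshold reaching the best score
    return float(thresholds[best]), float(scores[best])
```

The chosen threshold then replaces 0.5 when the model is applied in operation. Any other criterion (for example a cost-weighted error) could be swapped in for F1.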

    Rather than re-sample the data, a better thing to do would be to give different weights to the positive and negative examples in the training criterion. This has the advantage that you use all of the available training data. The reason that a class imbalance leads to difficulties is not the imbalance per se. It is more that you just don't have enough examples from the minority class to adequately represent its underlying distribution. Therefore, if you resample rather than re-weight, you are solving the problem by making the distribution of the majority class badly represented as well.
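To make the re-weighting idea concrete, here is a small Python sketch, again not part of the quoted answer. It assumes 0/1 integer labels, a cross-entropy training criterion, and inverse-frequency class weights (a common heuristic, chosen here only for illustration; the answer does not prescribe how to pick the weights). The function names are hypothetical.

```python
import numpy as np

def inverse_frequency_weights(y):
    """Per-class weights n / (k * count), so each class has equal total weight."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def weighted_cross_entropy(y, p, class_weight, eps=1e-12):
    """Weighted binary cross-entropy, normalised by the sum of the weights.

    y            : array of 0/1 integer labels
    p            : predicted probability of class 1
    class_weight : dict mapping class label (0 or 1) to its weight
    """
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    w = np.where(y == 1, class_weight[1], class_weight[0])
    losses = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(np.sum(w * losses) / np.sum(w))
```

With 99 examples of one class and 1 of the other, the single minority example receives a weight of 50, which equals the combined weight of the 99 majority examples, while every example still contributes to training.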

    Some may advise simply using a different threshold rather than reweighting or resampling. The problem with that approach is that with an ANN the hidden layer units are optimised to minimise the training criterion, but the training criterion (e.g. sum-of-squares or cross-entropy) depends on the behaviour of the model away from the decision boundary rather than only near the decision boundary. As a result, hidden layer units may be assigned to tasks that reduce the value of the training criterion but do not help in accurate classification. Using re-weighted training patterns helps here, as it tends to focus attention more on the decision boundary, and so the allocation of hidden layer resources may be better.

    For references, a Google Scholar search for "Nitesh Chawla" would be a good start; he has done a fair amount of very solid work on this.

  • haddock Member Posts: 849 Maven
    There may be other answers as well..


    Now all we have to do is work out the right one, or whether there can be a right one  ;D
  • chaosbringer Member Posts: 21 Contributor II
    If I understand the post on stackexchange.com correctly, it suggests weighting the samples. I think the operator for this task is "Generate Weight (Stratification)" in RapidMiner.
    However, is there a way to weight the examples if the label is numeric? Is this the purpose of the operator "Generate Weight (LPR)"? I don't really understand the use of the operator from its description.

    Thank you very much.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    If your label is numeric, you don't have a classification problem, hence no classes and hence no class imbalance.

    If you have true and false, you have no numbers. If true and false are encoded as numbers, you will need to convert the attribute to a nominal one by applying Numerical to Binominal.

  • rakirk Member Posts: 29 Contributor II
    Weighting could help decrease the error rates. I'd be curious to see what you found. I have typically used ~2/3 training/total.

    A difference matrix may also be useful for preprocessing the data. Dimensionality reduction may increase the ability to discriminate between true/false.

    A couple of ideas; hopefully you find something that works.