RapidMiner

Regular Contributor

Balanced sampling for network training?

Hi,
I have a very imbalanced sample set, e.g. 99% true and 1% false. Is it reasonable to select a balanced subset with a 50/50 distribution for neural network training? The reason for this is that I suspect training on the original dataset may bias the model towards the true samples.
Can you suggest some literature that covers this topic, especially for neural networks?

Thank you very much,
chaosbringer
6 REPLIES
Regular Contributor

Re: Balanced sampling for network training?

Hi chaosbringer

I recommend asking the question on http://stats.stackexchange.com/

Although we do try to answer general questions about data mining here, the number of experts with spare time is quite low. Nevertheless, it would be great if you posted the link to the question here (if you are going to ask there).

greetings,

steffen

Regular Contributor

Re: Balanced sampling for network training?

I just stumbled over the answer to this question on said site:

http://stats.stackexchange.com/questions/6254/balanced-sampling-for-network-training

Dikran Marsupial:
Yes, it is reasonable to select a balanced dataset; however, if you do, your model will probably over-predict the minority class in operation (or on the test set). This is easily overcome by using a threshold probability that is not 0.5. The best way to choose the new threshold is to optimise on a validation sample that has the same class frequencies as encountered in operation (or in the test set).
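For intuition, here is a minimal sketch of the threshold-tuning step described above (Python/NumPy; the function name and toy data are illustrative, not from the thread). The candidate cut-off that maximises accuracy on a validation sample with operational class frequencies typically ends up above 0.5 when the model was trained on a balanced subset:

```python
import numpy as np

def best_threshold(probs, labels):
    """Pick the probability cut-off that maximises accuracy on a
    validation set whose class frequencies match operation.
    `probs` are predicted P(class = 1); `labels` are 0/1."""
    candidates = np.unique(probs)
    accs = [np.mean((probs >= t).astype(int) == labels) for t in candidates]
    return candidates[int(np.argmax(accs))]

# Toy validation sample: the minority class (1) is rare, and the model
# (trained on balanced data) over-predicts it, so the best cut-off
# ends up above the default of 0.5.
probs  = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,   0,   0,    0,   0,   0  ])
t = best_threshold(probs, labels)   # here: 0.8
```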

Rather than re-sampling the data, a better thing to do would be to give different weights to the positive and negative examples in the training criterion. This has the advantage that you use all of the available training data. The reason that a class imbalance leads to difficulties is not the imbalance per se; it is more that you just don't have enough examples from the minority class to adequately represent its underlying distribution. Therefore, if you resample rather than re-weight, you are solving the problem by making the distribution of the majority class badly represented as well.
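As a sketch of what re-weighting the training criterion looks like, here is a hypothetical weighted logistic regression in Python/NumPy (illustrative only, not from the thread; for a neural network the same per-example weights would simply multiply each pattern's loss term). Every example is kept, but each class contributes equally to the weighted cross-entropy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy data: 95 "false" examples around (-1, -1),
# only 5 "true" examples around (+1, +1).
X = np.vstack([rng.normal(-1.0, 0.5, (95, 2)),
               rng.normal(+1.0, 0.5, (5, 2))])
y = np.array([0] * 95 + [1] * 5)

# Per-example weights inversely proportional to class frequency,
# so both classes carry equal total weight in the criterion.
w = np.where(y == 1, len(y) / (2 * (y == 1).sum()),
                     len(y) / (2 * (y == 0).sum()))

theta, b = np.zeros(2), 0.0
for _ in range(2000):                           # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(X @ theta + b)))  # predicted P(class = 1)
    g = w * (p - y)                             # gradient of weighted loss
    theta -= 0.1 * (X.T @ g) / len(y)
    b     -= 0.1 * g.sum() / len(y)

# After training, a point near the positive cluster scores above 0.5
# even though positives are only 5% of the training data.
p_pos = 1.0 / (1.0 + np.exp(-(np.array([1.0, 1.0]) @ theta + b)))
p_neg = 1.0 / (1.0 + np.exp(-(np.array([-1.0, -1.0]) @ theta + b)))
```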

Some may advise simply using a different threshold rather than reweighting or resampling. The problem with that approach is that with an ANN the hidden layer units are optimised to minimise the training criterion, but the training criterion (e.g. sum-of-squares or cross-entropy) depends on the behaviour of the model away from the decision boundary rather than only near the decision boundary. As a result, hidden layer units may be assigned to tasks that reduce the value of the training criterion but do not help in accurate classification. Using re-weighted training patterns helps here, as it tends to focus attention more on the decision boundary, and so the allocation of hidden layer resources may be better.

For references, a Google Scholar search for "Nitesh Chawla" would be a good start; he has done a fair amount of very solid work on this.


Regular Contributor

Re: Balanced sampling for network training?

There may be other answers as well:

http://www.google.fr/search?q=imbalanced+neural+network

Now all we have to do is work out the right one, or whether there can be a right one  ;D
Regular Contributor

Re: Balanced sampling for network training?

Hi,
if I understand the post on stackexchange.com correctly, it is suggested to weight the samples. I think the operator for this task in RapidMiner is "Generate Weights (Stratified)".
However, is there a way of weighting if the label is numeric? Is this the purpose of the operator "Generate Weight (LPR)"? I don't really understand the use of that operator from its description.

Thank you very much.
Elite

Re: Balanced sampling for network training?

Hi,
if your label is numeric, you don't have a classification, and hence no classes, and hence no class imbalance.

If you have true and false, you have no numbers. If true and false are encoded by numbers, you will need to turn the attributes into nominal ones by applying Numerical to Binominal.

Greetings,
  Sebastian
Old World Computing - Establishing the Future


Regular Contributor

Re: Balanced sampling for network training?

Weighting could help decrease the error rates. I'd be curious to see what you find. I have typically used a training set of roughly 2/3 of the total data.

A difference matrix may also be useful for preprocessing the data. Dimensionality reduction may increase the ability to discriminate between true and false.

A couple of ideas; hopefully you find something that works.