I have a very imbalanced sample set, e.g. 99% true and 1% false. Is it reasonable to select a balanced subset with a 50/50 distribution for neural network training? My reasoning is that training on the original dataset may bias the model towards the true samples.
Can you suggest some literature that covers this topic, especially for neural networks?
Although we try to take general questions about data mining here, the number of experts with time is quite low. Nevertheless, it would be great if you post a link to the question here (if you are going to ask there).
Yes, it is reasonable to select a balanced dataset; however, if you do, your model will probably over-predict the minority class in operation (or on the test set). This is easily overcome by using a threshold probability that is not 0.5. The best way to choose the new threshold is to optimise it on a validation sample that has the same class frequencies as encountered in operation (or on the test set).
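A minimal sketch of that threshold search (the helper names and the choice of balanced accuracy as the selection metric are my own, not from the answer; the idea is just to sweep candidate thresholds on a validation sample with realistic class frequencies):

```python
def balanced_accuracy(preds, labels):
    # Mean of per-class recall; unlike plain accuracy, it is not
    # dominated by the majority class.
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return 0.5 * (tp / pos + tn / neg)

def choose_threshold(probs, labels, metric=balanced_accuracy):
    """Sweep thresholds and keep the one that maximises `metric`
    on a validation sample whose class frequencies match operation."""
    best_t, best_score = 0.5, float("-inf")
    for t in (i / 100 for i in range(1, 100)):
        preds = [1 if p >= t else 0 for p in probs]
        score = metric(preds, labels)
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```

The chosen threshold typically ends up above 0.5 for the over-predicted minority class, compensating for the artificial 50/50 training distribution.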
Rather than re-sample the data, a better approach is to give different weights to the positive and negative examples in the training criterion. This has the advantage that you use all of the available training data. The reason that a class imbalance leads to difficulties is not the imbalance per se; it is more that you simply don't have enough examples from the minority class to adequately represent its underlying distribution. Therefore if you resample rather than re-weight, you are solving the problem by making the distribution of the majority class badly represented as well.
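As an illustration, here is a class-weighted binary cross-entropy in plain Python (the inverse-frequency weighting heuristic `w_c = N / (2 * N_c)` is a common convention I am assuming here, not something prescribed in the answer):

```python
import math

def class_weights(labels):
    """Inverse-frequency weights: each class contributes equally
    to the training criterion regardless of how many examples it has."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return n / (2 * n_pos), n / (2 * n_neg)

def weighted_cross_entropy(probs, labels, w_pos, w_neg):
    """Binary cross-entropy where each example's loss is scaled by its
    class weight, so the minority class is not swamped by the majority."""
    total = 0.0
    for p, y in zip(probs, labels):
        w = w_pos if y == 1 else w_neg
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)
```

With a 9:1 imbalance this gives the rare class a weight of 5.0 and the common class 5/9, so one minority example counts as much towards the criterion as nine majority examples.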
Some may advise simply using a different threshold rather than reweighting or resampling. The problem with that approach is that with an ANN the hidden layer units are optimised to minimise the training criterion, but the training criterion (e.g. sum-of-squares or cross-entropy) depends on the behaviour of the model away from the decision boundary, not only near it. As a result, hidden layer units may be assigned to tasks that reduce the value of the training criterion but do not help in accurate classification. Using re-weighted training patterns helps here, as it tends to focus attention more on the decision boundary, so the allocation of hidden layer resources may be better.
For references, a Google Scholar search for "Nitesh Chawla" would be a good start; he has done a fair amount of very solid work on this.
If I understand the post on stackexchange.com correctly, it is suggested to weight the samples. I think the operator for this task in RapidMiner is "Generate Weights (Stratified)".
However, is there a way to weight the samples if the label is numeric? Is this the purpose of the operator "Generate Weight (LPR)"? I don't really understand its use from the operator's description.