# Sampling highly unbalanced data

Hello guys,

I have a highly unbalanced data set that I'd like to use to build a model, and I have a question about where to place the Sample operator that balances the data. Should I put it before my X-Validation and then add another Apply Model and Performance after the X-Validation, so that the model is applied to the entire data set instead of just the sample (the performance reported by the X-Validation only covers the sample)? Or should I put the Sample inside the training part of the X-Validation?

Thanks in advance!

## Answers

RM Data Scientist

Hello vdhaxel,

The question is a very good one, and the answer might depend on your actual problem. I usually put it only on the training side of the cross-validation, because I want to reduce the bias of the learner but not change the performance measure. Note that you then need either a performance measure that is independent of the class balance, or a test set with the same class balance as in your application.
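To make the "training side only" idea concrete, here is a minimal sketch in plain Python rather than a RapidMiner process. The function names, the toy labels, and the choice of stratified folds with random downsampling are all assumptions made for illustration; the Sample operator may balance the data differently.

```python
import random
from collections import Counter, defaultdict


def downsample(indices, labels, rng):
    """Randomly drop examples until every class is as small as the rarest one."""
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    smallest = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, smallest))
    return sorted(balanced)


def folds_with_training_side_sampling(labels, k=5, seed=0):
    """Yield (train, test) index lists; only the training part is balanced."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    # Deal each class round-robin into k folds so every fold keeps the class ratio.
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        # Balance the training part only; the test part stays untouched.
        yield downsample(train, labels, rng), sorted(fold)


# Toy data (hypothetical): 90 negatives, 10 positives.
labels = ["neg"] * 90 + ["pos"] * 10
splits = list(folds_with_training_side_sampling(labels))
for train, test in splits:
    print(Counter(labels[i] for i in train), Counter(labels[i] for i in test))
```

Each training part ends up with equally many examples per class, while each test part keeps the original 9:1 ratio, so the performance is measured on data that looks like the unsampled set.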

Have you considered using weights? Generate Weight (Stratification) is a very handy operator for your problem.

~Martin

Dortmund, Germany

Contributor I

Hello Martin,

Thank you very much for your answer. I found out, however, that if I put the Sample operator before my X-Validation and then use Apply Model and Performance again AFTER the X-Validation (to apply my model, a Logistic Regression, to the entire data set instead of just the sample), I get the same accuracy, confusion matrix, etc. as if I had not used X-Validation at all (e.g. if I replace X-Validation with Logistic Regression). So I assume this is an incorrect way of doing it?

About the Generate Weight operator: how exactly should I use it? Can I use it with logistic regression? Where do I put this operator? Sorry, I'm relatively new to RapidMiner.

Thanks again!

RM Data Scientist

Hi,

To your first question: this seems somewhat odd. Can you post the XML of an example process so I can have a look?

On the weights: you can simply put the operator in front of the X-Validation. It creates a new column with weights so that the sum of the weights is equal for every class. I would recommend changing the sum of weights to a bigger number for numerical stability. Not all learners support weighted examples. You can check whether a learner supports weights by clicking on it and pressing F1; the help then lists the supported types. The standard RM LogReg cannot handle weights. The Weka one (W-Logistic) can. As another idea, you can use a linear SVM: SVM and SVM (Linear) can handle weights and might give better results.
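As a rough illustration of what "equal sum of weights per class" means, the following plain-Python sketch computes such weights by hand. It is an assumption about the arithmetic, not the operator's actual implementation, and the names `stratification_weights` and `total_weight` are hypothetical.

```python
from collections import Counter


def stratification_weights(labels, total_weight=1000.0):
    """Give each example a weight so that every class sums to the same share."""
    counts = Counter(labels)
    # Each class receives an equal share of the total weight ...
    share_per_class = total_weight / len(counts)
    # ... and that share is split evenly among the examples of the class.
    return [share_per_class / counts[y] for y in labels]


# Toy data (hypothetical): 90 negatives, 10 positives.
labels = ["neg"] * 90 + ["pos"] * 10
weights = stratification_weights(labels)

sum_neg = sum(w for w, y in zip(weights, labels) if y == "neg")
sum_pos = sum(w for w, y in zip(weights, labels) if y == "pos")
print(sum_neg, sum_pos)  # both class sums come out at about half of total_weight
```

Here every rare-class example gets nine times the weight of a frequent-class example, so a learner that supports weights treats both classes as equally important without any rows being dropped. A larger total simply scales all weights up, which keeps the individual values from becoming very small.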

You might check two of my recent blog posts on my personal blog on the topic of weights:

http://data-analytics.ghost.io/rapidminer-quick-tip-generate-weight/

and

http://data-analytics.ghost.io/take-care-of-your-weights/

Best,

Martin

Contributor I

Thank you, I've sent you a private message!