How to distinguish unbalanced data in RM and what's the solution?

Rena_2013 · April 2020

How can I recognize unbalanced data in RM, and what is the solution (e.g., which operators should I use and how should I set their parameters)?

rfuentealba · April 2020

Hello,

Let's see. I normally do very didactic posts to explain everything. If such a thing is too basic for you, don't get offended.

Let's say you are training an algorithm to recognize emojis and you show it the following ones:

The algorithm will recognize the smiles (4) and the winks (4) but not the blushes (2), because the data is not balanced. So, now you know what to look for.

To recognize it, just open your data in "Results" and check the Statistics for the label. In the Titanic example, the label has two values: "Yes" with 349 examples and "No" with 567 examples.
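If you prefer to see that check as numbers, here is a minimal Python sketch. It only takes the two label counts quoted for the Titanic example; the "imbalance ratio" is just an illustrative measure I'm assuming here, not something RapidMiner reports under that name.

```python
# Label counts as read from the Statistics view in the Titanic example.
label_counts = {"Yes": 349, "No": 567}

total = sum(label_counts.values())
shares = {value: count / total for value, count in label_counts.items()}

# Ratio between the largest and the smallest class; 1.0 would be perfectly balanced.
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

for value, count in label_counts.items():
    print(f"{value}: {count} examples ({shares[value]:.1%})")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

The further the ratio is from 1, the more one class dominates the other.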

You have three simple strategies that can be used, but be forewarned: each one depends on what you want to do with your data. There are more, but they are variants of these.

Weighting means that you "consider" each smile once (0.1), each wink once (0.1) and each blush twice (0.2). So, what you consider is actually similar to having 4 smiles, 4 winks and 4 blushes (by considering them twice). This is done by passing the weights to the algorithm BUT... it needs to support weighting, not all algorithms do. Decision trees, for example, do support weighting by using a special operator for that.
Downsampling means that you make the classes equal by diminishing the samples to the required amount. It would be useful to show just one smile, one wink and one blush to the algorithm to understand. A pro is that you require less data to train your algorithms, which is good. A con is that if you have a baby smile, a young girl smile, a young boy smile and a grandmother's smile, you will have to choose which ones are more representative. Which one would you choose? That's difficult to know in advance, and you will end up losing data that would potentially be useful for your detection.
Upsampling means that you duplicate the two blushes artificially to make each class equal. The pro is that you don't have to lose data, and that you don't need special support. The cons are that you require more time to train your algorithms (because you'll have 12 faces and not 10) and that you are introducing data that wasn't there in the first place, so you need to be careful.
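To make the three strategies concrete, here is a rough Python sketch on the emoji example (4 smiles, 4 winks, 2 blushes). The function names are made up for illustration, and the weights are expressed as "count once" (1.0) and "count twice" (2.0). This is not how RapidMiner implements any of this, only the idea behind each strategy.

```python
import random
from collections import Counter

faces = ["smile"] * 4 + ["wink"] * 4 + ["blush"] * 2


def group_by_class(examples):
    """Collect the examples of each class in their own list."""
    groups = {}
    for label in examples:
        groups.setdefault(label, []).append(label)
    return groups


def class_weights(examples):
    """Weighting: give each class a weight so all classes count the same in total."""
    counts = Counter(examples)
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}


def downsample(examples, rng):
    """Downsampling: keep only as many examples per class as the smallest class has."""
    groups = group_by_class(examples)
    smallest = min(len(group) for group in groups.values())
    result = []
    for group in groups.values():
        result.extend(rng.sample(group, smallest))
    return result


def upsample(examples, rng):
    """Upsampling: duplicate random examples until every class matches the largest."""
    groups = group_by_class(examples)
    largest = max(len(group) for group in groups.values())
    result = []
    for group in groups.values():
        extra = [rng.choice(group) for _ in range(largest - len(group))]
        result.extend(group + extra)
    return result


rng = random.Random(0)  # fixed seed so the sketch is repeatable
weights = class_weights(faces)
down = downsample(faces, rng)
up = upsample(faces, rng)

print(weights)        # each blush counts twice
print(Counter(down))  # 2 faces per class, 6 in total
print(Counter(up))    # 4 faces per class, 12 in total
```

Notice that downsampling throws away four faces, while upsampling adds two copies that were never observed.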

Weighting can be done by a number of methods (by rule, by information gain, etc). For sampling, you need the Operator Toolbox and use SMOTE (Again, there are many other methods to do that with RapidMiner). Weighting is the safe bet, if you ask me. But again, this is done case by case.

Hope this helps,

Rod.

Rena_2013 · April 2020

Thanks for your reply! That's very helpful, since I am a new learner of RM. I downloaded a data set which contains 119,390 instances and 32 attributes, and I want to use classification for a supervised learning question. After I ran the process, someone told me that the data is unbalanced. In this case, would it be better to use downsampling, since the dataset is a little big? But I cannot find SMOTE in RM~

rfuentealba · April 2020

Hello,

You are using Naive Bayes, so it doesn't support weighting.

Now, nearly 120000 records with 32 attributes. How many of these attributes are nominals/polynominals? How many classes (the values that the label variable can have, you can check that in the Results panel too) do you have, and how imbalanced is the data?

I don't know what is the nature of the problem: are there correlated attributes that can be removed? That I would check first. And finally, do you have all your data there?

What would happen if you just took a stratified sample (using the operator "Sample (Stratified)")? You have enough examples for it to give you a decent amount of data to train a model. It's similar to downsampling in the sense that it gives you less data, but keep in mind that a stratified sample keeps the class proportions of the original data, so how well it works depends on how imbalanced your data is: if you have 997 examples of one class and 3 of another, it will simply not work well, but it just might do fine if you have 600 and 400.
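Here is a small Python sketch of why the 997/3 case goes wrong. It assumes that a stratified sample takes the same fraction from every class; the function name and the 10% fraction are only illustrative, and this is not the code behind the RapidMiner operator.

```python
import random
from collections import Counter


def stratified_sample(labels, fraction, rng):
    """Sample the same fraction from every class, so class proportions are kept."""
    groups = {}
    for label in labels:
        groups.setdefault(label, []).append(label)
    sample = []
    for group in groups.values():
        keep = round(len(group) * fraction)  # may round down to zero for tiny classes
        sample.extend(rng.sample(group, keep))
    return sample


rng = random.Random(42)
mild = ["a"] * 600 + ["b"] * 400
severe = ["a"] * 997 + ["b"] * 3

mild_sample = Counter(stratified_sample(mild, 0.1, rng))
severe_sample = Counter(stratified_sample(severe, 0.1, rng))

print(mild_sample)    # 60 of "a" and 40 of "b": still a usable mix
print(severe_sample)  # the three "b" examples vanish from a 10% sample
```

With 600 vs 400 the sample still contains plenty of both classes; with 997 vs 3 the minority class can disappear from the sample altogether.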

Finally, I would never downsample unless I know for sure that I have enough repeated attributes, which is something you can see only if you examine the data first. I would go with SMOTE instead, and that can be found in an extension named Operator Toolbox, it doesn't come with RapidMiner directly.

A lot of this is about understanding the data and what you want to accomplish. Let me know if any of these things is unclear and I'll do my best to explain.

All the best,

Rodrigo.

Rena_2013 · April 2020

Hi, Rodrigo,
I am not very clear about the meaning of "How many classes do I have", and I am sorry that I don't know how to judge how imbalanced the data is. The attachment is the dataset that I downloaded and am trying to deal with. I want to analyze whether any relevant attributes lead to repeated bookings (I set that as the label), and use the NB/KNN algorithms with cross-validation. But it seems that I made some technical mistakes (I was unable to balance the data). I tried weighting, but the software reported an error and I do not know how to deal with it~

Thanks for your kindness~

rfuentealba · April 2020

Let's go step by step.

Your dataset has a lot of columns. What is the column you want to predict? I just assumed you want to predict if certain reservation is canceled, right? So the column you want to predict is named is_canceled.

In RapidMiner, the column you want to predict is your Label. In other terminologies, the column you want to predict is your Class. But actually both concepts are the same.

id | type  | wheels | engine
----------------------------
 1 | moto  | 2      | yes
 2 | moto  | 2      | yes
 3 | bike  | 2      | no
 4 | car   | 4      | yes
 5 | ????? | 2      | yes

See the row that has ID=5? I don't know the type of vehicle but I know it has 2 wheels and an engine. That's what I want to predict, so that's my class or my label.

How many non-null unique values do I have on that column? That's the number of classes you have.

moto (2)
bike (1)
car (1)
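If it helps to see that count spelled out, here is a minimal Python sketch over the toy vehicle table. I use None to stand for the unknown type in row 5; that is just my way of marking the missing value.

```python
from collections import Counter

# The toy vehicle table; None marks the unknown type we want to predict.
rows = [
    {"id": 1, "type": "moto", "wheels": 2, "engine": "yes"},
    {"id": 2, "type": "moto", "wheels": 2, "engine": "yes"},
    {"id": 3, "type": "bike", "wheels": 2, "engine": "no"},
    {"id": 4, "type": "car", "wheels": 4, "engine": "yes"},
    {"id": 5, "type": None, "wheels": 2, "engine": "yes"},
]

# Count only the non-null values of the label column.
class_counts = Counter(row["type"] for row in rows if row["type"] is not None)

print(len(class_counts))   # number of classes
print(dict(class_counts))  # how many examples each class has
```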

You don't have to do a lot of convoluted things to find the number of values in your label. Just open your dataset (by importing it and double-clicking) and find this window:

Scroll to the right, and you will find something that says "Values". Click on "Details" and find this:

I have two classes. One called "No" and other called "Yes", and the absolute count on each one of these classes is not equal, so... this is highly imbalanced.

Now that you came all the way here... how many classes do you have and what is the value of each? I know the answer but I want you to do your work. It's the best way to learn.

Once you get this, we'll make one example of weighting, another of upsampling and another of downsampling.

All the best,

Rodrigo.

Marco_Barradas · April 2020

@Rena_2013 you could see this tutorial on RM academy that explains why and how you can handle imbalanced data
https://academy.rapidminer.com/learn/course/data-engineering-master/data-cleansing/cleansing
@rfuentealba is giving you great advice that will help you understand the whole problem, not only the fact that you have imbalanced data. With time you'll learn that understanding your data and your problem makes a lot of difference to the success of your solution.

Rena_2013 · April 2020

Hi, Rodrigo,
Thanks for your detailed explanation of how to judge imbalanced data in RM, I imported the data and eventually found the result:

Image: https://us.v-cdn.net/6030995/uploads/editor/db/exr9wyul95ux.png

Image: https://us.v-cdn.net/6030995/uploads/editor/7c/nce9eh37m3n7.png

It seems that the data for the label "is repeated" is imbalanced, am I right?

So, what measures should I take next to balance this data in RM? Thanks very much!

Best,
Rena

Rena_2013 · April 2020

Hi, MarcoBarradas,
I tried the measures from the demo on my dataset in RM, but an error always appeared. I think I must have made some mistakes.

lionelderkrikor · April 2020

Hi @Rena_2013,

Could you please share all your processes so that we can reproduce (and understand) what you observe?

Regards,

Lionel

rfuentealba · April 2020

Hello @Rena_2013,

So far you're good to go to the next step. I'll leave the following points as "pending for now", but keep them in mind:

You don't know if a column can uniquely identify a row. (e.g., if you have the ID of the person booking the hotel room).
You don't know if a column contains too many null values. (e.g., if the amount of people registering their company name is too low).
You don't know if a column is too stable to be a predictor. (e.g. if you have a column that has too many rows with the same value)
You don't know if a column can correlate to another column. (e.g. "Saturday" and "Sunday" will correlate to "Weekend")
You don't know if a date column can hide a date pattern. (more on this, probably later).
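The first three checks are easy to sketch in Python. The thresholds and the column names below are assumptions I picked for illustration (they are not taken from your dataset), so treat this as a rough idea of what to look for rather than a rule.

```python
from collections import Counter


def profile_column(values, id_threshold=0.95, null_threshold=0.5, stable_threshold=0.95):
    """Flag a non-empty column that looks like an ID, is mostly null, or barely varies."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    null_share = 1 - len(non_null) / total
    unique_share = len(set(non_null)) / total
    if non_null:
        most_common_share = Counter(non_null).most_common(1)[0][1] / total
    else:
        most_common_share = 0.0
    return {
        "looks_like_id": unique_share >= id_threshold,
        "too_many_nulls": null_share >= null_threshold,
        "too_stable": most_common_share >= stable_threshold,
    }


# Hypothetical columns, 100 rows each.
columns = {
    "booking_id": list(range(100)),              # every value is unique
    "company": [None] * 90 + ["ACME"] * 10,      # mostly missing
    "country": ["PRT"] * 98 + ["ESP"] * 2,       # almost always the same value
}

report = {name: profile_column(values) for name, values in columns.items()}
for name, flags in report.items():
    print(name, flags)
```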

We have a number of different approaches for sampling, depending on what we want, but none of the built-in ones gives us a "balanced" dataset. We should install two extensions:

The Mannheim RapidMiner Toolbox has an operator for Sample (Balanced).
The Operator Toolbox has an operator for SMOTE (upsampling)

What these operators do can be achieved by a combination of other operators, but it's time consuming and why should we do it if there is something that can do it for us? The next steps are going to the Marketplace, find these extensions, install these, restart RapidMiner and put the Sample (Balanced) operator. Do that just before your validation, configure to get a balanced sample of, let's say, 10000 examples. Put a breakpoint and see what happens.

Now your training set is balanced, but keep in mind that this doesn't mean all the data is being used to train your model. Downsampling has that as a drawback: some data can be left behind and you could end up with a model that isn't properly trained. Many people here are against it for these reasons. There is a way to check when it is safe to use downsampling, but this message is already getting a little long.

The process for upsampling is the same.

Once you come back with your results, I'll explain how to do weighting and what algorithms support it, is it ok?

All the best,

Rod.

rfuentealba · April 2020

Ok, I couldn't resist the temptation.

Let's say you have a trimmed-down version of the Titanic DataSet that holds the following columns.

Survived, Age, Gender, Passenger Class

We could say: "Let's split the age in 10 bins". Now we have:

2 values for "Survived", which is our label.
10 values for "Age"
2 values for "Gender"
3 values for "Passenger Class"

How many combinations can we make to cover all the possibilities for Age, Gender and Passenger Class? We should check this:

1, first, male
1, first, female
1, second, male
1, second, female
...

Tedious, and it's the same as if we multiply 10 * 2 * 3 = 60 possible combinations for 916 examples. Would it be possible to use downsampling? We cover all the combinations, right? riiight? Nope. We haven't considered that "Survived" has two possible values. So, if we make a stratified sample to 60 combinations it wouldn't catch everything. If we make a stratified sample to 120, we would "probably" catch everything but wouldn't have a good variety of data, and every combination would be probably distributed evenly.

If we downsample to 240 examples, our chances are a little bit better but still not by much. I would dare to say that 600 samples (which is 10 samples for each one of the 60 combinations) might give us a prediction that is good enough to train quickly (and downsampling is used for speed, not for accuracy).
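The counting above can be reproduced with a few lines of Python. The value names ("first", "second", "third", and so on) are only placeholders for the actual values in the dataset.

```python
from itertools import product

age_bins = range(1, 11)                       # 10 bins
passenger_classes = ["first", "second", "third"]
genders = ["male", "female"]
survived = ["yes", "no"]                      # the label

# Every combination of the attributes, without and with the label.
attribute_combinations = list(product(age_bins, passenger_classes, genders))
labelled_combinations = list(product(age_bins, passenger_classes, genders, survived))

print(len(attribute_combinations))  # 10 * 3 * 2
print(len(labelled_combinations))   # twice as many once the label is included

# Average number of examples available per combination for each sample size.
for sample_size in (60, 120, 240, 600):
    per_attributes = sample_size / len(attribute_combinations)
    per_labelled = sample_size / len(labelled_combinations)
    print(sample_size, per_attributes, per_labelled)
```

These are averages only: real data is never spread evenly over the combinations, which is exactly why small samples miss some of them.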

When is such a thing needed? When we have a low amount of possible combinations, a high amount of examples and we want to have an idea of what could be happening, but always keeping in mind that other strategies will provide us with much better decision power.

I normally use downsampling at the very early stage of any kind of model because it helps me figure out certain patterns, but of my large collection of models I think just 3 or 4 of the ones in production use downsampling as a technique for training (and I'm known for writing data science models using insane amounts of data).

Just some complementary information.

All the best,

Rod.

Rena_2013 · April 2020

Hi, Rodrigo,
Now I understand why downsampling and upsampling are not always appropriate in data preprocessing. I downloaded the extension in RM, found the Sample (Balanced) operator and tried it with the NB algorithm. The result came out a little better than without the Sample (Balanced) operator. But I think it is still not balanced enough, is it?

Image: https://us.v-cdn.net/6030995/uploads/editor/o2/kf7vegq6tbn5.png

Image: https://us.v-cdn.net/6030995/uploads/editor/yk/bnlqa9zt4vib.png

I also found the SMOTE (upsampling) operator, but I am not sure how to set the parameters~

Image: https://us.v-cdn.net/6030995/uploads/editor/pu/3uanwgmfmv7m.png

rfuentealba · April 2020

Hi @Rena_2013,

I'll focus on this image first which is one of the most important results you can have in terms of checking if something is going well or not. (there are others such as p-values, f1 score, AUC, ROC, etc... but for a newcomer this one is important).

This is how you read it:

Predicted 0 that are real 0 = 18662.
Predicted 1 that are real 1 = 10987.
Total predictions that are correct = 29649.

Compare that number with this one:

Predicted 1 that are real 0 = 69.
Predicted 0 that are real 1 = 129.
Total predictions that are erroneous = 198.

Your class precision and class recall values are very, very high. But the model predicted 18791 examples as 0 (false) and 11056 as 1 (true), and the true counts (18731 and 11116) tell the same story, meaning your data is not balanced.
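If you want to verify those numbers yourself, here is a minimal Python sketch that rebuilds the totals from the four cells of the matrix. The dictionary layout (predicted class first, true class second) is just my choice for the example.

```python
# confusion[predicted][true], with the four numbers read from the matrix.
confusion = {
    0: {0: 18662, 1: 129},
    1: {0: 69, 1: 10987},
}
classes = [0, 1]

total = sum(confusion[p][t] for p in classes for t in classes)
correct = sum(confusion[c][c] for c in classes)
accuracy = correct / total

# Row totals: how many examples were predicted as each class.
predicted_totals = {p: sum(confusion[p].values()) for p in classes}
# Column totals: how many examples truly belong to each class.
true_totals = {t: sum(confusion[p][t] for p in classes) for t in classes}

# Class precision uses the row total, class recall uses the column total.
precision = {c: confusion[c][c] / predicted_totals[c] for c in classes}
recall = {c: confusion[c][c] / true_totals[c] for c in classes}

print(f"accuracy: {accuracy:.4f}")
print("predicted totals:", predicted_totals)
print("true totals:", true_totals)
print("precision:", precision)
print("recall:", recall)
```

The accuracy tells you about the model; the totals per class tell you about the balance of the data. They are different questions.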

Check "allow downsampling" and play with the number of examples, from 5000 to 5000, to check how it varies.

For starters, I would think of leaving it as is and going back to check the other things I told you about before:

Unique values.
Stable values.
Null values.

You can play with SMOTE and check which one suits you better. I normally leave the values as is.
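For some intuition about what SMOTE does when it upsamples, here is a rough Python sketch of the textbook idea: pick a minority example, pick one of its nearest minority neighbors, and create a new point somewhere on the line between them. The function name, the parameter k and the sample points are illustrative assumptions; this is not the Operator Toolbox implementation and says nothing about its parameter names.

```python
import math
import random


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Needs at least two minority points, given as tuples of numbers.
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # A random position on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


rng = random.Random(1)
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)]
new_points = smote_like(minority, n_new=3, k=2, rng=rng)

for point in new_points:
    print(point)
```

The new points are not copies: they are invented values that lie between real minority examples, which is why you need to be careful with them.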

Rena_2013 · April 2020

Hi, Rodrigo,
Thanks very much, now I have balanced data. Thanks again for all your kind help!

Best,
Rena

keb1811 · July 2020

Hi @rfuentealba, I use the sample operator from the Mannheim toolbox, also for upsampling a minority class. Can you say whether it's realistic that my weighted mean precision/recall increases from around 35% up to 70%? I have a classification with 12 classes in my label attribute.

rfuentealba · July 2020

Hello @keb1811,

No, you are looking at two different things.

Though it's expected (not really, but it's a good sign) that your precision/recall goes up, that is something I would examine to learn how my model behaves with the data I gave it. It doesn't give any clues about how my data itself is distributed.

Data is balanced when all classes have equal amounts of data, e.g.

3 apples, 3 oranges, 3 bananas, 3 pineapples.

If you have:

3 apples, 3 oranges, 3 bananas, 9 pineapples.

That data is balanced only if you first do "pineapple" vs "non-pineapple" (9 vs 9) and then apply a second algorithm to the "non-pineapple" values (3 apples, 3 oranges, 3 bananas). But notice that I'm looking at how many examples I have in each class, not at the precision/recall.
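A minimal Python sketch of that two-step view, using only the fruit counts from the example (the helper name is_balanced is made up, and it uses the strict "all counts equal" definition):

```python
from collections import Counter

fruit = ["apple"] * 3 + ["orange"] * 3 + ["banana"] * 3 + ["pineapple"] * 9


def is_balanced(counts):
    """Balanced here means every class has exactly the same number of examples."""
    return len(set(counts.values())) == 1


# All four classes at once: 3, 3, 3 and 9.
all_classes = Counter(fruit)

# Step 1: pineapple vs everything else.
stage_one = Counter("pineapple" if f == "pineapple" else "non-pineapple" for f in fruit)

# Step 2: only the non-pineapple examples, split into their own classes.
stage_two = Counter(f for f in fruit if f != "pineapple")

print(dict(all_classes), is_balanced(all_classes))
print(dict(stage_one), is_balanced(stage_one))
print(dict(stage_two), is_balanced(stage_two))
```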

Why? No matter how balanced your data is (taking the same example), if you put in a rotten orange (which is green-ish), the algorithm can categorize it as an apple (also green-ish). No matter how balanced your data is, the precision/recall will tell you that, of all the predicted values, one was truly an orange and was predicted as an apple.

A better idea would be to take a look at "predicted vs true". If all the classes have the same quantities (or similar ones), then your data is balanced.

Can you provide the entire matrix?

All the best,

Rod.
