
How do you identify imbalanced data in RM, and what is the solution?

Rena_2013 Member Posts: 7 Contributor I
How do you identify imbalanced data in RM, and what is the solution (e.g., which operators should I use, and how should I set their parameters)?

Answers

    Rena_2013 Member Posts: 7 Contributor I
    Thanks for your reply! That's very helpful, since I am a new learner of RM. I downloaded a dataset which contains 119,390 instances and 32 attributes, and I want to use classification for a supervised learning question. After I ran the process, someone told me that the data is imbalanced. In this case, would it be better to use downsampling, since the dataset is a little big? But I cannot find SMOTE in RM~



    rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Hello,

    You are using Naive Bayes, which doesn't support weighting.

    Now, nearly 120,000 records with 32 attributes. How many of these attributes are nominal/polynominal? How many classes do you have (the values that the label variable can take; you can check that in the Results panel too), and how imbalanced is the data?

    I don't know the nature of the problem: are there correlated attributes that could be removed? That is what I would check first. And finally, do you have all your data there?

    What would happen if you just took a stratified sample (using the "Sample (Stratified)" operator)? You have enough examples for it to give you a decent amount of data to train a model. It's similar to downsampling in the sense that it provides less data, but more balanced where possible (depending on how imbalanced your data is... if you have 997 examples of one class and 3 of another, it simply will not work well, but it might do fine if you have 600 and 400).
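
    (In RapidMiner that is just the "Sample (Stratified)" operator; for anyone who wants to see the idea outside RapidMiner, here is a minimal Python sketch using scikit-learn. The DataFrame and its 600/400 label split are made up for illustration.)

    # Minimal stratified-sampling sketch with scikit-learn (toy data).
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.DataFrame({
        "feature": range(1000),
        "label": ["yes"] * 600 + ["no"] * 400,  # the 600-vs-400 case above
    })

    # Keep 10% of the rows while preserving the 60/40 class proportions.
    sample, _ = train_test_split(df, train_size=0.1,
                                 stratify=df["label"], random_state=42)
    print(sample["label"].value_counts())  # roughly 60 "yes" and 40 "no"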

    Finally, I would never downsample unless I knew for sure that I had enough repeated attributes, which is something you can only see if you examine the data first. I would go with SMOTE instead, which can be found in an extension named Operator Toolbox; it doesn't come with RapidMiner directly.
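
    (As an aside, the same technique exists outside RapidMiner too; a minimal sketch with Python's imbalanced-learn package, on purely synthetic data, looks like this.)

    # Minimal SMOTE sketch using imbalanced-learn (synthetic data).
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # A 90/10 class split; SMOTE's default k_neighbors=5 needs at least
    # a handful of minority examples, so 997-vs-3 would be too extreme.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                               random_state=42)
    print("before:", Counter(y))

    X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
    print("after: ", Counter(y_res))  # both classes now have equal counts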

    A lot of this is about understanding the data and what you want to accomplish. Let me know if any of these things is unclear and I'll do my best to explain.

    All the best,

    Rodrigo.


    Rena_2013 Member Posts: 7 Contributor I
    Hi, Rodrigo,
    I am not very clear about the meaning of "How many classes do I have", and I am sorry that I don't know how to judge how imbalanced the data is. The attachment is the dataset that I downloaded and am trying to deal with. I want to analyze whether any relevant attributes lead to repeated booking (I set it as the label) and use the NB/KNN algorithms with cross-validation. But it seems that I made some technical mistakes (I was unable to balance the data); I tried weighting but the software reported an error and I do not know how to deal with it~

    Thanks for your kindness~

    rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    edited April 2020
    Let's go step by step.

    Your dataset has a lot of columns. Which column do you want to predict? I assumed you want to predict whether a certain reservation is canceled, right? So the column you want to predict is named is_canceled.

    In RapidMiner, the column you want to predict is your Label. In other terminologies, the column you want to predict is your Class. Both concepts are the same.

    id | type  | wheels | engine
    ----------------------------
     1 | moto  | 2      | yes
     2 | moto  | 2      | yes
     3 | bike  | 2      | no
     4 | car   | 4      | yes
     5 | ????? | 2      | yes

    See the row with ID=5? I don't know the type of vehicle, but I know it has 2 wheels and an engine. That's what I want to predict, so that's my class or my label.

    How many non-null unique values are there in that column? That's the number of classes you have.
    moto (2)
    bike (1)
    car (1)
    You don't have to do a lot of convoluted things to find the number of values in your label. Just open your dataset (by importing it and double-clicking) and find this window:

    [screenshot: the dataset's Statistics view]

    Scroll to the right, and you will find something that says "Values". Click on "Details" and find this:

    [screenshot: the "Details" view showing the count of each label value]

    I have two classes, one called "No" and the other called "Yes", and the absolute count in each of these classes is not equal, so... this is highly imbalanced.
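
    (RapidMiner computes those counts for you in the Statistics view; for anyone following along outside RapidMiner, the equivalent check in pandas is a single value_counts call. The file name here is hypothetical.)

    # Quick class-imbalance check with pandas (hypothetical file name).
    import pandas as pd

    df = pd.read_csv("hotel_bookings.csv")
    counts = df["is_canceled"].value_counts()
    print(counts)                 # absolute count per class
    print(counts / counts.sum())  # proportions; far from 50/50 means imbalanced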

    Now that you came all the way here... how many classes do you have, and what is the count of each? I know the answer, but I want you to do the work yourself. It's the best way to learn. :) Once you get this, we'll make one example of weighting, another of upsampling, and another of downsampling.

    All the best,

    Rodrigo.

    MarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    @Rena_2013 you can watch this tutorial on RM Academy, which explains why and how you can handle imbalanced data:
    https://academy.rapidminer.com/learn/course/data-engineering-master/data-cleansing/cleansing
    @rfuentealba is giving you great advice that will help you understand the whole problem rather than focusing only on the fact that you have imbalanced data. With time you'll learn that understanding your data and your problem makes a lot of difference to the success of your solution.
    Rena_2013 Member Posts: 7 Contributor I
    Hi, Rodrigo,
    Thanks for your detailed explanation of how to judge imbalanced data in RM. I imported the data and eventually found the result:

    [screenshot: value counts for the "is_repeated" label]

    It seems that the data for the label "is_repeated" is imbalanced, am I right?

    So, what suitable measures should I take next to balance this data in RM? Thanks very much!

    Best,
    Rena
    Rena_2013 Member Posts: 7 Contributor I
    Hi MarcoBarradas,
    I tried the measures from the tutorial on my dataset in RM, and an error always appeared. I think I must have made some mistakes.
    lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @Rena_2013,

    Could you please share all your processes so that we can reproduce (and understand) what you observe?

    Regards,

    Lionel
    rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    edited April 2020
    Hello @Rena_2013,

    So far you're good to go to the next step. I'll leave these as "pending" for now, but:
    • You don't know if a column can uniquely identify a row. (e.g., if you have the ID of the person booking the hotel room).
    • You don't know if a column contains too many null values. (e.g., if the amount of people registering their company name is too low).
    • You don't know if a column is too stable to be a predictor. (e.g. if you have a column that has too many rows with the same value)
    • You don't know if a column can correlate to another column. (e.g. "Saturday" and "Sunday" will correlate to "Weekend")
    • You don't know if a date column can hide a date pattern. (more on this, probably later).
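    (All of these checks can be eyeballed in RapidMiner's Statistics view. For reference, a rough pandas sketch of the first four checks might look like the following; the file name is hypothetical and the data is assumed to load cleanly.)

    # Rough pandas sketch of the first four checks (hypothetical file name).
    import pandas as pd

    df = pd.read_csv("hotel_bookings.csv")

    for col in df.columns:
        n_unique = df[col].nunique()
        null_ratio = df[col].isna().mean()
        vc = df[col].value_counts(normalize=True)
        top_ratio = vc.iloc[0] if not vc.empty else 0.0

        if n_unique == len(df):
            print(f"{col}: unique per row -> likely an ID, not a predictor")
        if null_ratio > 0.5:
            print(f"{col}: {null_ratio:.0%} nulls -> probably unusable")
        if top_ratio > 0.95:
            print(f"{col}: too stable ({top_ratio:.0%} one value) -> weak predictor")

    # Fourth check: pairwise correlation between numeric columns.
    print(df.corr(numeric_only=True))
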
    We have a number of different approaches for sampling, depending on what we want, but none of them gives us a "balanced" dataset. We should install two extensions:
    • The Mannheim RapidMiner Toolbox has an operator for Sample (Balanced).
    • The Operator Toolbox has an operator for SMOTE (upsampling)
    What these operators do can be achieved by a combination of other operators, but it's time-consuming, and why should we do it when there is something that can do it for us? The next steps are: go to the Marketplace, find these extensions, install them, restart RapidMiner, and add the Sample (Balanced) operator. Place it just before your validation, configure it to get a balanced sample of, let's say, 10,000 examples, put a breakpoint, and see what happens.
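
    (If you're curious what a balanced sample is conceptually, a naive pandas version downsamples every class to the size of the smallest one. This is only a sketch of the idea, with a hypothetical file and label name, not what the Mannheim operator literally does.)

    # Naive balanced downsampling: shrink every class to the smallest
    # class's size (hypothetical file and label names).
    import pandas as pd

    df = pd.read_csv("hotel_bookings.csv")
    label = "is_repeated"

    smallest = df[label].value_counts().min()
    balanced = (
        df.groupby(label, group_keys=False)
          .apply(lambda g: g.sample(n=smallest, random_state=42))
    )
    print(balanced[label].value_counts())  # each class now has `smallest` rows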

    Now your training class is balanced, but keep in mind that this doesn't mean it's using all the data to train your model. Downsampling has that drawback: some data can be left behind, and you could end up with a model that isn't properly trained. Many people here are against it for these reasons. There is a way to check when it is safe to use downsampling, but this message is a little too long already.

    The process for upsampling is the same.

    Once you come back with your results, I'll explain how to do weighting and which algorithms support it. Is that OK?

    All the best,

    Rod. 

    rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    edited April 2020
    Ok, I couldn't resist the temptation.

    Let's say you have a trimmed-down version of the Titanic dataset that holds the following columns:
    Survived, Age, Gender, Passenger Class
    We could say: "Let's split the age into 10 bins". Now we have:
    • 2 values for "Survived", which is our label.
    • 10 values for "Age"
    • 2 values for "Gender"
    • 3 values for "Passenger Class"
    How many combinations can we make to cover all the possibilities for Age, Gender and Passenger Class? We should check this:

    1, first, male
    1, first, female
    1, second, male
    1, second, female
    ...

    Tedious, and it's the same as multiplying: 10 * 2 * 3 = 60 possible combinations for 916 examples. Would it be possible to use downsampling? We cover all the combinations, right? Riiight? Nope. We haven't considered that "Survived" has two possible values. So, if we took a stratified sample of 60 examples, it wouldn't catch everything. If we took a stratified sample of 120, we would "probably" catch everything, but we wouldn't have a good variety of data, and every combination would probably be distributed evenly.

    If we downsample to 240 examples, our chances are a little better, but still not great. I would dare to say that 600 samples (10 for each one of the 60 combinations) might be enough to train a reasonably good model quickly (and downsampling is used for speed, not for accuracy).
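
    (The combination count above is easy to verify in code; here is a tiny Python sketch with itertools, using made-up bin labels.)

    # Verifying the combination arithmetic with itertools (made-up labels).
    from itertools import product

    ages = [f"bin_{i}" for i in range(10)]   # 10 age bins
    genders = ["male", "female"]             # 2 values
    classes = ["first", "second", "third"]   # 3 passenger classes
    survived = ["yes", "no"]                 # 2 label values

    combos = list(product(ages, genders, classes))
    print(len(combos))                   # 10 * 2 * 3 = 60
    print(len(combos) * len(survived))   # 120 cells once the label is counted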

    When is such a thing needed? When we have a low number of possible combinations and a high number of examples, and we want to get an idea of what could be happening, while always keeping in mind that other strategies will provide us with much better decision power.

    I normally use downsampling at the very early stage of any kind of model because it helps me figure out certain patterns, but of my large collection of models, I think just 3 or 4 of the ones in production use downsampling as a training technique (and I'm known for writing data science models that use insane amounts of data).

    Just some complementary information.

    All the best,

    Rod.
    Rena_2013 Member Posts: 7 Contributor I
    Hi, Rodrigo,
    Now I understand why downsampling and upsampling are not always appropriate in data preprocessing. I downloaded the extension in RM, found the Sample (Balanced) operator, and tried it with the NB algorithm; the result came out a little better than without the Sample (Balanced) operator. But I think it is still not balanced enough, is it?

    [screenshot: confusion matrix from the performance result]

    I also found the SMOTE (upsampling) operator, but I am not sure how to set the parameters~

    rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Hi @Rena_2013,

    I'll focus on this image first, which is one of the most important results you can have for checking whether something is going well or not (there are others, such as p-values, F1 score, AUC, ROC, etc., but for a newcomer this one is the most important).

    This is how you read it:
    • Predicted 0 that are real 0 = 18662.
    • Predicted 1 that are real 1 = 10987.
    • Total predictions that are correct = 29649.
    Compare that number with this one:
    • Predicted 1 that are real 0 = 69.
    • Predicted 0 that are real 1 = 129.
    • Total predictions that are erroneous = 198.
    Your class precision and class recall values are very, very high. But you have 18791 values for false and 11056 values for true, meaning your data is not balanced.
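
    (To make the reading concrete, here are those same four numbers turned into class precision, class recall, and accuracy with plain Python arithmetic.)

    # The four confusion-matrix cells from above, as plain arithmetic.
    pred0_real0 = 18662  # correct "0" predictions
    pred1_real1 = 10987  # correct "1" predictions
    pred1_real0 = 69     # predicted 1, actually 0
    pred0_real1 = 129    # predicted 0, actually 1

    precision_1 = pred1_real1 / (pred1_real1 + pred1_real0)  # ~0.994
    recall_1 = pred1_real1 / (pred1_real1 + pred0_real1)     # ~0.988
    accuracy = (pred0_real0 + pred1_real1) / (
        pred0_real0 + pred1_real1 + pred1_real0 + pred0_real1
    )                                                        # ~0.993
    print(precision_1, recall_1, accuracy)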

    Check "allow downsampling" and play with the number of examples, from 5000 to 5000, to check how it varies.

    For a start, I would leave it as is and go back to check the other things I told you about before:
    • Unique values.
    • Stable values.
    • Null values.
    You can play with SMOTE and check which one suits you better. I normally leave the values as is.
      Rena_2013 Member Posts: 7 Contributor I
      Hi, Rodrigo,
      Thanks very much, now I have balanced data. Again, thanks for all your kind help!

      Best,
      Rena
      keb1811 Member Posts: 11 Contributor I
      Hi @rfuentealba, I also use the sample operator from the Mannheim Toolbox for upsampling a minority class... Can you say whether it's realistic that my weighted mean precision/recall increases from around 35% up to 70%? I have a classification problem with 12 classes in my label attribute.
      rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
      Hello @keb1811,

      No, you are looking at two different things.

      Though it's expected (well, not really expected, but it's a good sign) that your precision/recall goes up, precision/recall is something you examine to know how your model behaves with the data you gave it; it doesn't give any clues about how your data behaves.

      Data is balanced when all classes have equal amounts of data, e.g.

      3 apples, 3 oranges, 3 bananas, 3 pineapples.

      If you have:

      3 apples, 3 oranges, 3 bananas, 9 pineapples.

      That data is balanced only if you first do "pineapple" vs "non-pineapple" and then apply a second algorithm to the "non-pineapple" values. But notice that I'm looking at how many examples I have in each class, not at the precision/recall.

      Why? No matter how balanced your data is (taking the same example), if you put in a rotten orange (which is green-ish), the algorithm can categorize it as an apple (also green-ish). No matter how balanced your data is, the precision/recall will tell you that of all the predicted values, one was truly an orange and was predicted as an apple.

      A better idea would be to take a look at "predicted vs true". If all the classes have the same quantities (or similar ones), then your data is balanced.
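
      (A quick way to do that "predicted vs true" comparison in Python, with two hypothetical stand-in lists, could be:)

      # Comparing class counts in the true labels vs the predictions
      # (both lists are hypothetical stand-ins).
      from collections import Counter

      y_true = ["apple", "orange", "banana", "pineapple"] * 3
      y_pred = ["apple", "apple", "banana", "pineapple"] * 3  # oranges misread

      print("true:     ", Counter(y_true))   # 3 of each -> balanced data
      print("predicted:", Counter(y_pred))   # 6 apples, 0 oranges -> confusion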

      Can you provide the entire matrix?

      All the best,

      Rod.