Dealing with imbalanced data

In777 Member Posts: 29 Contributor II
edited December 2018 in Help

I am not sure how to deal with imbalanced data in RapidMiner. This question has been asked before, but the posts are old. For example, one suggestion is to use the equal-weights operator, but I cannot find it in the current version. The other approach is oversampling or SMOTE. How can this be done in RapidMiner, and how should I do the cross-validation in the case of oversampling or SMOTE?

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    To address class imbalance, your two main options are sampling or weighting. There are multiple operators for both inside RapidMiner. Exactly which operator you choose, and the parameters associated with it, will depend in part on the size of your data, your attributes, the learning algorithm you are trying to use, etc.

     

    The "Generate Weight  - Stratification" operator will assign weights so both class sums are equal, and that's a good starting point, although you are free to use other weighting operators to assign whatever weights you want.  But not all algorithms accept weights. For instance, Decision Trees can accepted weighted examples but Random Forest cannot.  

     

    The native RM "Sample" operator has a "balance data" option that allows you to specify different sampling rates by class, which will allow you to downsample the majority class.  There is another "Sample - Balance" operator available in the free Mannheim Toolbox extension that also allows upsampling of the minority class.
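
    Conceptually, those two operators do something like the following pandas sketch (not RapidMiner; the example set here is made up):

    ```python
    import pandas as pd

    # toy example set standing in for the real data (purely illustrative)
    df = pd.DataFrame({
        "text":  ["rare sentence"] * 5 + ["common sentence"] * 95,
        "label": ["pos"] * 5 + ["neg"] * 95,
    })
    minority = df[df["label"] == "pos"]
    majority = df[df["label"] == "neg"]

    # downsample the majority class to the size of the minority class
    balanced_down = pd.concat([minority, majority.sample(n=len(minority), random_state=42)])

    # or upsample the minority class by sampling with replacement
    balanced_up = pd.concat([minority.sample(n=len(majority), replace=True, random_state=42), majority])

    print(balanced_down["label"].value_counts())
    print(balanced_up["label"].value_counts())
    ```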

     

    I hope these are helpful starting points.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • In777 Member Posts: 29 Contributor II

    Thank you very much for the answer, Brian. I managed to do undersampling with the RM "Sample" operator and oversampling with the RM "Sample (Bootstrapping)" operator (it simply copies several instances of the text). The oversampling is generally better than the undersampling, but the cross-validation for oversampling shows that I have an overfitting problem (98% on the training set and 55% on the test set). I do not know how to calculate the performance in this case, so I decided to try weighting instead.
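
    (For reference, the usual advice is to apply the oversampling only to the training folds inside the cross-validation, never to the whole data set beforehand, so the test folds stay untouched. A rough Python/imbalanced-learn sketch of that idea, as an analogue rather than my actual RapidMiner process:)

    ```python
    # the oversampler is re-fit on the training part of each fold only
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC
    from imblearn.over_sampling import RandomOverSampler  # or SMOTE
    from imblearn.pipeline import Pipeline

    # imbalanced toy data (illustrative only)
    X, y = make_classification(n_samples=2000, weights=[0.96, 0.04], random_state=0)

    pipe = Pipeline([
        ("oversample", RandomOverSampler(random_state=0)),
        ("svm", LinearSVC()),
    ])

    print(cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy").mean())
    ```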

     

    About weighting: I use a linear C-SVM, which does not work with weights. Are there any other weight-generation operators that will work with an SVM? I could switch to Naive Bayes, but it does not work that well with text.

     

    Some information on the task: it is a binary classification with 1,000 sentences in one category and 25,000 sentences in the outside category. With oversampling I get about 18,000 n-grams with TF-IDF weights as features.

     

    And a general question: which is better, weighting or resampling? How do I choose?

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I think the general consensus is that weighting is the best solution, since information is neither artificially created nor discarded. But not everything works with weights. LibSVM doesn't, but the regular SVM does, so you may want to try that instead.
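
    As a conceptual parallel (a scikit-learn sketch, not the RapidMiner operator itself), a class-weighted linear SVM looks roughly like this; "balanced" gives each class the same total weight, much like stratified weight generation:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    # imbalanced toy data (illustrative only)
    X, y = make_classification(n_samples=2000, weights=[0.96, 0.04], random_state=0)

    # re-weight examples so both classes carry the same total weight
    clf = LinearSVC(class_weight="balanced", C=1.0)
    clf.fit(X, y)
    print(clf.score(X, y))
    ```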

    Also, it sounds to me like you have a LOT of tokens based on n-grams. I would consider pruning those back considerably, or you are going to have specificity problems anyway, where the number of attributes >> the number of examples. N-grams are most useful when two words mean something together that they don't mean apart ("room" and "service" are not the same thing conceptually as "room service"), but not as useful when the words don't really change meaning (adding "bad service" doesn't contribute much beyond "bad" and "service"). I would also prune low-frequency terms in general, since they typically add very little to any predictive model.
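
    For example, document-frequency pruning and a cap on the vocabulary size would look roughly like this in a scikit-learn sketch (an analogue of the pruning options in text processing, not your RapidMiner setup; data and thresholds are made up):

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["bad room service", "great service", "room was fine", "bad service again"]

    vec = TfidfVectorizer(
        ngram_range=(1, 2),   # single words plus bigrams such as "room service"
        min_df=2,             # drop terms occurring in fewer than 2 documents
        max_features=5000,    # hard cap on the number of attributes
    )
    X = vec.fit_transform(docs)
    print(vec.get_feature_names_out())
    ```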

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • In777 Member Posts: 29 Contributor II

    Thank you for the comments. I think I will still work with the n-grams, since they are more informative in my case than simple words. I am doing multi-label/multi-output supervised classification, so I divided my text-mining task into several binary problems (manually) and perform one-vs-rest classification (for 20 classes). That is why I have imbalanced data. All my classes come from one domain of science, and only at the level of n-grams can I tell them apart. Besides, in the unclassified data set I also have sentences that do not belong to any class; that's why I use several binary classifiers. I am aware of the possible correlations between the classes, but I do not know how to solve this problem better. Maybe you have some ideas?
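
    To illustrate the setup I mean, here is a rough Python sketch outside RapidMiner (class names and data are made up):

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    docs   = ["alpha beta", "beta gamma", "gamma delta", "delta alpha"]
    labels = ["c1", "c2", "c1", "c3"]

    # one binary classifier per class, each trained class-vs-rest
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        OneVsRestClassifier(LinearSVC(class_weight="balanced")),
    )
    clf.fit(docs, labels)
    print(clf.predict(["beta delta"]))
    ```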

    Besides, I've read that SVMs cope well with a large number of features and are therefore well suited for text classification, so I hope the SVM can work well with n-grams. I also prune the rarest 5% of words (plus stop-word removal, stemming, etc.).

    I also have a follow-up question about weighting and SVM: you suggest using another SVM operator, for example the linear SVM. What is the difference between LibSVM and the RM SVM? I know that data scientists mostly work with LibSVM; maybe it is better to choose LibSVM? Is there perhaps still some way to generate weights and use them with LibSVM? I saw that LibSVM has a "class weights" parameter; can I use 0.5/0.5 for both classes? Is that the same as the "Generate Weights" RM operator?
