Combined SMOTE Operator

darkphoenix_isa Member Posts: 4 Contributor I
edited June 2019 in Help
Hi there, I'm still new to RapidMiner and exploring it. I'm currently working on a project that involves an imbalanced dataset. According to some research papers, combining SMOTE with a selection (data-cleaning) algorithm can work well for imbalanced problems. I've already found the SMOTE operator in RapidMiner, but I couldn't find other selection algorithms such as Tomek links or ENN.
Is there an RM operator for those?

Best Answer

Answers

  • darkphoenix_isa Member Posts: 4 Contributor I
    Thank you very much for your response. I'll explore this solution.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    I am the author of the operator. Could you point me to some references showing the advantages? Maybe we can add it to the operator.

    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • darkphoenix_isa Member Posts: 4 Contributor I
    Dear Mr. Martin,

    Thank you for your attention. My reference for this problem is the following paper:

    https://www.sciencedirect.com/science/article/pii/S0925231215015908

  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    The main difference is that SMOTE oversamples the minority class, whereas Tomek links undersample the majority class. It would be great to have both.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Do you happen to know how this relates to Kennard-Stone sampling?
    BR,
    Martin
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    edited July 2019
    @mschmitz I am not an expert on this, but my understanding is that the Kennard-Stone (KS) algorithm aims to find two representative samples of your data set, e.g. for training and testing, by finding close pairs of data points and allocating the two members of each pair to separate partitions. Tomek links (TL), by contrast, finds close pairs made up of one minority-class point and one majority-class point, and then drops the majority-class point from each pair. As a result, the sample is better balanced and the classes are better separated.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    This makes a lot of sense, thanks @jacobcybulski
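
For readers who want to see the two ideas from this thread side by side, here is a minimal pure-Python sketch. It is not the RapidMiner operator's implementation. The function names (`tomek_links`, `remove_majority_in_links`, `smote_like`) and the toy data are hypothetical, and the oversampling step is a simplified stand-in for SMOTE that interpolates only towards the single nearest minority neighbor (k = 1), whereas SMOTE proper picks among several nearest neighbors. The Tomek-link step follows the description above: find mutual nearest neighbors with different class labels and drop the majority-class member.

```python
import math
import random


def nearest_neighbor(i, X):
    """Index of the point closest to X[i] (Euclidean distance), excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j) with i < j that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def remove_majority_in_links(X, y, majority_label):
    """Undersampling step: drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(X, y) for k in pair if y[k] == majority_label}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


def smote_like(X, y, minority_label, n_new, seed=0):
    """Simplified SMOTE-style oversampling (assumption: k = 1 neighbor only).

    Each synthetic point lies on the segment between a random minority point
    and its nearest minority neighbor. Needs at least two minority points.
    """
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = nearest_neighbor(i, minority)
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return X + synthetic, y + [minority_label] * n_new


# Toy data (made up for illustration): four majority and three minority points.
X = [(0.0, 0.0), (0.0, 0.5), (1.0, 0.0), (2.0, 2.0),
     (2.1, 2.0), (5.0, 5.0), (5.0, 5.5)]
y = ["maj", "maj", "maj", "maj", "min", "min", "min"]

print("Tomek links in the raw data:", tomek_links(X, y))

# Combined use: oversample the minority class first, then clean with Tomek links.
X_over, y_over = smote_like(X, y, "min", n_new=1)
X_clean, y_clean = remove_majority_in_links(X_over, y_over, "maj")
print("class counts after both steps:",
      {label: y_clean.count(label) for label in set(y_clean)})
```

The brute-force neighbor search is quadratic in the number of points, so this is only meant to make the logic concrete on small data, not to replace a library implementation.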