
Question about Decision Tree / WEKA SimpleCART

Ghostrider Member Posts: 60 Contributor II
edited December 2019 in Help
I have a lot of data which is labeled into 4-5 classification groups.  I have 3-4 positive groups and 1 negative group.  I'm really interested in classifying the 3-4 positive groups, but the negative group makes up > 99% of the data.  So if I try to optimize for accuracy, I end up with a tree with 1000 nodes, basically just curve fitting the data.  If I set a minimum number of instances per node very high, in the extreme case, it just assigns everything to the positive group.  Does anyone have some suggestions for dealing with this issue?  Anyone know of a good guide for WEKA parameter tuning?
Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    in principle there is not much difference between tuning the Weka parameters and those of the RapidMiner operators. Did you already try the parameter optimization operators, which can automatically search for optimal parameter combinations?

    Cheers,
    Ingo
  • Ghostrider Member Posts: 60 Contributor II
    Not yet, I'll check them out.
  • harris Member Posts: 8 Contributor II
    Hi ghostrider.

    The generic solution to the imbalanced-dataset problem is to undersample the dominant (negative) class so that the resulting dataset is more balanced. This encourages a class-balanced model (decision tree or other classifier) and class-balanced predictions.

    I'm not sure how over/undersampling is done in RapidMiner (I am a relative newbie), but I know Weka better (if you can use that): search the wekalist archives for the keywords "undersampling" or "imbalanced" to find a Weka undersampling procedure I developed. (Unfortunately, the Weka filters required to implement it have not been subsumed into RapidMiner.)
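    The idea itself is simple enough to sketch in plain Python (illustrative only; the function name and `ratio` parameter are my own, not a Weka or RapidMiner API):

    ```python
    import random

    def undersample(rows, labels, majority_label, ratio=1.0, seed=42):
        """Randomly drop majority-class rows until the majority class is at
        most `ratio` times the size of all other classes combined."""
        rng = random.Random(seed)
        majority = [i for i, y in enumerate(labels) if y == majority_label]
        minority = [i for i, y in enumerate(labels) if y != majority_label]
        keep = min(len(majority), int(ratio * len(minority)))
        kept = rng.sample(majority, keep) + minority
        rng.shuffle(kept)
        return [rows[i] for i in kept], [labels[i] for i in kept]

    # 990 negatives and 10 positives -> a balanced 10 vs 10 training set.
    X = list(range(1000))
    y = ["neg"] * 990 + ["pos"] * 10
    Xb, yb = undersample(X, y, majority_label="neg")
    print(yb.count("neg"), yb.count("pos"))  # 10 10
    ```

    Note that you throw away most of the negative data this way, which is exactly the concern raised further down the thread.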

    As regards your other question, use W-GridSearch in RapidMiner to tune the Weka parameters. I'm sure RapidMiner has its own equivalent optimisation schemes (Optimize Parameters?).
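    Conceptually, a grid search just evaluates every parameter combination and keeps the best one. A minimal Python sketch of that loop (the parameter names and the scoring stub are illustrative, not the actual W-GridSearch interface):

    ```python
    from itertools import product

    # Hypothetical parameter grid for a CART-style tree (names illustrative).
    grid = {"min_instances_per_leaf": [2, 10, 50], "max_depth": [3, 5, 10]}

    def evaluate(params):
        # Stand-in for a cross-validated score; a real run would train and
        # validate the tree with `params` and return e.g. balanced accuracy.
        return -abs(params["min_instances_per_leaf"] - 10) - abs(params["max_depth"] - 5)

    # Enumerate every combination and keep the highest-scoring one.
    best = max(
        (dict(zip(grid, combo)) for combo in product(*grid.values())),
        key=evaluate,
    )
    print(best)  # {'min_instances_per_leaf': 10, 'max_depth': 5}
    ```

    With an imbalanced dataset, make sure the score you optimise is class-sensitive (e.g. balanced accuracy), not plain accuracy, or the grid search will reward the same degenerate trees you started with.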

    best, Harri Saarikoski
  • Fred12 Member Posts: 344 Unicorn

    hi,
    I have a question concerning undersampling:

    how is it best done? It could happen that the undersampled part of the negative class you pick is very similar to the 0.01% positive class, or that it is very different from the positive minority class... which I guess gives extremely different decision trees in the end?

    What is recommended for that case then? Split the majority negative class into several portions, each the size of the positive class, use each portion (plus the positives) to build one decision tree, and then average / majority-vote over all the trees at the end?
