"Performance Criterion Unbalanced Data"

Ozone Member Posts: 17 Maven
edited May 2019 in Help
Hi Community,

there are a lot of discussions about unbalanced data in this forum, but I still cannot solve my problem.

What I have:
1. Very unbalanced data (3000 positive, 80 negative) with about 60 numeric predictors and one binary label.

What I want:
1. I want to build the best possible decision tree on this data.

What I did:
1. I split the data into training and test sets.
2. IN THE TRAINING DATA I oversampled the negative class by a factor of 20 (bootstrap sampling) and undersampled the positive class by a factor of 0.4 -> better balanced data.
3. I tried several feature selection algorithms.
4. I built a decision tree on the selected features (minimal leaf size: 15, minimal size for split: 100).
5. I applied it to the original TEST data set to get a performance value.
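
For readers who want to reproduce steps 1 and 2 outside RapidMiner, here is a minimal plain-Python sketch. Only the class counts (3000 / 80) and the factors 20 and 0.4 come from the post. The 70/30 stratified split, the helper name `split_and_rebalance`, and the toy rows are assumptions made for illustration. Feature selection and tree training are left out.

```python
import random

def split_and_rebalance(rows, label_of, train_fraction=0.7,
                        oversample_factor=20, undersample_fraction=0.4, seed=0):
    """Stratified split, then rebalance the training part only."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in ("positive", "negative"):
        members = [r for r in rows if label_of(r) == cls]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)  # assumed 70/30 split
        train.extend(members[:cut])
        test.extend(members[cut:])

    pos = [r for r in train if label_of(r) == "positive"]
    neg = [r for r in train if label_of(r) == "negative"]
    # Oversample the rare class: draw with replacement (bootstrap).
    neg_over = rng.choices(neg, k=len(neg) * oversample_factor)
    # Undersample the frequent class: draw without replacement.
    pos_under = rng.sample(pos, k=round(len(pos) * undersample_fraction))
    return pos_under + neg_over, test

# Toy rows (id, label) standing in for the real 60-predictor examples.
rows = [(i, "positive") for i in range(3000)]
rows += [(i, "negative") for i in range(3000, 3080)]

balanced_train, test = split_and_rebalance(rows, label_of=lambda r: r[1])
print(len(balanced_train), len(test))
```

The test set keeps the original class ratio because resampling happens only after the split, so no copy of a test instance can end up in the training data.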

What is my problem:
1. Which performance criterion should I use for TRAINING and/or TESTING? Because of the unbalanced data I decided to use AUC. Is that the best choice? Remark: the training data is more balanced than the test data.
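
One property of AUC that is relevant to the remark about differing class ratios: AUC equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, so it depends only on the ranking and not on how many instances each class has. The sketch below is a plain-Python illustration with made-up scores (the function name `auc` and all numbers are hypothetical). It does not settle whether AUC is the best criterion for this problem.

```python
def auc(scores_pos, scores_neg):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Made-up model scores for a handful of examples.
pos = [0.9, 0.8, 0.7, 0.4]
neg = [0.6, 0.3]

base = auc(pos, neg)
# Duplicating every negative 20 times changes the class ratio, not the ranking.
duplicated = auc(pos, neg * 20)
# A model that gives every example the same score is uninformative.
uninformative = auc([0.5] * 30, [0.5] * 2)

print(base, duplicated, uninformative)
```

By contrast, a model that always predicts "positive" would reach an accuracy of 3000 / 3080 (about 97%) on the original data while being useless for the rare class, which is the usual argument against plain accuracy here.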

I'm sure you have had similar problems, too?








Answers

  • Ozone Member Posts: 17 Maven
    I have to add another question:

    Is it a good idea to set the minimal leaf size to the oversampling ratio (of the rare class) when the data is extremely unbalanced?

    I think there are two reasons for doing it like this:

    Assume that each instance of the rare class is important and should be classified correctly. Then:

    1. a minimal leaf size smaller than the oversampling ratio may lead to overfitting
    2. a minimal leaf size larger than the oversampling ratio may discount some (or all) of the rare cases

    Do you think that these assumptions and ideas are correct?

    The second point leads to another question: a look at my oversampled class (with bootstrap sampling) shows that not every instance is copied equally often. Is there any rule for how many copies of each instance this sampling method makes? As far as I can tell, this is the only method for oversampling the data?
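
The uneven copy counts are what uniform sampling with replacement produces. Assuming the bootstrap draws each of the m = n * factor samples independently and uniformly from the n rare instances, the number of copies of one instance follows a Binomial(m, 1/n) distribution: the mean is the oversampling factor, but individual counts scatter around it. The sketch below simulates this in plain Python. The value 80 is the rare-class count of the full data set and is used only for illustration, and the helper name `bootstrap_copy_counts` is hypothetical.

```python
import random
from collections import Counter

def bootstrap_copy_counts(n_instances, factor, seed=0):
    """Draw n_instances * factor samples with replacement; count copies per instance."""
    rng = random.Random(seed)
    draws = rng.choices(range(n_instances), k=n_instances * factor)
    counts = Counter(draws)
    return [counts.get(i, 0) for i in range(n_instances)]

factor = 20
counts = bootstrap_copy_counts(80, factor)

mean_copies = sum(counts) / len(counts)   # always equals the factor
below = sum(c < factor for c in counts)   # instances with fewer copies than the factor

print("min/max copies:", min(counts), max(counts))
print("mean copies:", mean_copies)
print("instances with fewer than", factor, "copies:", below)
```

This bears on the leaf-size idea above: if the minimal leaf size is set equal to the oversampling factor, every instance that happened to receive fewer copies than the factor cannot fill a leaf on its own. Replicating each instance exactly `factor` times (in plain Python, `instances * factor`) would give equal counts instead.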

    I hope you can help me!



