"Performance Criterion Unbalanced Data"
There are a lot of discussions about unbalanced data in this forum, but I still cannot crack my problem.
What I have:
1. I have very unbalanced data (3000 positive, 80 negative examples) with about 60 numeric predictors and one binary label.
What I want:
1. I want to build the best possible decision tree on that data.
What I did:
1. I split the data into training and test sets.
2. In the TRAINING data only, I oversampled the negative class by a factor of 20 (bootstrap sampling) and undersampled the positive class by a factor of 0.4 -> better-balanced data.
3. I used several feature selection algorithms.
4. I built a decision tree on the selected features (minimal leaf size: 15, minimal size for split: 100).
5. I applied it to the original TEST data set to get a performance value.
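For anyone who wants to reproduce the workflow outside the RapidMiner GUI, here is a rough sketch of the same steps in Python with scikit-learn. The data is synthetic (just a stand-in with the same 3000/80 shape described above), and all parameter values are taken from the steps listed here; nothing else is from the original process.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the data described above:
# 3000 positive and 80 negative examples, 60 numeric predictors.
X_pos = rng.normal(0.0, 1.0, size=(3000, 60))
X_neg = rng.normal(0.5, 1.0, size=(80, 60))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 3000 + [0] * 80)

# Step 1: hold out an untouched test set (stratified to keep both classes).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Step 2: rebalance ONLY the training set.
pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train == 0)[0]
# Oversample the minority (negative) class by factor 20 with replacement (bootstrap).
neg_boot = rng.choice(neg_idx, size=len(neg_idx) * 20, replace=True)
# Undersample the majority (positive) class to 40%.
pos_sub = rng.choice(pos_idx, size=int(len(pos_idx) * 0.4), replace=False)
bal_idx = np.concatenate([neg_boot, pos_sub])
X_bal, y_bal = X_train[bal_idx], y_train[bal_idx]

# Step 4: decision tree with the stated size constraints.
tree = DecisionTreeClassifier(min_samples_leaf=15, min_samples_split=100,
                              random_state=0)
tree.fit(X_bal, y_bal)

# Step 5: evaluate on the original, still-imbalanced test set.
auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
print("test AUC:", round(auc, 3))
```

The important point the sketch preserves: resampling happens after the split, so the test set keeps the real class distribution.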
What is my problem:
1. What performance criterion should I use for TRAINING and/or TESTING? Because of the unbalanced data I decided to use AUC. Is that the best choice?! Note: the training data is more balanced than the test data.
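To make the comparison concrete, this toy snippet scores one hypothetical set of test predictions (roughly the 3000:80 class ratio) with three criteria that behave differently under imbalance: AUC is rank-based and insensitive to the class prior, precision-recall AUC for the minority class is sensitive to it, and balanced accuracy averages per-class recall at a fixed threshold. The scores here are fabricated purely for illustration, not from any real model.

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             balanced_accuracy_score)

rng = np.random.default_rng(1)

# Hypothetical test-set scores: 750 positives, 20 negatives.
y_true = np.array([1] * 750 + [0] * 20)
scores = np.concatenate([rng.normal(0.7, 0.2, 750),   # positives score higher
                         rng.normal(0.4, 0.2, 20)])   # negatives score lower
scores = np.clip(scores, 0.0, 1.0)

# ROC AUC: rank-based, unaffected by the 750:20 prior.
auc = roc_auc_score(y_true, scores)

# PR AUC for the MINORITY (negative) class: flip labels and scores so
# the rare class is treated as the positive class.
ap_min = average_precision_score(1 - y_true, 1 - scores)

# Balanced accuracy at a 0.5 threshold: mean of per-class recalls.
bac = balanced_accuracy_score(y_true, (scores >= 0.5).astype(int))

print("AUC:", round(auc, 3),
      "minority PR-AUC:", round(ap_min, 3),
      "balanced acc:", round(bac, 3))
```

Since the test set keeps the real distribution, AUC is a defensible choice; reporting minority-class PR-AUC alongside it usually gives a more honest picture of how well the rare class is actually caught.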
I'm sure some of you have run into similar problems, too?!