Problem with overfitting
I have a problem with overfitting.
It is a classification task with 8 label values and 6 attributes, each with about 5.5 million values.
With 10-fold cross-validation, my decision tree reaches an accuracy of about 93%. Unfortunately, when I apply the model to new data, the test accuracy is only 33%.
Can anyone tell me how to keep the tree from overfitting to the training data?
I have chosen the following parameters for the decision tree:
criterion: information gain
maximum depth: 30
apply pruning: yes
apply prepruning: yes
minimum gain: 0.0
minimum leaf size: 1
minimum size for split: 1
number of prepruning alternatives: 0
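
For comparison, here is a minimal sketch of how settings like the ones above could be tried out in scikit-learn. This is an assumption for illustration only: it is not the tool I used, the parameter mapping is approximate (scikit-learn has no direct equivalent of "number of prepruning alternatives", and its smallest allowed split size is 2), the data is a synthetic stand-in with 6 attributes and 8 classes, and the "stricter" values are arbitrary examples rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption): 6 attributes, 8 classes.
X, y = make_classification(
    n_samples=4000, n_features=6, n_informative=6, n_redundant=0,
    n_classes=8, n_clusters_per_class=1, random_state=0,
)
# Hold back a part of the data to play the role of "new data".
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def evaluate(tree):
    """Return (10-fold CV accuracy, accuracy on held-back data, leaf count)."""
    cv_acc = cross_val_score(tree, X_train, y_train, cv=10).mean()
    new_acc = tree.fit(X_train, y_train).score(X_new, y_new)
    return cv_acc, new_acc, tree.get_n_leaves()


# Rough analogue of the settings listed above: deep tree, leaves of size 1,
# no minimum gain. min_samples_split=2 is the smallest value scikit-learn allows.
loose = DecisionTreeClassifier(
    criterion="entropy", max_depth=30, min_samples_leaf=1,
    min_samples_split=2, min_impurity_decrease=0.0, random_state=0,
)
# Stricter prepruning (illustrative values only, chosen arbitrarily).
strict = DecisionTreeClassifier(
    criterion="entropy", max_depth=10, min_samples_leaf=50,
    min_samples_split=100, min_impurity_decrease=1e-3, random_state=0,
)

loose_cv, loose_new, loose_leaves = evaluate(loose)
strict_cv, strict_new, strict_leaves = evaluate(strict)
print(f"loose : cv={loose_cv:.3f} new={loose_new:.3f} leaves={loose_leaves}")
print(f"strict: cv={strict_cv:.3f} new={strict_new:.3f} leaves={strict_leaves}")
```

In this sketch the held-back data comes from the same source as the training data, so the two accuracies should be fairly close for both trees. The part that carries over to my settings is the leaf count: a minimum leaf size of 1 with a minimum gain of 0.0 lets the tree keep splitting until it memorizes individual rows.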