Random Forest overfitting
I am running a random forest for a typical binary classification problem (training set: 34 cases and 530 variables) using RapidMiner. The first and major problem is that the algorithm reports 100% performance on this training set every time, which CAN'T be true and which makes me strongly believe it is overfitting. But the algorithm's developers, Breiman & Cutler, have specifically remarked that RF doesn't overfit. So I am wondering whether other people have had a similar experience and have suggestions on how to avoid it.
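For illustration (this is not my actual RapidMiner setup, and the data here is synthetic), a minimal scikit-learn sketch seems to reproduce the effect on pure-noise data of the same shape: re-substitution accuracy on the training set comes out near 100% even though there is no signal at all, while the out-of-bag estimate stays near chance level.

```python
# Sketch, assuming scikit-learn: fit a random forest to 34 cases x 530
# noise variables with random labels, then compare training-set accuracy
# (re-substitution) against the out-of-bag estimate.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(34, 530))      # 34 cases, 530 variables, pure noise
y = rng.integers(0, 2, size=34)     # random labels: no real signal

rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            random_state=0).fit(X, y)

train_acc = rf.score(X, y)   # re-substitution accuracy: close to 1.0
oob_acc = rf.oob_score_      # out-of-bag accuracy: near chance
print(train_acc, oob_acc)
```

This suggests the 100% figure is a property of evaluating on the training set itself, not of the data, since each tree is grown to fit its bootstrap sample.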
I have tried all sorts of options to avoid it (such as pruning, increasing the number of trees, and increasing the number of variables tried at each node). The thing I have not done (and am not willing to do) is reduce the number of variables, as I want to run it in an unbiased way without any a priori selection of 'important' variables. Moreover, as far as the literature goes, RF should do well when the number of variables is large and the number of cases is small (n << p).
Any help or suggestions will be highly appreciated.