Random Forest overfitting

santa Member Posts: 1 Contributor I
Hi everybody,

I am running RF on a typical binary classification problem (training
set: 34 cases and 530 variables) using RapidMiner. The first and major
problem is that the algorithm reports 100% performance on this training
set every time, which CAN'T be true and makes me strongly believe that
it is over-fitting. But the algorithm's developers, Breiman & Cutler,
have specifically remarked that RF doesn't over-fit. So I am wondering
whether other people have had a similar experience and have suggestions
on how to avoid it.

I have tried all sorts of options to avoid it (pruning, increasing the
number of trees, increasing the number of variables tried at each node,
etc.). The one thing I have not done (and am not willing to do) is
reduce the number of variables, as I want to run it in an unbiased way
without any 'a priori' selection of 'important' variables. Moreover, as
far as the literature goes, RF should do well when the variables are
many and the cases are few (n << p).

Any bit of help/suggestion will be highly appreciated.

TIA!
san

Answers

  • wessel Member Posts: 537  Guru
    A dataset with only 34 data rows and 530 variables in each row is a hard problem.
    If it does not work, why not use another classifier?
    Maybe boosting of decision stumps.

    Why would a random forest not over-fit?
    As far as I recall, a random forest selects a variable randomly and then calculates the optimal split for this variable.
    Iterating this process creates a single tree. Bagging this method creates multiple trees, which together form an ensemble.
    Surely calculating the optimal split is a process likely to over-fit?

    A learning algorithm with a random component has a big variance.
    Bagging is used to reduce this variance.
    Bagging is not some kind of magic which can prevent over-fitting.
  • ollestrat Member Posts: 9 Contributor II
    As far as I understand, the "classical" RF of Breiman uses bootstrapped cases for growing each tree (at each node a specified number of variables is randomly chosen, then the best-split criterion is applied). The left-out cases (out of bag) are sent through the tree and used for the majority voting and the accuracy estimation.

    However, the RapidMiner RF seems to do the random variable selection per tree, not per node, so ONE set of variables is used for the whole tree. As this would be a distinct difference from Breiman's version, I'm also not sure about the implementation of the accuracy estimation. Maybe the estimate is not based on the cases left out of each tree, but on the whole example set running through the forest. The latter could easily lead to your 100 percent accuracy, as it would be testing on the training set (with a high-variance classifier and such a small example set).

    But I'm not sure about any of this. I raised a question rather than answering yours.

    greetings
  • wessel Member Posts: 537  Guru
    Some more details are needed here.

    Bagging is short for bootstrap aggregating.

    I'm not sure I understand your comment on the workings of the RapidMiner RF:
    Accuracy estimation running through the forest? Eh? Accuracy estimation is an internal process in RF?
  • ollestrat Member Posts: 9 Contributor II
    The "out of bag error" is an internal accuracy estimate (http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm), but I just realized that it's not implemented in the RapidMiner RF version, only in the WEKA RF version. I mixed something up here.
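
The distinction raised in this thread, scoring the forest on the cases it was trained on versus scoring each case only with the trees that never saw it (out of bag), can be illustrated with a rough sketch. This is not RapidMiner's or WEKA's implementation. It is a toy forest written for illustration only, and everything in it is an assumption: the function names (grow, predict, accuracy, run_demo), the 0/1 features, the pure-noise labels, the 100 trees, and the choice of about sqrt(p) variables tried per node. Because the labels are random, there is nothing real to learn, so any high score is memorization.

```python
import random

def grow(rows, labels, idx, rng, n_try):
    """Grow an unpruned tree on 0/1 features; idx lists the cases at this node."""
    ys = [labels[i] for i in idx]
    if len(set(ys)) == 1:
        return ys[0]                                  # pure leaf
    n_feat = len(rows[0])
    best = None
    # Try a random subset of variables first (per node, as in Breiman's RF);
    # fall back to all variables if none of the sampled ones splits the node.
    for candidates in (rng.sample(range(n_feat), n_try), range(n_feat)):
        for f in candidates:
            left = [i for i in idx if rows[i][f] == 0]
            right = [i for i in idx if rows[i][f] == 1]
            if not left or not right:
                continue
            err = 0                                   # errors under majority vote per side
            for side in (left, right):
                ones = sum(labels[i] for i in side)
                err += min(ones, len(side) - ones)
            if best is None or err < best[0]:
                best = (err, f, left, right)
        if best is not None:
            break
    if best is None:                                  # identical rows: majority leaf
        return int(2 * sum(ys) >= len(ys))
    _, f, left, right = best
    return (f, grow(rows, labels, left, rng, n_try),
               grow(rows, labels, right, rng, n_try))

def predict(tree, row):
    """Follow the splits until a leaf (an int label) is reached."""
    while isinstance(tree, tuple):
        f, left, right = tree
        tree = left if row[f] == 0 else right
    return tree

def accuracy(votes, labels):
    """Majority-vote accuracy over the cases that received at least one vote."""
    scored = [(v, y) for v, y in zip(votes, labels) if v]
    hits = sum(int(2 * sum(v) >= len(v)) == y for v, y in scored)
    return hits / len(scored)

def run_demo(n=34, p=530, n_trees=100, seed=0):
    rng = random.Random(seed)
    rows = [[rng.randint(0, 1) for _ in range(p)] for _ in range(n)]
    labels = [rng.randint(0, 1) for _ in range(n)]    # pure noise labels
    n_try = int(p ** 0.5)                             # variables tried per node
    all_votes = [[] for _ in range(n)]                # every tree votes on every case
    oob_votes = [[] for _ in range(n)]                # only trees that never saw the case
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]    # bootstrap sample
        tree = grow(rows, labels, bag, rng, n_try)
        in_bag = set(bag)
        for i in range(n):
            vote = predict(tree, rows[i])
            all_votes[i].append(vote)
            if i not in in_bag:
                oob_votes[i].append(vote)
    return accuracy(all_votes, labels), accuracy(oob_votes, labels)

train_acc, oob_acc = run_demo()
print("accuracy on the training set:", train_acc)
print("out-of-bag accuracy:", oob_acc)
```

With 34 cases, each bootstrap sample leaves out roughly (1 - 1/34)^34, about 36%, of the cases, so most trees voting on any given training case have already memorized it. Under these assumptions the training-set figure should come out at or near 100% even though the labels are noise, while the out-of-bag figure should sit somewhere around chance. A 100% score on the training set therefore says little about over-fitting either way; an estimate from left-out cases is the number to look at.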