
Implementation of Random Forest (versus Decision Forest?)

ollestrat Member Posts: 9 Contributor II
Hi all,

I'm wondering how the RapidMiner RandomForest classifier is implemented. It seems to me that there are significant differences from Breiman's version (Breiman, L.: Random Forests. Machine Learning, 45, 5–32, 2001).

The main features of Random Forests are:
- each tree is grown on its own bootstrap sample of the training set
- at each node of a tree, a fixed number of features is randomly selected and evaluated for the best split

Does the RapidMiner RandomForest classifier work like that? Are the individual trees grown on bootstrap samples? And I suspect the number of features is determined once for the whole tree rather than at each node (?). If so, it would rather resemble the "Decision Forest" of Ho (Ho, T.K.: The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 8, August 1998).
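
Just to make the distinction concrete: below is a toy Java sketch (not RapidMiner code; the class and method names are made up for illustration, and split search, data handling and stopping rules are stubbed out) showing where the two schemes differ. Breiman draws a fresh candidate feature subset at every node, while Ho's Random Subspace method draws one subspace per tree and reuses it at every node.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Toy sketch only: shows WHERE the random feature selection happens in
    // Breiman's Random Forest (per node) versus Ho's Random Subspace method
    // (per tree). Everything else about tree growing is stubbed out.
    public class FeatureSamplingSketch {

        private static final Random RNG = new Random(1);

        // Breiman (2001): draw a fresh random subset of features at EVERY node.
        static void growNodeBreiman(List<Integer> allFeatures, int mtry, int depth) {
            if (depth == 0) return;                                      // stub stopping rule
            List<Integer> candidates = randomSubset(allFeatures, mtry);  // per-node sampling
            System.out.println("depth " + depth + ": candidate features " + candidates);
            growNodeBreiman(allFeatures, mtry, depth - 1);               // left child (stub)
            growNodeBreiman(allFeatures, mtry, depth - 1);               // right child (stub)
        }

        // Ho (1998): draw ONE random subspace per tree and reuse it at every node.
        static void growTreeHo(List<Integer> allFeatures, int k, int depth) {
            List<Integer> subspace = randomSubset(allFeatures, k);       // per-tree sampling
            growNodeHo(subspace, depth);
        }

        static void growNodeHo(List<Integer> subspace, int depth) {
            if (depth == 0) return;                                      // stub stopping rule
            System.out.println("depth " + depth + ": candidate features " + subspace);
            growNodeHo(subspace, depth - 1);                             // left child (stub)
            growNodeHo(subspace, depth - 1);                             // right child (stub)
        }

        static List<Integer> randomSubset(List<Integer> features, int size) {
            List<Integer> copy = new ArrayList<>(features);
            Collections.shuffle(copy, RNG);
            return copy.subList(0, Math.min(size, copy.size()));
        }

        public static void main(String[] args) {
            List<Integer> features = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
            System.out.println("-- Breiman-style: new candidate subset at each node --");
            growNodeBreiman(features, 3, 2);
            System.out.println("-- Ho-style: one subspace for the whole tree --");
            growTreeHo(features, 3, 2);
        }
    }

Running it prints a different candidate set at each node for the Breiman-style tree and the same subspace at every node for the Ho-style tree, which is exactly the difference I am asking about.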

The WEKA version of the Random Forest classifier seems to follow Breiman's concept, as far as I can tell, so it could be an alternative; however, the "Weight by Tree Importance" operator, which I would like to use, does not work with the WEKA version.

Thanks in advance.

Ollestrat


Answers

  • ollestrat Member Posts: 9 Contributor II
    One additional comment: I had a look at the Java code of the RandomForest learner:
    bootstrapping.setParameter(BootstrappingOperator.PARAMETER_SAMPLE_RATIO, "1.0")
    It seems there is bootstrapping implemented; however, the sample ratio is set to 1.0 and the RandomForest operator does not offer a parameter to adjust this ratio. Does this mean that all trees are always grown on the whole example set (thus no bootstrapping)?

    It would be great to get some help, as I am completely lost at the code level. Here is the relevant part of the class:
    public class RandomForestLearner extends RandomTreeLearner {

        /** The parameter name for the number of trees. */
        public static final String PARAMETER_NUMBER_OF_TREES = "number_of_trees";

        public RandomForestLearner(OperatorDescription description) {
            super(description);
        }

        @Override
        public Class<? extends PredictionModel> getModelClass() {
            return RandomForestModel.class;
        }

        @Override
        public Model learn(ExampleSet exampleSet) throws OperatorException {
            // set up the bootstrapping operator (sample ratio hard-coded to 1.0)
            BootstrappingOperator bootstrapping = null;
            try {
                bootstrapping = OperatorService.createOperator(BootstrappingOperator.class);
                bootstrapping.setParameter(BootstrappingOperator.PARAMETER_USE_WEIGHTS, "false");
                bootstrapping.setParameter(BootstrappingOperator.PARAMETER_SAMPLE_RATIO, "1.0");
            } catch (OperatorCreationException e) {
                throw new OperatorException(getName() + ": cannot construct random tree learner: " + e.getMessage());
            }

            // learn base models
            List<TreeModel> baseModels = new LinkedList<TreeModel>();
            int numberOfTrees = getParameterAsInt(PARAMETER_NUMBER_OF_TREES);

            for (int i = 0; i < numberOfTrees; i++) {
                TreeModel model = (TreeModel) super.learn(bootstrapping.apply(exampleSet));
                model.setSource(getName());
                baseModels.add(model);
            }

            // create and return model
            return new RandomForestModel(exampleSet, baseModels);
        }
    }
  • IngoRM Administrator Posts: 1,751 RM Founder
    Hello,

    I have forwarded your question to one of our developers; maybe they can tell us more.

    However, I can at least say something about the bootstrapping:

    It seems there is bootstrapping implemented; however, the sample ratio is set to 1.0 and the RandomForest operator does not offer a parameter to adjust this ratio. Does this mean that all trees are always grown on the whole example set (thus no bootstrapping)?
    No, for any reasonably sized data set that would only happen with a vanishingly small probability. Bootstrapping simply means sampling with replacement, where the sample size is most often the size of the original data set. If you use a sample ratio of 1 on a data set of n examples, you end up with n examples, but several of them may occur more than once. In fact, only about 63% of the original examples appear in the sample, since each example is missed with probability (1 - 1/n)^n ≈ 1/e ≈ 37%; the rest are not part of that sample (but will probably be part of the sample for another tree).
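
    If you want to convince yourself of the 63% figure, here is a small stand-alone snippet (plain Java, nothing RapidMiner-specific, purely for illustration) that draws one bootstrap sample with ratio 1.0 and counts how many distinct examples end up in it:

    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    // Simulates one bootstrap sample with ratio 1.0, i.e. n draws with
    // replacement from n examples, and reports the fraction of distinct
    // examples contained in the sample.
    public class BootstrapCoverage {
        public static void main(String[] args) {
            int n = 100_000;                  // size of the "example set"
            Random rng = new Random(1);
            Set<Integer> distinct = new HashSet<>();
            for (int i = 0; i < n; i++) {
                distinct.add(rng.nextInt(n)); // sampling with replacement
            }
            double observed = distinct.size() / (double) n;
            double expected = 1.0 - Math.pow(1.0 - 1.0 / n, n); // ≈ 1 - 1/e ≈ 0.632
            System.out.printf("observed %.3f, expected %.3f%n", observed, expected);
        }
    }

    Both numbers come out at roughly 0.63, so with a sample ratio of 1.0 each tree still sees only about 63% of the distinct examples.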


    I am not sure where the random attribute sets are applied (I think it is per node, but it might also be per tree). Maybe one of our developers can look it up (I would do this myself, but I am currently not in my office...)

    Cheers,
    Ingo