Why are the distributions in trees of Random Forests incorrect?

Friedemann (Member, University Professor), Posts: 27
I'm using a Random Forest to discover rules in a simple dataset. After computing the model, I check the trees to find leaves with high confidence. When I compare the record counts shown in the tree description with the data in the dataset, the numbers do not match. For instance, I have one attribute with a 50/50 distribution (greater than 0 and less than 0). The tree has the correct split value (0) but shows 10 more records in the left branch.

Any ideas? 

Best Answers

  • mschmitz (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor), Posts: 2,898, RM Data Scientist
    Solution Accepted
    Hi,
    Could you provide an example of this?

    Keep in mind that a Random Forest trains each tree on a bootstrapped sample of the original data set, which may explain the deviations.

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • mschmitz (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor), Posts: 2,898, RM Data Scientist
    edited February 24 Solution Accepted
    Of course. Two factors usually make a Random Forest random.

    First, each node 'sees' only a subset of all attributes and then takes the best split among them.

    Second, each tree is trained not on the full data set but on 90% of the original data set. This 90% is not a plain random subset but a bootstrapped sample: examples are drawn with replacement, so the same example can be picked twice or even three times.
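
    As a rough illustration of drawing with replacement, here is a small Python sketch. It is not RapidMiner's implementation; the function name `bootstrap_sample`, the 100-row data set, and the seed are assumptions made for the example, and the 0.9 ratio simply follows the figure mentioned above.

```python
import random
from collections import Counter


def bootstrap_sample(n_rows, ratio=0.9, seed=None):
    """Draw round(ratio * n_rows) row indices WITH replacement.

    Hypothetical helper for illustration only; RapidMiner's internal
    sampling may differ in its details.
    """
    rng = random.Random(seed)
    sample_size = round(ratio * n_rows)
    # randrange can return the same index again, so rows may repeat.
    return [rng.randrange(n_rows) for _ in range(sample_size)]


n_rows = 100
indices = bootstrap_sample(n_rows, ratio=0.9, seed=2001)

# Count how often each row index was drawn.
draws = Counter(indices)
repeated = sum(1 for count in draws.values() if count > 1)
never_drawn = n_rows - len(draws)

print(f"{len(indices)} rows drawn, {repeated} distinct rows drawn more than once, "
      f"{never_drawn} rows never drawn")
```

    Because some rows typically appear several times and others not at all, counts taken on such a sample need not match counts taken on the original data.
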
    Have a look at the following process, which generates a forest consisting only of "root nodes". You can see that each root node has a different distribution of yes/no. This is because of the bootstrapping.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.8.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.8.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.8.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="179" y="34">
            <parameter key="repository_entry" value="//Samples/data/Golf"/>
          </operator>
          <operator activated="true" class="concurrency:parallel_random_forest" compatibility="9.8.001" expanded="true" height="103" name="Random Forest" width="90" x="313" y="34">
            <parameter key="number_of_trees" value="10"/>
            <parameter key="criterion" value="gain_ratio"/>
            <parameter key="maximal_depth" value="1"/>
            <parameter key="apply_pruning" value="false"/>
            <parameter key="confidence" value="0.1"/>
            <parameter key="apply_prepruning" value="false"/>
            <parameter key="minimal_gain" value="0.01"/>
            <parameter key="minimal_leaf_size" value="2"/>
            <parameter key="minimal_size_for_split" value="4"/>
            <parameter key="number_of_prepruning_alternatives" value="3"/>
            <parameter key="random_splits" value="false"/>
            <parameter key="guess_subset_ratio" value="true"/>
            <parameter key="subset_ratio" value="0.2"/>
            <parameter key="voting_strategy" value="confidence vote"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="enable_parallel_execution" value="true"/>
          </operator>
          <connect from_op="Retrieve Golf" from_port="output" to_op="Random Forest" to_port="training set"/>
          <connect from_op="Random Forest" from_port="model" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
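
    The same effect can be sketched outside RapidMiner. The Python snippet below is only an approximation under stated assumptions: it takes the Golf label to be split 9 yes / 5 no, uses a 0.9 sample ratio, and the helper `root_distribution` is a made-up name. It will not reproduce the exact counts RapidMiner shows, only the fact that they vary from tree to tree.

```python
import random
from collections import Counter

# Assumed label column: 14 examples, 9 "yes" and 5 "no".
labels = ["yes"] * 9 + ["no"] * 5


def root_distribution(labels, ratio, rng):
    """Class counts a depth-0 tree would see on one bootstrapped sample."""
    size = round(ratio * len(labels))
    # random.Random.choices samples with replacement.
    sample = rng.choices(labels, k=size)
    return Counter(sample)


rng = random.Random(2001)
distributions = [root_distribution(labels, 0.9, rng) for _ in range(10)]

for tree_no, dist in enumerate(distributions, start=1):
    print(f"tree {tree_no}: yes={dist['yes']}, no={dist['no']}")
```

    Every simulated tree sees the same number of examples, yet the yes/no split usually changes from one tree to the next.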


    Best,
    Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany

Answers

  • Friedemann (Member, University Professor), Posts: 27
    Sure, I can send you the data and the process via private message. However, I am not sure I understand "bootstrapped" in this context. The trees of the Random Forest model indicate the very same number of records as the input dataset contains. Can you please elaborate a bit on "bootstrapped"?
  • Friedemann (Member, University Professor), Posts: 27
    Thanks for the clarification. Then why do the trees show the number of records of the original dataset in my case?