"Decision Tree (Parallel) randomness in numerical attribute's splits?"

mafern76 Member Posts: 45 Contributor II
edited June 14 in Help
Hi, how are you?

I've been using Decision Tree (Parallel) and I noticed a huge difference compared to the non-parallel version of the node.

Using exactly the same attributes, parameters, and sample, the numerical attributes get different splits every time I run the node again, while the non-parallel version always produces the same splits and exactly the same trees.

This presumably has something to do with the split search being processed in multiple threads, but what exactly is going on?
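One plausible mechanism (an assumption on my part, since the operator's source isn't shown here) is tie-breaking: when two candidate split points for a numerical attribute yield the same gain, a multi-threaded search that keeps "the best seen so far" will pick whichever tied candidate its threads happen to report first, and that order can change between runs. A minimal Python sketch of the idea, with a hypothetical candidate list and gains:

```python
# Sketch (NOT RapidMiner's actual code) of why a multi-threaded
# best-split search can be non-deterministic when gains tie.
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical candidate split points with their information gains;
# note the tie in gain between split points 2.45 and 2.50.
candidates = [(2.45, 0.33), (2.50, 0.33), (5.10, 0.12), (1.90, 0.07)]

def evaluate(candidate):
    # Stand-in for actually computing a split's gain on the data.
    return candidate

def best_split_parallel(cands, threads=2):
    """Keep the best gain seen so far, in thread-completion order.
    With a strict '>' comparison, a tie resolves to whichever tied
    candidate is reported first -- which can vary between runs."""
    best = (None, -1.0)
    with ThreadPoolExecutor(max_workers=threads) as ex:
        for fut in as_completed(ex.submit(evaluate, c) for c in cands):
            split, gain = fut.result()
            if gain > best[1]:
                best = (split, gain)
    return best

def best_split_sequential(cands):
    """Single-threaded scan in a fixed order: ties always resolve
    to the earliest candidate in the list, so the result is stable."""
    best = (None, -1.0)
    for split, gain in map(evaluate, cands):
        if gain > best[1]:
            best = (split, gain)
    return best

# The sequential version always returns (2.45, 0.33); the parallel
# version may return 2.45 or 2.50 depending on thread scheduling.
print(best_split_sequential(candidates))
print(best_split_parallel(candidates))
```

The same effect can also arise without exact ties, e.g. if per-thread partial statistics are combined in a scheduling-dependent order and floating-point addition is not associative. Again, whether either applies to this operator is a guess.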

Check the following process:

You can clearly see the difference between running a non-parallel and a parallel Decision Tree three times each. You can also change the number of threads to 1 and see that the trees become identical.

Thanks for your insight, best regards.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="5.3.015" expanded="true" height="60" name="Retrieve Iris" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/>
      <operator activated="true" class="loop" compatibility="5.3.015" expanded="true" height="76" name="LOOP DT PAR" width="90" x="380" y="120">
        <parameter key="iterations" value="3"/>
        <process expanded="true">
          <operator activated="true" class="parallel:decision_tree_parallel" compatibility="5.3.000" expanded="true" height="76" name="DT PAR" width="90" x="112" y="30">
            <parameter key="number_of_threads" value="2"/>
          </operator>
          <connect from_port="input 1" to_op="DT PAR" to_port="training set"/>
          <connect from_op="DT PAR" from_port="model" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="loop" compatibility="5.3.015" expanded="true" height="76" name="LOOP DT" width="90" x="380" y="30">
        <parameter key="iterations" value="3"/>
        <process expanded="true">
          <operator activated="true" class="decision_tree" compatibility="5.3.015" expanded="true" height="76" name="DT" width="90" x="179" y="30"/>
          <connect from_port="input 1" to_op="DT" to_port="training set"/>
          <connect from_op="DT" from_port="model" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="LOOP DT" to_port="input 1"/>
      <connect from_op="Multiply" from_port="output 2" to_op="LOOP DT PAR" to_port="input 1"/>
      <connect from_op="LOOP DT PAR" from_port="output 1" to_port="result 2"/>
      <connect from_op="LOOP DT" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Answers

• mafern76 Member Posts: 45 Contributor II
    I don't know if this helps, but when using Random Forest with subset ratio = 1, the same thing happens.

    From my understanding, diversity with this operator comes from using a subset of attributes, so a subset ratio of 1 should give every tree all the attributes and therefore produce identical trees.
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,049 RM Data Scientist
    Hi,

    We coded a new decision tree in version 6.3, so I cannot reproduce your process.
    It could be that this was a known issue that was fixed in v6.X.

    Cheers,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • mafern76mafern76 Member Posts: 45 Contributor II
    Hi Martin, thanks for your insight.

    We are looking forward to upgrading to 6.X as soon as we can afford it.

    Cheers!