Decision Tree (Parallel) randomness in numerical attribute's splits?
Hi, how are you?
I've been using Decision Tree (Parallel) and I noticed a huge difference compared with the non-parallel version of the node.
Using exactly the same attributes, parameters and sample, numerical attributes get different splits whenever I run the node again, while the non-parallel version will always produce the same splits and exactly the same trees.
This presumably has something to do with splitting the processing across multiple threads, but what is going on exactly?
Check the following process:
You can clearly see the difference when running a non-parallel and a parallel Decision Tree three times each. You can also change the number of threads to 1 and see how the trees become identical.
Thanks for your insight, best regards.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="5.3.015" expanded="true" height="60" name="Retrieve Iris" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/>
      <operator activated="true" class="loop" compatibility="5.3.015" expanded="true" height="76" name="LOOP DT PAR" width="90" x="380" y="120">
        <parameter key="iterations" value="3"/>
        <process expanded="true">
          <operator activated="true" class="parallel:decision_tree_parallel" compatibility="5.3.000" expanded="true" height="76" name="DT PAR" width="90" x="112" y="30">
            <parameter key="number_of_threads" value="2"/>
          </operator>
          <connect from_port="input 1" to_op="DT PAR" to_port="training set"/>
          <connect from_op="DT PAR" from_port="model" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="loop" compatibility="5.3.015" expanded="true" height="76" name="LOOP DT" width="90" x="380" y="30">
        <parameter key="iterations" value="3"/>
        <process expanded="true">
          <operator activated="true" class="decision_tree" compatibility="5.3.015" expanded="true" height="76" name="DT" width="90" x="179" y="30"/>
          <connect from_port="input 1" to_op="DT" to_port="training set"/>
          <connect from_op="DT" from_port="model" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="LOOP DT" to_port="input 1"/>
      <connect from_op="Multiply" from_port="output 2" to_op="LOOP DT PAR" to_port="input 1"/>
      <connect from_op="LOOP DT PAR" from_port="output 1" to_port="result 2"/>
      <connect from_op="LOOP DT" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
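One possible mechanism for this kind of run-to-run difference, not confirmed anywhere in this thread, is tie-breaking: if two numerical attributes offer equally good splits and each attribute is evaluated in its own thread, the winner may depend on which result arrives first. The Python sketch below is a minimal illustration under that assumption. The toy rows, the attribute names and the helpers `best_threshold` and `choose_split` are all hypothetical and are not RapidMiner code; the "arrival order" is simulated with a list instead of real threads so the example stays deterministic.

```python
from collections import Counter
from math import log2

# Hypothetical toy data (not the Iris sample): two numerical attributes
# that separate the two classes equally well.
ROWS = [
    # (a1, a2, label)
    (1.0, 0.2, "x"),
    (1.5, 0.3, "x"),
    (4.0, 1.4, "y"),
    (5.0, 1.8, "y"),
]
ATTRIBUTES = {"a1": 0, "a2": 1}  # attribute name -> column index


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(rows, col):
    """Return (gain, threshold) of the best binary split on one column."""
    base = entropy([r[-1] for r in rows])
    values = sorted({r[col] for r in rows})
    best = (0.0, None)
    # Candidate thresholds are midpoints between neighbouring distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r[-1] for r in rows if r[col] <= t]
        right = [r[-1] for r in rows if r[col] > t]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
        gain = base - remainder
        if gain > best[0]:
            best = (gain, t)
    return best


def choose_split(rows, arrival_order):
    """Keep the first strictly better result, in the order results 'arrive'.

    Assumption: ties are resolved in favour of whichever attribute was seen first.
    """
    winner = (0.0, None, None)  # (gain, attribute, threshold)
    for name in arrival_order:
        gain, t = best_threshold(rows, ATTRIBUTES[name])
        if gain > winner[0]:
            winner = (gain, name, t)
    return winner


# Same data and parameters; only the simulated arrival order differs.
run_1 = choose_split(ROWS, ["a1", "a2"])
run_2 = choose_split(ROWS, ["a2", "a1"])
print(run_1)
print(run_2)
```

Both calls report the same gain but pick a different attribute for the split. With a single thread the order would be fixed, which would be consistent with the trees becoming identical when the number of threads is set to 1.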
Answers
From my understanding, diversity with this operator comes from using a subset of attributes, so using a subset ratio of 1 should give every tree all attributes and therefore produce identical trees.
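The reasoning about the subset ratio can be sketched in a few lines of Python. This is only an illustration, assuming the operator draws a random attribute subset per tree as described above; `sample_attributes` and the attribute names are hypothetical and do not reflect RapidMiner's actual implementation.

```python
import random


def sample_attributes(attributes, subset_ratio, rng):
    """Hypothetical helper: draw round(ratio * n) attributes, kept in original order."""
    k = max(1, round(subset_ratio * len(attributes)))
    chosen = set(rng.sample(range(len(attributes)), k))
    return [a for i, a in enumerate(attributes) if i in chosen]


attrs = ["a1", "a2", "a3", "a4"]  # placeholder attribute names

# With ratio 1.0 every seed yields the full attribute list.
full = [sample_attributes(attrs, 1.0, random.Random(seed)) for seed in range(5)]
# With ratio 0.5 each draw contains two attributes, which may differ per seed.
half = [sample_attributes(attrs, 0.5, random.Random(seed)) for seed in range(5)]

print(full)
print(half)
```

Under this assumption, a ratio of 1 removes attribute sampling as a source of randomness, so any remaining differences between runs would have to come from somewhere else.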
We coded a new decision tree in version 6.3, so I cannot reproduce this with your process.
It could be that this was a known issue that was fixed in v6.X.
Cheers,
Martin
Dortmund, Germany
We are looking forward to upgrading to 6.X whenever we can afford it.
Cheers!