"Compare Weka and RM Random Forest"

MuehliMan · January 2011

Hi,

During my last calculations I did some runs with the Random Forest and the Weka Random Forest operators using various options. Although I tried to make both operators equivalent (same number of trees, same local random seed, no minimal split features, etc.), the performance of the two operators was still different. Here is the workflow I used for benchmarking the two operators.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.001" expanded="true" name="Process">
<process expanded="true" height="269" width="949">
<operator activated="true" class="generate_data" compatibility="5.1.001" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
<parameter key="number_of_attributes" value="20"/>
</operator>
<operator activated="true" class="discretize_by_user_specification" compatibility="5.1.001" expanded="true" height="94" name="Discretize" width="90" x="179" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="label"/>
<parameter key="include_special_attributes" value="true"/>
<list key="classes">
<parameter key="first" value="0.5"/>
<parameter key="last" value="Infinity"/>
</list>
</operator>
<operator activated="true" class="nominal_to_binominal" compatibility="5.1.001" expanded="true" height="94" name="Nominal to Binominal" width="90" x="313" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="label"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="multiply" compatibility="5.1.001" expanded="true" height="94" name="Multiply" width="90" x="447" y="30"/>
<operator activated="true" class="weka:W-RandomForest" compatibility="5.0.001" expanded="true" height="76" name="W-RandomForest" width="90" x="581" y="120">
<parameter key="S" value="1992.0"/>
<parameter key="depth" value="5"/>
<parameter key="D" value="true"/>
</operator>
<operator activated="true" class="apply_model" compatibility="5.1.001" expanded="true" height="76" name="Apply Model (2)" width="90" x="715" y="120">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_binominal_classification" compatibility="5.1.001" expanded="true" height="76" name="Weka Random Forest" width="90" x="849" y="120">
<parameter key="f_measure" value="true"/>
<parameter key="youden" value="true"/>
</operator>
<operator activated="true" class="random_forest" compatibility="5.1.001" expanded="true" height="76" name="Random Forest" width="90" x="581" y="30">
<parameter key="minimal_size_for_split" value="1"/>
<parameter key="minimal_leaf_size" value="1"/>
<parameter key="minimal_gain" value="0.0"/>
<parameter key="maximal_depth" value="5"/>
<parameter key="confidence" value="1.0E-7"/>
<parameter key="no_pre_pruning" value="true"/>
<parameter key="no_pruning" value="true"/>
<parameter key="use_local_random_seed" value="true"/>
</operator>
<operator activated="true" class="apply_model" compatibility="5.1.001" expanded="true" height="76" name="Apply Model" width="90" x="715" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_binominal_classification" compatibility="5.1.001" expanded="true" height="76" name="RM Random Forest" width="90" x="849" y="30">
<parameter key="f_measure" value="true"/>
<parameter key="youden" value="true"/>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Discretize" to_port="example set input"/>
<connect from_op="Discretize" from_port="example set output" to_op="Nominal to Binominal" to_port="example set input"/>
<connect from_op="Nominal to Binominal" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Random Forest" to_port="training set"/>
<connect from_op="Multiply" from_port="output 2" to_op="W-RandomForest" to_port="training set"/>
<connect from_op="W-RandomForest" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="W-RandomForest" from_port="exampleSet" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Weka Random Forest" to_port="labelled data"/>
<connect from_op="Weka Random Forest" from_port="performance" to_port="result 2"/>
<connect from_op="Random Forest" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Random Forest" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="RM Random Forest" to_port="labelled data"/>
<connect from_op="RM Random Forest" from_port="performance" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>

Is RM using a different implementation of Random Forest, and if so, what are the differences?

Best regards,
Markus

land · January 2011

Hi,
this is a completely different implementation of the same algorithm. That means there are many small differences in the way values are computed and stored, causing, for example, small rounding deviations. In the end, the results may look different.
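
As a small illustration of such rounding deviations (a generic floating-point sketch, not code taken from either implementation), adding the same three numbers in a different order already gives two slightly different doubles. The helper name `sum_left_to_right` is made up for this example.

```python
def sum_left_to_right(xs):
    """Add the values one by one, in the order given."""
    total = 0.0
    for x in xs:
        total += x
    return total

values = [0.1, 0.2, 0.3]

forward = sum_left_to_right(values)                   # (0.1 + 0.2) + 0.3
backward = sum_left_to_right(list(reversed(values)))  # (0.3 + 0.2) + 0.1

print(forward == backward)       # the two sums are not bit-identical
print(abs(forward - backward))   # the gap is on the order of 1e-16
```

A gap this small could at most flip a tie between two nearly equal candidate splits; how often that happens in either operator is not something this sketch can show.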

Greetings,
Sebastian

MuehliMan · January 2011

Hi Sebastian,

if that is the case, why do I get totally different values for the performance (given by accuracy, for example)?

Weka's accuracy: 100%
RM's accuracy: 61%

(obtained with the posted workflow)

Best,
Markus

haddock · January 2011

Hi folks,

Mmm... if a model gets 100% accuracy when applied to its own training data, especially when that data is random, then one should suspect over-fitting and low predictive power, so RapidMiner wins again. Models are only good if they work equally well on unseen data, as this link explains:

http://en.wikipedia.org/wiki/Overfitting
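
To make the point concrete, here is a minimal sketch in plain Python. It does not use a random forest: it swaps in a 1-nearest-neighbour "memorizer" as a stand-in for any model that can fit its training rows perfectly. The data (20 random attributes, labels independent of the attributes) and all function names are assumptions made for illustration, not the output of the Generate Data operator.

```python
import random

def make_random_data(n_rows, n_features, rng):
    """Random attributes with labels that do not depend on them (assumption)."""
    X = [[rng.random() for _ in range(n_features)] for _ in range(n_rows)]
    y = [rng.random() < 0.5 for _ in range(n_rows)]
    return X, y

def predict_1nn(train_X, train_y, x):
    """Return the label of the closest training row (squared Euclidean distance)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]

def accuracy(train_X, train_y, X, y):
    """Fraction of rows in (X, y) that the memorizer labels correctly."""
    hits = sum(predict_1nn(train_X, train_y, x) == label for x, label in zip(X, y))
    return hits / len(y)

rng = random.Random(1992)
train_X, train_y = make_random_data(200, 20, rng)
test_X, test_y = make_random_data(200, 20, rng)

# Scored on its own training rows, every row finds itself as nearest neighbour.
train_acc = accuracy(train_X, train_y, train_X, train_y)
# Scored on fresh rows, the labels carry no signal, so accuracy is near chance.
test_acc = accuracy(train_X, train_y, test_X, test_y)

print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {test_acc:.2f}")
```

The posted workflow applies each model to the same example set it was trained on, so both reported accuracies are training accuracies in this sense.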

Good weekend to all!

MuehliMan · January 2011

I am not interested in who wins or loses, but in the reason why the results are different. This was a benchmarking example more than a real-use problem.
As far as I know, Breiman included bootstrapping as validation in the random forest to avoid overfitting.
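
For reference, a small sketch of the bootstrapping idea in plain Python: each tree is trained on rows drawn with replacement, and the rows a tree never saw (its "out-of-bag" rows) can be used to estimate its error. The function name `out_of_bag_fraction` and the row count are made up for this example, and nothing here says how either operator implements or reports this.

```python
import math
import random

def out_of_bag_fraction(n_rows, rng):
    """Draw one bootstrap sample of size n_rows and return the share of rows left out."""
    in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
    return 1 - len(in_bag) / n_rows

rng = random.Random(1992)
fraction = out_of_bag_fraction(10_000, rng)

print(f"out-of-bag fraction: {fraction:.3f}")
print(f"limit 1/e:           {math.exp(-1):.3f}")
```

Roughly a third of the rows are unseen by any single tree, but applying the finished forest to the full training set is not the out-of-bag estimate, because every row was in the bag for most of the trees.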

BTW: Enabling or disabling the two pruning options does not change the results.

Nice weekend and a happy new year to you too!

Cheers,
Markus
