Behavior of the Process random seed (and RNG)

KrafN Member Posts: 13 Contributor II
edited November 2018 in Help
Hello everyone,

I have a process in which I train a Random Forest to do some classification for me and I set the "use local random seed" flag to false and the Process random seed to a certain value.

So when I set the flag to true and set the local random seed of the Random Forest Operator to the same value to which I had set the Process random seed before, I notice that the process results for both cases are different.

I get yet another result when I embed the Random Forest inside a Subprocess Operator having the "use local random seed" flag of the Random Forest set to false.

Is this behavior intended? Does the Random Forest even use the Process random seed when its local seed flag is set to false? And if so, does the random number generator work differently for different Operators, even if the same random seeds are used?

I am a little puzzled here. So thanks in advance to anyone who can enlighten me!

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    the first thing to know is that each random seed generates a fixed sequence of random numbers.
    The process random seed is shared by all operators. So if you first generate data, for example, the first numbers of the sequence are consumed, and a Random Forest applied afterwards receives different numbers than it would if it started the same sequence locally.
    So you have to look at every operator in the process that uses random numbers to determine whether the consumed part of the sequence is really the same.

    Greetings,
      Sebastian
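
The point about consuming a shared sequence can be sketched in a few lines of Python. This is only an illustration: Python's `random.Random` stands in for RapidMiner's generator, whose internals the thread does not describe, and the seed value and draw counts are hypothetical.

```python
import random

SEED = 1992  # hypothetical seed value, chosen only for this illustration


def draw(rng, n):
    """Consume n numbers from the given generator and return them."""
    return [rng.random() for _ in range(n)]


# Case 1: the learner owns a generator that is seeded locally.
local_rng = random.Random(SEED)
local_numbers = draw(local_rng, 3)

# Case 2: one process-wide generator with the same seed is shared.
# An upstream step (for example a data generator) consumes numbers first.
process_rng = random.Random(SEED)
upstream_numbers = draw(process_rng, 5)
shared_numbers = draw(process_rng, 3)

# Same seed, but the learner now sees a later part of the sequence.
print(local_numbers == shared_numbers)        # False
# The shared stream started out identical to the local one.
print(upstream_numbers[:3] == local_numbers)  # True
```

Under these assumptions, identical seed values only give identical results if the learner also starts at the same position in the sequence.
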
  • KrafN Member Posts: 13 Contributor II
    I have investigated a little further and this is what I observed:

    I set every operator that uses random numbers (including my Random Forest) to a local seed value. I then varied the seed value of one of my sampling operators, and the observed performance criterion hardly reacted to this.

    Next, I set the random forest not to use a local seed, which to my understanding should mean that it uses the random numbers generated from the process seed. Since all other operators are set to local seeds, my expectation is that the forest should use the same random numbers each run, which should amount to the same process behavior as in the case with a local constant random seed in the random forest operator.

    Running this setup, however, yields different performance ratings that vary between 3 mean values. This is the same behavior I get when I set the Random Forest to a local seed and observe how the performance reacts to a variation of that local seed.

    So what does this mean? It looks like the forest is not using the same numbers each run at all. But shouldn't it?
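
The expectation described above can be written down as a toy model. It is a sketch of the expected behaviour only, not of what RapidMiner actually does (the thread reports that the real results differ). `run_process`, the seed values and the draw counts are hypothetical, and the model assumes the process-wide generator is created fresh at the start of every run.

```python
import random


def run_process(process_seed, sampling_seed):
    """Toy process: a sampling step with a local seed, followed by a
    learner that reads from the process-wide generator."""
    process_rng = random.Random(process_seed)    # assumed: fresh for every run
    sampling_rng = random.Random(sampling_seed)  # local seed: separate stream
    sample = [sampling_rng.random() for _ in range(4)]
    learner_draws = [process_rng.random() for _ in range(3)]
    return sample, learner_draws


_, first_run = run_process(process_seed=2001, sampling_seed=1993)
_, second_run = run_process(process_seed=2001, sampling_seed=1993)
_, other_sampling = run_process(process_seed=2001, sampling_seed=1995)

# A learner with a local seed equal to the process seed.
local_rng = random.Random(2001)
local_equivalent = [local_rng.random() for _ in range(3)]

print(first_run == second_run)        # True: repeated runs agree
print(first_run == other_sampling)    # True: the sampling seed has no effect
print(first_run == local_equivalent)  # True: same as a local seed of 2001
```

In this model the learner is the only consumer of the process stream, so all three comparisons hold. A real process that behaves differently must deviate from at least one of these assumptions.
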
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    well, at least it is called 'Random' Forest :) But I guess it should not be that random...
    Please add a bug to our bug tracker and attach a process that illustrates the problem and is completely independent of your data. (Replace your data with Data Generators, but be careful: they use random numbers too :) )

    Greetings,
      Sebastian
  • KrafN Member Posts: 13 Contributor II
    I have created a bug report (including a process to reproduce the problem) here: http://bugs.rapid-i.com/show_bug.cgi?id=375

    In the report I have added a comment on another strange effect when working with the Random Forest. Please look into this soon.
  • haddock Member Posts: 849 Maven
    Hi there,

    All this seemed pretty strange so I wrapped up your process in a parameter iteration, and logged the results. From those results a fairly concise rule was induced...

    Random_Forest_Local_Seed = false
    |       Sample_True_Local_Seed = false: [0.7 - ∞] {[-∞ - 0.6]=0, [0.6 - 0.7]=1, [0.7 - ∞]=3}
    |       Sample_True_Local_Seed = true  
    |       |        Test_Sample_Local_Seed = false: [0.7 - ∞] {[-∞ - 0.6]=0, [0.6 - 0.7]=0, [0.7 - ∞]=2}
    |       |        Test_Sample_Local_Seed = true: [-∞ - 0.6] {[-∞ - 0.6]=2, [0.6 - 0.7]=0, [0.7 - ∞]=0}
    Random_Forest_Local_Seed = true: [-∞ - 0.6] {[-∞ - 0.6]=8, [0.6 - 0.7]=0, [0.7 - ∞]=0}

    This indeed says that turning on local random seeding decreases accuracy in this setup. That seems counter-intuitive to me, but what do I know? On the plus side, the behaviour is consistent, so this may not actually be a bug.

    Here's the code..
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
       <process expanded="true" height="602" width="835">
         <operator activated="true" class="generate_sales_data" compatibility="5.0.8" expanded="true" height="60" name="Generate Sales Data" width="90" x="45" y="30">
           <parameter key="number_examples" value="500000"/>
           <parameter key="use_local_random_seed" value="true"/>
         </operator>
         <operator activated="true" class="generate_attributes" compatibility="5.0.8" expanded="true" height="76" name="Generate Attributes" width="90" x="45" y="120">
           <list key="function_descriptions">
             <parameter key="label" value="((amount&gt;3)&amp;&amp;(single_price&gt;50)) || (product_id&gt;80000)"/>
           </list>
         </operator>
         <operator activated="true" class="set_role" compatibility="5.0.8" expanded="true" height="76" name="Set Role" width="90" x="45" y="255">
           <parameter key="name" value="label"/>
           <parameter key="target_role" value="label"/>
         </operator>
         <operator activated="true" class="numerical_to_binominal" compatibility="5.0.8" expanded="true" height="76" name="Numerical to Binominal" width="90" x="45" y="390">
           <parameter key="attribute_filter_type" value="single"/>
           <parameter key="attribute" value="label"/>
           <parameter key="include_special_attributes" value="true"/>
         </operator>
         <operator activated="true" class="loop_parameters" compatibility="5.0.8" expanded="true" height="76" name="Loop Parameters" width="90" x="246" y="30">
           <list key="parameters">
             <parameter key="Sample false.use_local_random_seed" value="true,false"/>
             <parameter key="Sample true.use_local_random_seed" value="true,false"/>
             <parameter key="Test.use_local_random_seed" value="true,false"/>
             <parameter key="Random Forest.use_local_random_seed" value="true,false"/>
           </list>
           <process expanded="true" height="451" width="701">
             <operator activated="true" class="multiply" compatibility="5.0.8" expanded="true" height="112" name="Multiply" width="90" x="45" y="30"/>
             <operator activated="true" class="sample_stratified" compatibility="5.0.8" expanded="true" height="76" name="Test" width="90" x="179" y="300">
               <parameter key="sample" value="relative"/>
               <parameter key="sample_ratio" value="1.0"/>
               <parameter key="local_random_seed" value="1995"/>
             </operator>
             <operator activated="true" class="filter_examples" compatibility="5.0.8" expanded="true" height="76" name="Filter Examples (2)" width="90" x="179" y="165">
               <parameter key="condition_class" value="attribute_value_filter"/>
               <parameter key="parameter_string" value="label=false"/>
             </operator>
             <operator activated="true" class="sample_stratified" compatibility="5.0.8" expanded="true" height="76" name="Sample false" width="90" x="313" y="165">
               <parameter key="sample" value="relative"/>
               <parameter key="local_random_seed" value="1993"/>
             </operator>
             <operator activated="true" class="filter_examples" compatibility="5.0.8" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="30">
               <parameter key="condition_class" value="attribute_value_filter"/>
               <parameter key="parameter_string" value="label=true"/>
             </operator>
             <operator activated="true" class="sample_stratified" compatibility="5.0.8" expanded="true" height="76" name="Sample true" width="90" x="313" y="30">
               <parameter key="sample" value="relative"/>
               <parameter key="local_random_seed" value="583651"/>
             </operator>
             <operator activated="true" class="append" compatibility="5.0.8" expanded="true" height="94" name="Append" width="90" x="447" y="30"/>
             <operator activated="true" class="random_forest" compatibility="5.0.8" expanded="true" height="76" name="Random Forest" width="90" x="447" y="210">
               <parameter key="maximal_depth" value="10"/>
               <parameter key="local_random_seed" value="5463"/>
             </operator>
             <operator activated="true" class="apply_model" compatibility="5.0.8" expanded="true" height="76" name="Apply Model" width="90" x="313" y="345">
               <list key="application_parameters"/>
             </operator>
             <operator activated="true" class="performance_binominal_classification" compatibility="5.0.8" expanded="true" height="76" name="Performance" width="90" x="447" y="345">
               <parameter key="precision" value="true"/>
               <parameter key="recall" value="true"/>
             </operator>
             <operator activated="true" class="log" compatibility="5.0.8" expanded="true" height="76" name="Log" width="90" x="581" y="345">
               <list key="log">
                 <parameter key="Sample_False_Local_Seed" value="operator.Sample false.parameter.use_local_random_seed"/>
                 <parameter key="Sample_True_Local_Seed" value="operator.Sample true.parameter.use_local_random_seed"/>
                 <parameter key="Test_Sample_Local_Seed" value="operator.Test.parameter.use_local_random_seed"/>
                 <parameter key="Random_Forest_Local_Seed" value="operator.Random Forest.parameter.use_local_random_seed"/>
                 <parameter key="Accuracy" value="operator.Performance.value.accuracy"/>
               </list>
             </operator>
             <connect from_port="input 1" to_op="Multiply" to_port="input"/>
             <connect from_op="Multiply" from_port="output 1" to_op="Filter Examples" to_port="example set input"/>
             <connect from_op="Multiply" from_port="output 2" to_op="Filter Examples (2)" to_port="example set input"/>
             <connect from_op="Multiply" from_port="output 3" to_op="Test" to_port="example set input"/>
             <connect from_op="Test" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
             <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample false" to_port="example set input"/>
             <connect from_op="Sample false" from_port="example set output" to_op="Append" to_port="example set 2"/>
             <connect from_op="Filter Examples" from_port="example set output" to_op="Sample true" to_port="example set input"/>
             <connect from_op="Sample true" from_port="example set output" to_op="Append" to_port="example set 1"/>
             <connect from_op="Append" from_port="merged set" to_op="Random Forest" to_port="training set"/>
             <connect from_op="Random Forest" from_port="model" to_op="Apply Model" to_port="model"/>
             <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
             <connect from_op="Performance" from_port="performance" to_op="Log" to_port="through 1"/>
             <connect from_op="Log" from_port="through 1" to_port="result 1"/>
             <portSpacing port="source_input 1" spacing="0"/>
             <portSpacing port="source_input 2" spacing="0"/>
             <portSpacing port="sink_performance" spacing="0"/>
             <portSpacing port="sink_result 1" spacing="0"/>
             <portSpacing port="sink_result 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="log_to_data" compatibility="5.0.8" expanded="true" height="94" name="Log to Data (2)" width="90" x="246" y="165"/>
         <operator activated="true" class="select_attributes" compatibility="5.0.8" expanded="true" height="76" name="Select Attributes" width="90" x="246" y="300">
           <parameter key="attribute_filter_type" value="regular_expression"/>
           <parameter key="regular_expression" value=".*V"/>
           <parameter key="invert_selection" value="true"/>
         </operator>
         <operator activated="true" class="set_role" compatibility="5.0.8" expanded="true" height="76" name="Set Role (2)" width="90" x="447" y="30">
           <parameter key="name" value="Accuracy"/>
           <parameter key="target_role" value="label"/>
         </operator>
         <operator activated="true" class="discretize_by_bins" compatibility="5.0.8" expanded="true" height="94" name="Discretize" width="90" x="447" y="210">
           <parameter key="create_view" value="true"/>
           <parameter key="include_special_attributes" value="true"/>
           <parameter key="number_of_bins" value="3"/>
           <parameter key="range_name_type" value="interval"/>
         </operator>
         <operator activated="true" class="decision_tree" compatibility="5.0.8" expanded="true" height="76" name="Decision Tree" width="90" x="648" y="120"/>
         <connect from_op="Generate Sales Data" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
         <connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
         <connect from_op="Set Role" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
         <connect from_op="Numerical to Binominal" from_port="example set output" to_op="Loop Parameters" to_port="input 1"/>
         <connect from_op="Loop Parameters" from_port="result 1" to_op="Log to Data (2)" to_port="through 1"/>
         <connect from_op="Log to Data (2)" from_port="exampleSet" to_op="Select Attributes" to_port="example set input"/>
         <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
         <connect from_op="Set Role (2)" from_port="example set output" to_op="Discretize" to_port="example set input"/>
         <connect from_op="Discretize" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
         <connect from_op="Decision Tree" from_port="model" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="252"/>
       </process>
     </operator>
    </process>
  • KrafN Member Posts: 13 Contributor II
    Thank you for your contribution, haddock.

    It was not actually my intention to point out that local random seeding would decrease accuracy.

    If you turn on the local random seed in the Random Forest and vary its value, the accuracy seems to actually fluctuate between several mean values.

    For instance, take the setup I provided in my bug report, set the Sample (3) operator to a sample ratio of 0.1 and a local seed of 1991, and set the Sample true and Sample false operators to local seeds of 1993. If I then vary the local seed value of the Random Forest operator (using 10 trees) among the first fifty prime numbers, I get accuracy values around 0.51, 0.71, 0.756, 0.77 and 0.857, and only those.
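
For anyone who wants to repeat the seed sweep, the list of seed values can be generated as below. This is a minimal sketch; `first_primes` is a hypothetical helper and only produces the seeds, it does not run the RapidMiner process.

```python
def first_primes(count):
    """Return the first `count` prime numbers using trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        # candidate is prime if no smaller prime up to its square root divides it
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


seeds = first_primes(50)
print(seeds[:5], seeds[-1])  # [2, 3, 5, 7, 11] 229
```

Each value in `seeds` would then be entered as the local random seed of the Random Forest operator for one run.
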
  • haddock Member Posts: 849 Maven
    Hi again,

    As I remember, the random part of the forest is the selection of attributes considered when building the trees, so with only seven attributes to pick from in this case, maybe only a limited number of performance possibilities show up. I'm still pondering why using a local seed appears to impair Random Forest performance; it is distinctly odd.

    Ciao !
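
To get a feeling for how limited the choice is with seven attributes, one can count the possible attribute subsets. The sketch assumes, purely for illustration, that a tree is restricted to a random subset of `k` attributes; the thread does not say which subset size RapidMiner uses.

```python
from math import comb

TOTAL_ATTRIBUTES = 7  # attribute count mentioned in the thread

# Number of distinct attribute subsets for every possible subset size k
# (k is an assumption; the actual subset size is not given in the thread).
subset_counts = {
    k: comb(TOTAL_ATTRIBUTES, k) for k in range(1, TOTAL_ATTRIBUTES + 1)
}

for k, count in subset_counts.items():
    print(f"subset size {k}: {count} possible attribute subsets")
```

Even the largest case (subset sizes 3 and 4) allows only 35 different subsets, so many seeds would lead to trees built from the same attributes.
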

  • KrafN Member Posts: 13 Contributor II
    With 100 trees there should not be any significant performance fluctuation at all!
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    well, this does seem strange. But with only a few attributes, the trees cannot grow very differently, which might explain the small number of distinct performance values. The drop in performance with a local random seed is probably just the result of an unlucky seed: each fold of the CrossValidation is now learned with the same random sequence and hence the same attributes. If this attribute sequence does not fit the data: bad luck. If it does, it will probably result in better performance.
    Nevertheless I will take a look as soon as possible, but this might take some time...

    Greetings,
      Sebastian
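
The effect of restarting the same sequence in every fold can be sketched as follows. The attribute names, the subset size of 3, the seed and the number of folds are hypothetical, and Python's `random.Random` only stands in for RapidMiner's generator.

```python
import random

ATTRIBUTES = ["a1", "a2", "a3", "a4", "a5", "a6", "a7"]  # hypothetical names
FOLDS = 5
SEED = 1992  # hypothetical seed value


def pick_attributes(rng, k=3):
    """Draw k attributes at random, as a stand-in for one tree's choice."""
    return sorted(rng.sample(ATTRIBUTES, k))


# Local seed: the generator is re-created with the same seed in every fold,
# so every fold starts the sequence from the beginning.
local_picks = [pick_attributes(random.Random(SEED)) for _ in range(FOLDS)]

# Shared generator: one sequence that continues from fold to fold.
shared_rng = random.Random(SEED)
shared_picks = [pick_attributes(shared_rng) for _ in range(FOLDS)]

print(local_picks)   # the same attribute subset in every fold
print(shared_picks)  # first fold matches the local case, later folds continue the sequence
```

With the local seed, a single unlucky attribute choice is repeated in every fold, which matches the "bad luck" explanation above.
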
  • KrafN Member Posts: 13 Contributor II
    I have found out one more thing that might help:

    When the RapidMiner Random Forest is replaced by a Weka Random Forest, there are no performance fluctuations at all. The performance doesn't seem to have as high a peak value as the RapidMiner Random Forest though so I'd rather use the RapidMiner one...
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I don't know how the Weka one is implemented. It might be that it always uses a local random seed...

    Greetings,
      Sebastian