Behavior of the Process random seed (and RNG)
Hello everyone,
I have a process in which I train a Random Forest to do some classification for me and I set the "use local random seed" flag to false and the Process random seed to a certain value.
So when I set the flag to true and set the local random seed of the Random Forest Operator to the same value to which I had set the Process random seed before, I notice that the process results for both cases are different.
I get yet another result when I embed the Random Forest inside a Subprocess Operator having the "use local random seed" flag of the Random Forest set to false.
Is this behavior intended? Does the Random Forest even use the Process random seed when its local seed flag is set to false? And if so, does the random number generator work differently for different Operators, even if the same random seeds are used?
I am a little puzzled here. So thanks in advance to anyone who can enlighten me!
Answers
The first thing to know is that each random seed generates a fixed sequence of random numbers.
The process random seed is used by every operator. So if, for example, you generate data first, the first numbers of the sequence are consumed, and a Random Forest applied afterwards receives different numbers than it would if it started the same sequence locally.
So you have to look at each operator in the process that uses random numbers to determine whether the part of the sequence already consumed is really the same.
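The consumption effect described above can be sketched outside RapidMiner. This is a minimal Python illustration, not RapidMiner's actual implementation: `generate_data` and `pick_attributes` are hypothetical stand-ins for an upstream operator and a learner, and the seed value is arbitrary.

```python
import random

SEED = 1992  # arbitrary value, standing in for the process random seed


def generate_data(rng, n):
    """Hypothetical upstream operator: draws n numbers from the generator."""
    return [rng.random() for _ in range(n)]


def pick_attributes(rng, n_attributes, k):
    """Hypothetical learner step: randomly picks k attribute indices."""
    return rng.sample(range(n_attributes), k)


# Case 1: one process-wide generator shared by all operators.
# The upstream operator consumes the first numbers of the sequence.
process_rng = random.Random(SEED)
generate_data(process_rng, 5)
shared_pick = pick_attributes(process_rng, 1000, 5)

# Case 2: the learner starts its own sequence from the same seed value.
local_rng = random.Random(SEED)
local_pick = pick_attributes(local_rng, 1000, 5)

# Case 3: a process-wide generator with nothing running before the learner.
fresh_rng = random.Random(SEED)
fresh_pick = pick_attributes(fresh_rng, 1000, 5)

print("shared generator, after data generation:", shared_pick)
print("local generator, same seed value:       ", local_pick)
print("shared generator, nothing upstream:     ", fresh_pick)
```

In this sketch the local pick matches the shared pick only when nothing upstream has drawn from the shared sequence first, which is the check suggested for the real process.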
Greetings,
Sebastian
I set every operator that uses random numbers to a local seed value (including my Random Forest). I then varied the seed value of one of my sampling operators, and the observed performance criterion showed almost no reaction.
Next, I set the Random Forest not to use a local seed, which to my understanding should mean that it uses the random numbers generated from the process seed. Since all other operators are set to local seeds, my expectation is that the forest should use the same random numbers in each run, which should amount to the same process behavior as with a constant local random seed in the Random Forest operator.
Running this setup, however, yields different performance values that vary between three mean values. This is the same behavior I get when I set the Random Forest to a local seed and observe how the performance reacts to a variation of that local seed.
So what does this mean? It looks like the forest is not using the same numbers each run at all. But shouldn't it?
Well, at least it is called 'Random' Forest. But I guess it should not be that random...
Please file a bug in our bug tracker and attach a process that illustrates the problem and is completely independent of your data. (Replace your data with Data Generators, but be careful: they use random numbers, too.)
Greetings,
Sebastian
In the report I have added a comment on another strange effect when working with the Random Forest. Please look into this soon.
All this seemed pretty strange, so I wrapped your process in a parameter iteration and logged the results. From those results a fairly concise rule was induced, which indeed says that turning on local random seeding decreases accuracy in this setup. That seems counter-intuitive to me, but what do I know? On the plus side, the behaviour is consistent, so this may not actually be a bug.
Here's the code..
It was not actually my intention to point out that local random seeding would decrease accuracy.
If you turn on the local random seed in the Random Forest and vary its value, the accuracy seems to actually fluctuate between several mean values.
For instance, I take the setup I provided in my bug report, set the Sample (3) operator to a sample ratio of 0.1 and a local seed of 1991, and set the Sample true and Sample false operators to local seeds of 1993. When I then vary the local seed of the Random Forest operator (using 10 trees) among the first fifty prime numbers, I get accuracy values around 0.51, 0.71, 0.756, 0.77 and 0.857, and only those.
As I remember, the random bit about the forests is the number of attributes considered in making the trees, so with only seven attributes to pick from in this case maybe only a limited number of performance possibilities show. I'm still pondering why using a local seed appears to impair random forest performance, distinctly odd.
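To put a rough number on "a limited number of possibilities": with seven attributes, the count of distinct attribute subsets of a given size is small. The sketch below only counts subsets. It assumes a hypothetical learner that draws a subset of fixed size `k`; it does not model how RapidMiner's Random Forest actually selects attributes or how many accuracy values result.

```python
import math
from itertools import combinations

N_ATTRIBUTES = 7  # the number of attributes mentioned in this thread

# Number of distinct attribute subsets for each subset size k.
subset_counts = {k: math.comb(N_ATTRIBUTES, k) for k in range(1, N_ATTRIBUTES + 1)}

for k, count in subset_counts.items():
    print(f"subsets of size {k}: {count}")

# Cross-check one entry by explicit enumeration.
size_three_subsets = list(combinations(range(N_ATTRIBUTES), 3))
print("enumerated subsets of size 3:", len(size_three_subsets))
```

No single subset size allows more than 35 distinct choices here, so many different seeds necessarily end up selecting the same attributes.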
Ciao !
Well, this does seem strange. But with only a few attributes the trees cannot grow very differently, which might explain the small number of distinct performance values. That the performance decreases with a local random seed is probably just the result of a bad random seed: each fold of the CrossValidation is now learned with the same random number sequence and hence the same attributes. If this attribute sequence does not fit the data: bad luck. If it does, it will probably result in better performance.
Nevertheless I will take a look as soon as possible, but this might take some time...
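The point about every fold reusing the same random sequence can be sketched in Python. This is only an illustration of the seeding effect, not RapidMiner code: `attribute_subset`, the subset size, the fold count, and the seed value are all assumptions made for the example.

```python
import random

N_ATTRIBUTES = 7  # as in this thread
K = 3             # hypothetical number of attributes drawn per fold
N_FOLDS = 5       # hypothetical number of cross-validation folds
SEED = 1992       # arbitrary seed value


def attribute_subset(rng):
    """Hypothetical learner step: draw K attribute indices from the generator."""
    return tuple(sorted(rng.sample(range(N_ATTRIBUTES), K)))


def subsets_with_local_seed(seed):
    # The learner restarts its own sequence in every fold,
    # so every fold draws exactly the same attributes.
    return [attribute_subset(random.Random(seed)) for _ in range(N_FOLDS)]


def subsets_with_shared_rng(seed):
    # One generator is shared across folds, so each fold
    # continues the sequence where the previous fold stopped.
    rng = random.Random(seed)
    return [attribute_subset(rng) for _ in range(N_FOLDS)]


local_subsets = subsets_with_local_seed(SEED)
shared_subsets = subsets_with_shared_rng(SEED)

print("local seed per fold:", local_subsets)
print("shared generator:   ", shared_subsets)
```

With a local seed, a single draw decides the attributes for all folds, so one unlucky seed affects the whole cross-validation result instead of averaging out.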
Greetings,
Sebastian
When the RapidMiner Random Forest is replaced by a Weka Random Forest, there are no performance fluctuations at all. The performance doesn't seem to have as high a peak value as the RapidMiner Random Forest though so I'd rather use the RapidMiner one...
I don't know how the Weka one is implemented. It might be that it always uses a local random seed...
Greetings,
Sebastian