RapidMiner

real random sampling on a small dataset

SOLVED
Maven

I'm trying to generate random user agent strings by sampling one small example set with OS info and another small example set with browser details. From each example set I want to take one random example, concatenate them, and use the result for further processing later on.

I use the Sample operator with absolute = 1, and this does give me a single example from each of my sets every time. Unfortunately it is exactly the same example every time, so there seems to be no randomness involved. I assume randomness only kicks in once the example set is bigger, but I would like to understand how to do this on a really small set as well. Or, how many examples are needed to get random results from the Sample operator instead of the same one each time?

I've attached an example showing the problem: the result is always the same, even though it should be random in theory.

5 REPLIES
RM Certified Expert

Re: real random sampling on a small dataset

You're using the system random seed, so the result is the same every time. Try using a different seed.

Maven

Re: real random sampling on a small dataset

Thanks @Thomas_Ott, how would that work in practice?

I've tried setting 'use local random seed', but the result still seems to be the same every time. I do get different values when I change the local random seed value, but those are then also the same every time I rerun the operator. Or am I doing this wrong?

What I would like to achieve is that each time I run the process, a different single example is sampled from my set, totally at random.

I could probably create a macro using the random function and use that as the input for the random seed, but that looks like overcomplicating things. Also, the random function does not give me many options to generate a number between 1 and 1992 (max value allowed).

RM Certified Expert

Re: real random sampling on a small dataset

The reason the result is the same every time you set the random seed to a fixed value like 1992 is reproducibility: academic researchers need a particular random number sequence to be reproducible for peer review. Let's see if @Edin_Klapic knows whether there is a purely random number generator inside RapidMiner Studio.
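
The reproducibility described here is not specific to RapidMiner. As a rough illustration only, the following Python sketch shows that a generator initialized with the same fixed seed draws the same example on every run. The list contents and the helper name `sample_one` are made up for this example, and Python's generator is not the one RapidMiner uses.

```python
import random

# Hypothetical stand-in for a small example set of OS strings.
os_examples = [
    "Windows NT 10.0; Win64; x64",
    "Macintosh; Intel Mac OS X 10_15_7",
    "X11; Linux x86_64",
]

def sample_one(examples, seed):
    """Draw one example using a generator initialized with a fixed seed."""
    rng = random.Random(seed)
    return rng.choice(examples)

# Two "runs" with the same seed pick the same example.
first_run = sample_one(os_examples, seed=1992)
second_run = sample_one(os_examples, seed=1992)
print(first_run == second_run)  # True
```

Changing the seed may change which example is drawn, but any single fixed seed keeps returning the same one, which matches the behavior reported above.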

RM Staff
Solution

Re: real random sampling on a small dataset

Hi,

Use Generate Macro with

date_millis(date_now())%10000

and use the resulting macro as the random seed.

That should do it.

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
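
The idea behind the macro expression above can be sketched outside RapidMiner as well. This Python example assumes the expression evaluates to the current time in milliseconds modulo 10000; the function name `time_based_seed` and the browser list are hypothetical and only serve the illustration.

```python
import random
import time

def time_based_seed(modulus=10000):
    """Derive a seed from the current time in milliseconds, reduced modulo `modulus`."""
    millis = int(time.time() * 1000)
    return millis % modulus

# Hypothetical stand-in for a small example set of browser strings.
browser_examples = [
    "Chrome/120.0.0.0 Safari/537.36",
    "Firefox/121.0",
    "Version/17.2 Safari/605.1.15",
]

seed = time_based_seed()
pick = random.Random(seed).choice(browser_examples)
print(seed, pick)
```

Because the seed changes from one millisecond to the next, separate runs will usually start from different seeds. With only three examples to choose from, consecutive runs can still return the same example by chance.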
RM Staff
Solution

Re: real random sampling on a small dataset

Hi,

You could also have the random generator of the process initialized with the system time. You can achieve that by setting the random seed of the process to -1. If you then don't use a local random seed for the sampling operator, the result will differ every time you start the process.
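
As a loose analogy for a process-wide generator that is not pinned to a fixed seed, the Python sketch below creates one unseeded generator and uses it for both draws, then concatenates the two picks as in the original question. The lists, the `Mozilla/5.0` prefix, and the function name `random_user_agent` are assumptions for illustration, not part of the attached process.

```python
import random

# Hypothetical stand-ins for the two small example sets.
os_examples = [
    "Windows NT 10.0; Win64; x64",
    "Macintosh; Intel Mac OS X 10_15_7",
    "X11; Linux x86_64",
]
browser_examples = [
    "Chrome/120.0.0.0 Safari/537.36",
    "Firefox/121.0",
    "Version/17.2 Safari/605.1.15",
]

def random_user_agent(rng):
    """Pick one OS and one browser example and concatenate them."""
    os_part = rng.choice(os_examples)
    browser_part = rng.choice(browser_examples)
    return f"Mozilla/5.0 ({os_part}) {browser_part}"

# No seed given: Python seeds the generator from OS entropy or the system time,
# so separate runs of this script start from different states.
process_rng = random.Random()
print(random_user_agent(process_rng))
```

Passing a fixed seed to `random.Random` instead would correspond to the local random seed case, where every run returns the same combination.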