Sample Operator - Probability

btibert Member, University Professor Posts: 92  Guru
I have to admit, I am having a hard time understanding the output from a Sample Operator when selecting probability as the sample parameter.  

For example, if I use Generate Data to create a 100-example ExampleSet and connect the Sample operator with the probability set to 0.1, I get 7 records.

In short, why is it not 10 records?   I am having a hard time wrapping my head around this.

Best Answer

  • jacobcybulski Member, University Professor Posts: 365   Unicorn
    Solution Accepted
    Here the sample size is determined probabilistically, following a roughly normal distribution, and then the sample is selected randomly. I assume this could find an application in repeated resampling, to avoid the bias attached to a fixed sample size.
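
The "why 7 and not 10" question can be worked through numerically. What follows is a minimal sketch, assuming each example is kept independently with probability 0.1 (this is what the Python snippet posted further down in this thread suggests; it is not taken from RapidMiner's source), so that the sample size follows a binomial distribution. The helper `binomial_pmf` is a hypothetical name used only for illustration.

```python
from math import comb, sqrt


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of keeping exactly k of n examples when each is kept with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Values from the question: 100 examples, probability 0.1.
n, p = 100, 0.1

mean = n * p                   # expected sample size
std = sqrt(n * p * (1 - p))    # standard deviation of the sample size

p_exactly_10 = binomial_pmf(10, n, p)
p_exactly_7 = binomial_pmf(7, n, p)

print(f"expected size: {mean:.1f}, standard deviation: {std:.1f}")
print(f"P(size == 10) = {p_exactly_10:.3f}")
print(f"P(size == 7)  = {p_exactly_7:.3f}")
```

Under that assumption, a sample of 7 lies one standard deviation below the expected 10, so it is an unremarkable outcome, and landing on exactly 10 happens in only roughly 13% of runs.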

Answers

  • jacobcybulski Member, University Professor Posts: 365   Unicorn
    edited January 12
    @btibert you have not been lucky :) It is probabilistic sampling, so if you were to sample 1000 times, the samples drawn with probability 0.1 would contain 10 examples on average. You can try the following experiment: in a loop of 1000 iterations, vary the random seed from 0 to 1000 and check the sample size. You will then see a distribution of sample sizes; the more examples you start with and the more samples you request, the closer to a normal distribution it is going to be. So when you sample once, you may not be lucky enough to have the sample size land exactly on the mean.
  • btibert Member, University Professor Posts: 92  Guru
    edited January 13
    @jacobcybulski
    Thanks for the help, but let me ask this differently. What exactly is the sampling doing under the hood? Is it assigning every record a score from a distribution and only selecting those with a value < .1 or > .9? I am just trying to wrap my head around this approach to sampling. I tend to compare to R or Python, where I can set the number of random records or the % of records I want. The idea of probabilistic sampling is not something I have come across too often.
  • jacobcybulski Member, University Professor Posts: 365   Unicorn
    One application of such sampling is in simulation, e.g. studying lift operation during the morning peak, where the lifts would be filled to their legal limits, but sometimes with a few people fewer and sometimes with a few people more, over the limit. The samples of lift loads could be taken from a data set of typical people of different genders, weights and heights.
  • btibert Member, University Professor Posts: 92  Guru
    Thanks!
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,822  RM Data Scientist
    edited January 14

    I checked the code. What happens is something like this:

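    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
    r = np.random.uniform(0, 1, 100)  # one uniform draw per row
    mask = r < 0.1                    # keep a row when its draw falls below the probability
    df = df[mask]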


    in Python. There is some more code to make it efficient and so on, but in principle that is it. For the rest, I just second Jacob.

    Cheers,
    Martin


    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • btibert Member, University Professor Posts: 92  Guru
    @mschmitz
     
    That is perfect.  Exactly what I was looking for!
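
The seed experiment suggested earlier in the thread can be sketched in Python along the same lines as the snippet above. This is only an illustration, assuming each row is kept when an independent uniform draw falls below the probability; it uses NumPy's random generator rather than RapidMiner's, so the sizes for a given seed will not match what the Sample operator returns. The function name `probability_sample_size` is hypothetical.

```python
import numpy as np


def probability_sample_size(n_rows: int, probability: float, seed: int) -> int:
    """Count the rows kept when each row is kept independently with the given probability."""
    rng = np.random.default_rng(seed)
    draws = rng.uniform(0.0, 1.0, n_rows)      # one uniform draw per row
    return int((draws < probability).sum())    # rows whose draw falls below the threshold


# Repeat the sampling for 1000 different seeds on a 100-row data set.
sizes = np.array([probability_sample_size(100, 0.1, seed) for seed in range(1000)])

print("mean sample size:", sizes.mean())
print("smallest / largest:", sizes.min(), sizes.max())
print("runs with exactly 10 rows:", int((sizes == 10).sum()))
```

The mean should sit close to 10 while individual runs scatter around it, which is the behaviour described in the answers. For a fixed number or fraction of rows, as in the R or Python habit mentioned above, a different sampling mode is needed.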