"Some bugs connected with data sampling"

wokon Member Posts: 8 Contributor II
edited May 2019 in Help
When trying to sample data out of a big data set, I stumbled over several errors connected with data sampling:
  • 1. Setting rapidminer.general.randomseed = -1 in Tools – Preferences does not have the desired effect: reopening the Preferences window shows that rapidminer.general.randomseed is set to 1 instead of -1. When running ExampleSource with sample_ratio=0.5, you always get the same sequence.
  • 2. Setting rapidminer.general.randomseed = -1 in .rapidminer/ 4_2_0_rapidminerrc.Windows XP works: now we get different samples in each run. However, a warning message appears when opening the Preferences dialog box: “Illegal value '-1' for parameter 'rapidminer.general.randomseed' has been corrected to '1'.”  (???) Nevertheless, the system still behaves as if -1 were in effect.
  • 3. Changing rapidminer.general.randomseed in the Preferences dialog box to any positive value, e.g. 42, followed by "Apply" & "Save", leaves the random behaviour untouched (different samples in each run).  Only after restarting RapidMiner does the new setting "42" take effect >> the same sample A is produced in every run.
  • 4. Changing rapidminer.general.randomseed in the Preferences dialog box to any other positive value, e.g. 84, followed by "Apply" & "Save", again leaves the random behaviour untouched (same sample A in each run).  Only after restarting RapidMiner does the new setting "84" take effect >> a new, but always identical, sample B is produced in every run.
  • 5. With rapidminer.general.randomseed = -1, only sample_ratio<1.0 generates different samples in each run. With sample_ratio=1.0 and sample_size=1000 (on a 50000-record dataset), each run produces the same sequence of 1000 records rather than 1000 different records. So sample_size seems to involve no randomness.
  • 6. Most disturbing: if I use the operator Sampling and set its parameter local_random_seed to any value other than -1, then any incoming dataset is reduced to 0 records on output, irrespective of how large the sample_ratio is! This leaves me rather puzzled  ???
I'm using RapidMiner 4.2 under Windows XP and this is the code I use together with the AML- and DAT-file in the attachment:

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource" breakpoints="after">
        <parameter key="attributes" value="dmc2007_train_small.aml"/>
        <parameter key="sample_ratio" value="0.5"/>
    </operator>
    <operator name="Sampling" class="Sampling">
        <parameter key="sample_ratio" value="0.2"/>
    </operator>
</operator>
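For reference, the setup described in item 5 would look roughly like the following. This is only a sketch: it assumes that sample_size is set as a plain integer parameter on ExampleSource next to sample_ratio, and it uses the values quoted in item 5.

```xml
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource" breakpoints="after">
        <parameter key="attributes" value="dmc2007_train_small.aml"/>
        <!-- values from item 5: ratio 1.0 plus a fixed size of 1000 -->
        <parameter key="sample_ratio" value="1.0"/>
        <parameter key="sample_size" value="1000"/>
    </operator>
</operator>
```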
Am I really the first one to notice this somewhat strange behaviour, or am I doing something in an unexpected way? Isn't it strange that there is no way to achieve a "random random seed" from the GUI, although the tooltip says that -1 would do it?

Any clarifications are greatly appreciated

Best regards

Wolfgang

[attachment deleted by admin]

Answers

  • steffen Member Posts: 347 Maven
    Hello Wolfgang

    First of all: Is there a specific reason you refuse to work with the latest version of RapidMiner ;) ?

    Second:
    I can confirm that changing the global random seed in the Preferences dialog does not work.

    Third:
    If you used 4.4, you could set the parameters like this to get different samples:

    <operator name="Root" class="Process" expanded="yes">
        <parameter key="resultfile" value="C:\Dokumente und Einstellungen\wolfgang\Eigene Dateien\rm_workspace\DMC2007-rm\test.res"/>
        <parameter key="random_seed" value="2001"/>
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="dmc2007_train_small.aml"/>
            <parameter key="sample_ratio" value="0.5"/>
        </operator>
        <operator name="Sampling" class="Sampling">
            <parameter key="sample_ratio" value="0.2"/>
            <parameter key="use_local_random_seed" value="true"/>
            <parameter key="local_random_seed" value="5"/>
        </operator>
    </operator>
    kind regards,

    Steffen

    PS: A lot of bugs have been fixed since 4.2, so please consider the latest version.
    PPS: You deserve an award for your error descriptions. Clearly above average!
  • wokon Member Posts: 8 Contributor II
    Hello Steffen,

    thanks again for your fast response. Following your hint, I have now switched to RapidMiner 4.4 (the reason for not using it in the first place was that I had read some posts here in the forum about things which used to work in former versions but had problems in 4.4), and so far it works very well on my platform.

    I have to admit that most of the data sampling bugs described above are gone in RapidMiner 4.4. In particular, with 4.4 it is possible to change rapidminer.general.randomseed to -1. The only remaining items can be considered not as bugs, but as features:

    a) The setting of rapidminer.general.randomseed does not take effect immediately but only after a restart of RapidMiner (okay, this is not exactly the behaviour you expect from an "Apply" button...). The reason for this might be that the operator Root has its own parameter random_seed which is only filled at startup (very probably from rapidminer.general.randomseed ). If you change Root's random_seed to -1, 42 or 84 you get immediately the desired effects.
    [Perhaps something to work out in a further and future appendix of the documentation ...  ;) ]

    b) What remains is the fact that the parameter sample_size in operator ExampleSource always produces the same sequence, irrespective of what the global or local random seed actually is. But as I said, it can be considered a feature, not a bug...
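    Point a) can be sketched as follows. This is a minimal example, assuming Root's random_seed parameter behaves in 4.4 as described above (taking effect immediately, with -1 giving a different sample in each run); the file name is the one from the original process.

```xml
<operator name="Root" class="Process" expanded="yes">
    <!-- -1: different sample in each run; 42 or 84: one fixed, repeatable sample -->
    <parameter key="random_seed" value="-1"/>
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="dmc2007_train_small.aml"/>
        <parameter key="sample_ratio" value="0.5"/>
    </operator>
</operator>
```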

    So I apologize for bothering you with bugs mostly from earlier versions.

    And thanks for the PPS  :)

    Best regards
    Wolfgang
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hello Wolfgang,

    a) The setting of rapidminer.general.randomseed does not take effect immediately but only after a restart of RapidMiner (okay, this is not exactly the behaviour you expect from an "Apply" button...). The reason for this might be that the operator Root has its own parameter random_seed which is only filled at startup (very probably from rapidminer.general.randomseed ). If you change Root's random_seed to -1, 42 or 84 you get immediately the desired effects.
    [Perhaps something to work out in a further and future appendix of the documentation ...  ;) ]
    Oh, you are right. I never thought about this. I will add it to our to-do list and we will check whether we can work around it, or whether we simply have to document it  ;)

    b) What remains is the fact that the parameter sample_size in operator ExampleSource always produces the same sequence, irrespective of what the global or local random seed actually is. But as I said, it can be considered a feature, not a bug...
    A clear feature  ;)

    The reason for this behavior is that we try to avoid loading all examples only to discard most of them again, which of course matters for large files. If you only need a rough rather than an exact number of examples, you could use the parameter "sample_ratio" instead. Or just use one of the sampling operators after loading.
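    The second suggestion might look like the sketch below. It assumes the 50000-record dataset from the original post, so a ratio of 0.02 gives roughly 1000 examples; the ratio is an illustrative value, not one taken from the thread.

```xml
<operator name="Root" class="Process" expanded="yes">
    <parameter key="random_seed" value="-1"/>
    <operator name="ExampleSource" class="ExampleSource">
        <!-- no sample_size here: all examples are loaded first -->
        <parameter key="attributes" value="dmc2007_train_small.aml"/>
    </operator>
    <operator name="Sampling" class="Sampling">
        <!-- 0.02 of 50000 records is roughly 1000 examples -->
        <parameter key="sample_ratio" value="0.02"/>
    </operator>
</operator>
```

    Note that this loads the complete file before sampling, which is exactly the cost that sample_size is meant to avoid.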


    Thanks for the hints and cheers,
    Ingo