"Why doesn't Split Data Inherit the Global Random Seed?"

Panj1 Member Posts: 2 Contributor I
edited June 2019 in Help

I retrieved the Titanic dataset, then multiplied it, and copied and pasted three Split Data operators, each with a 0.7/0.3 split. The resulting partitions are different each time. Now, I can set the local random seed to some value to make sure each operator splits exactly the same way, but I would have expected the operators to inherit the Global Random Seed by default. Is this expected behavior? It seems unintuitive if it is.

 

If it is expected behavior, is there an option somewhere to force randomizing operators to use the global random seed?

 

I am using RapidMiner Studio 8.2.

 

[Attached image: Titanic Split Data Test.png]

Thank You,

 

Please see XML below. 

 

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="340">
<parameter key="repository_entry" value="//Samples/data/Titanic"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.2.000" expanded="true" height="124" name="Multiply" width="90" x="179" y="391"/>
<operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="82" name="Split Data (4)" width="90" x="447" y="595">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
</operator>
<operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="82" name="Split Data (3)" width="90" x="447" y="442">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
</operator>
<operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="82" name="Split Data (2)" width="90" x="447" y="289">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
</operator>
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve Titanic (2)" width="90" x="45" y="748">
<parameter key="repository_entry" value="//Samples/data/Titanic"/>
</operator>
<operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="82" name="Split Data (5)" width="90" x="313" y="748">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
</operator>
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve Titanic (3)" width="90" x="45" y="850">
<parameter key="repository_entry" value="//Samples/data/Titanic"/>
</operator>
<operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="82" name="Split Data (6)" width="90" x="313" y="850">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
</operator>
<connect from_op="Retrieve Titanic" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Split Data (2)" to_port="example set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Split Data (3)" to_port="example set"/>
<connect from_op="Multiply" from_port="output 3" to_op="Split Data (4)" to_port="example set"/>
<connect from_op="Split Data (4)" from_port="partition 1" to_port="result 3"/>
<connect from_op="Split Data (3)" from_port="partition 1" to_port="result 2"/>
<connect from_op="Split Data (2)" from_port="partition 1" to_port="result 1"/>
<connect from_op="Retrieve Titanic (2)" from_port="output" to_op="Split Data (5)" to_port="example set"/>
<connect from_op="Split Data (5)" from_port="partition 1" to_port="result 4"/>
<connect from_op="Retrieve Titanic (3)" from_port="output" to_op="Split Data (6)" to_port="example set"/>
<connect from_op="Split Data (6)" from_port="partition 1" to_port="result 5"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
</process>
</operator>
</process>

Best Answer

  • jczogalla Employee, Member Posts: 144 RM Engineering
    Solution Accepted

    Hi Panj1!

    Welcome to the community. :) As a tip, you can use the "</>" button while writing your post to get a nicely formatted version of your XML. This also prevents parts of the XML from being converted to smilies, for example.

     

    Regarding the random generator question: The operators do use the global random generator by default. But because it is one shared generator, its state advances with each operator that draws from it, so every Split Data operator in your process sees a different state and produces a different partition. As long as you keep the execution order the same, the end results will stay the same between process executions. If you want two Split Data operators to produce the same partitions, those two need to have the same local random seed. The same applies to loops.
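
To illustrate the idea outside RapidMiner, here is a minimal Python sketch. It is not how RapidMiner implements Split Data; the helper `split_indices`, the row count of 1000 and the seed values are all made up for illustration. One shared `random.Random` object stands in for the global random generator, and freshly seeded objects stand in for local random seeds.

```python
import random


def split_indices(n, ratio, rng):
    """Shuffle the row indices 0..n-1 with the given generator and cut at `ratio`.

    Hypothetical helper for illustration only, not a RapidMiner API.
    """
    indices = list(range(n))
    rng.shuffle(indices)          # consumes random numbers, advancing rng's state
    cut = round(n * ratio)
    return indices[:cut], indices[cut:]


N_ROWS = 1000        # arbitrary row count, assumed for the example
GLOBAL_SEED = 42     # arbitrary seed, assumed for the example
LOCAL_SEED = 7       # arbitrary seed, assumed for the example

# Two splits drawing from the same shared ("global") generator:
# the second split starts from an advanced state, so it differs from the first.
shared = random.Random(GLOBAL_SEED)
first, _ = split_indices(N_ROWS, 0.7, shared)
second, _ = split_indices(N_ROWS, 0.7, shared)

# Re-running the same sequence from the same starting seed reproduces
# both splits, as long as the order of the calls is unchanged.
rerun = random.Random(GLOBAL_SEED)
first_again, _ = split_indices(N_ROWS, 0.7, rerun)
second_again, _ = split_indices(N_ROWS, 0.7, rerun)

# Two splits that each get their own generator with the same ("local") seed
# produce identical partitions.
local_a, _ = split_indices(N_ROWS, 0.7, random.Random(LOCAL_SEED))
local_b, _ = split_indices(N_ROWS, 0.7, random.Random(LOCAL_SEED))

print("shared generator, splits equal:", first == second)
print("rerun reproduces both splits:", (first_again, second_again) == (first, second))
print("same local seed, splits equal:", local_a == local_b)
```

In this sketch the two splits from the shared generator differ from each other but are reproduced exactly on a rerun, while the two splits with the same local seed match each other. That mirrors the behavior described above.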

    If you just want to split the same data set the same way multiple times, you can also use the Split Data operator once and multiply its outputs. See the example XML below.

    <?xml version="1.0" encoding="UTF-8"?><process version="8.3.000-SNAPSHOT">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.3.000-SNAPSHOT" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="8.3.000-SNAPSHOT" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="340">
    <parameter key="repository_entry" value="//Samples/data/Titanic"/>
    </operator>
    <operator activated="true" class="split_data" compatibility="8.3.000-SNAPSHOT" expanded="true" height="103" name="Split Data" width="90" x="179" y="340">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.7"/>
    <parameter key="ratio" value="0.3"/>
    </enumeration>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.3.000-SNAPSHOT" expanded="true" height="82" name="Multiply" width="90" x="380" y="340"/>
    <operator activated="true" class="multiply" compatibility="8.3.000-SNAPSHOT" expanded="true" height="82" name="Multiply (2)" width="90" x="380" y="442"/>
    <connect from_op="Retrieve Titanic" from_port="output" to_op="Split Data" to_port="example set"/>
    <connect from_op="Split Data" from_port="partition 1" to_op="Multiply" to_port="input"/>
    <connect from_op="Split Data" from_port="partition 2" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_port="result 1"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    I hope this helps!

     

    Cheers

    Jan
