"Why doesn't Split-Data Inherit the Global Random Seed?"

Panj1 Member Posts: 2 Contributor I
edited June 7 in Help

I retrieved the Titanic dataset, then multiplied it, and copied and pasted three Split Data operators, each with a 0.7/0.3 split. The resulting partitions are different for each operator. Now, I can set the local random seed to make sure each operator splits exactly the same way, but I would have expected the operators to inherit the global random seed by default in this case. Is this expected behavior? It seems unintuitive if it is.

 

If it is expected behavior, is there an option somewhere to force randomization operators to use a global random seed? 

 

I am using RapidMiner Studio 8.2.

 

[Screenshot: Titanic Split Data Test.png]

Thank You,

 

Please see XML below. 

 

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="340">
<parameter key="repository_entry" value="//Samples/data/Titanic"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.2.000" expanded="true" height="124" name="Multiply" width="90" x="179" y="391"/>
<operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="82" name="Split Data (4)" width="90" x="447" y="595">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
</operator>
<operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="82" name="Split Data (3)" width="90" x="447" y="442">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
</operator>
<operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="82" name="Split Data (2)" width="90" x="447" y="289">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
</operator>
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve Titanic (2)" width="90" x="45" y="748">
<parameter key="repository_entry" value="//Samples/data/Titanic"/>
</operator>
<operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="82" name="Split Data (5)" width="90" x="313" y="748">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
</operator>
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve Titanic (3)" width="90" x="45" y="850">
<parameter key="repository_entry" value="//Samples/data/Titanic"/>
</operator>
<operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="82" name="Split Data (6)" width="90" x="313" y="850">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
</operator>
<connect from_op="Retrieve Titanic" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Split Data (2)" to_port="example set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Split Data (3)" to_port="example set"/>
<connect from_op="Multiply" from_port="output 3" to_op="Split Data (4)" to_port="example set"/>
<connect from_op="Split Data (4)" from_port="partition 1" to_port="result 3"/>
<connect from_op="Split Data (3)" from_port="partition 1" to_port="result 2"/>
<connect from_op="Split Data (2)" from_port="partition 1" to_port="result 1"/>
<connect from_op="Retrieve Titanic (2)" from_port="output" to_op="Split Data (5)" to_port="example set"/>
<connect from_op="Split Data (5)" from_port="partition 1" to_port="result 4"/>
<connect from_op="Retrieve Titanic (3)" from_port="output" to_op="Split Data (6)" to_port="example set"/>
<connect from_op="Split Data (6)" from_port="partition 1" to_port="result 5"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
</process>
</operator>
</process>

Best Answer

  • jczogalla Posts: 125   RM Engineering
    Solution Accepted

    Hi Panj1!

    Welcome to the community. :) As a tip, you can use the "</>" button while writing your post to get a nicely formatted version of your XML. This prevents parts of the XML from being converted to smilies, for example.

     

    Regarding the random generator question: The operators do use the global random generator by default, but because it is a single shared generator, its state advances with each operator that draws from it. This means that as long as you keep the execution order the same, the end results will stay the same between process executions. But if you want two split operators to produce the same partitions, those two need to have the same local random seed. This is also the case for loops.
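
    As a rough sketch of the local-seed option, the fragment below shows two of the Split Data operators from the question sharing one local seed, so both should produce the same partitions regardless of execution order. It assumes the parameter keys "use_local_random_seed" and "local_random_seed" (worth confirming by ticking the "use local random seed" option in the Parameters panel and checking the XML view). The seed value 1992 is arbitrary and only needs to be identical in both operators. The fragment belongs inside the <process expanded="true"> element, and the existing <connect> elements stay as they are.

    ```xml
    <!-- Sketch only: two Split Data operators that share one local seed. -->
    <operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="82" name="Split Data (2)" width="90" x="447" y="289">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.7"/>
    <parameter key="ratio" value="0.3"/>
    </enumeration>
    <!-- Assumed parameter keys; the seed value is arbitrary. -->
    <parameter key="use_local_random_seed" value="true"/>
    <parameter key="local_random_seed" value="1992"/>
    </operator>
    <operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="82" name="Split Data (3)" width="90" x="447" y="442">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.7"/>
    <parameter key="ratio" value="0.3"/>
    </enumeration>
    <!-- Same seed as Split Data (2), so the partitions should match. -->
    <parameter key="use_local_random_seed" value="true"/>
    <parameter key="local_random_seed" value="1992"/>
    </operator>
    ```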

    If you just want to split the same data set multiple times the same way, you can also use the split operator once and multiply its outputs, example XML below.

    <?xml version="1.0" encoding="UTF-8"?><process version="8.3.000-SNAPSHOT">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.3.000-SNAPSHOT" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="8.3.000-SNAPSHOT" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="340">
    <parameter key="repository_entry" value="//Samples/data/Titanic"/>
    </operator>
    <operator activated="true" class="split_data" compatibility="8.3.000-SNAPSHOT" expanded="true" height="103" name="Split Data" width="90" x="179" y="340">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.7"/>
    <parameter key="ratio" value="0.3"/>
    </enumeration>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.3.000-SNAPSHOT" expanded="true" height="82" name="Multiply" width="90" x="380" y="340"/>
    <operator activated="true" class="multiply" compatibility="8.3.000-SNAPSHOT" expanded="true" height="82" name="Multiply (2)" width="90" x="380" y="442"/>
    <connect from_op="Retrieve Titanic" from_port="output" to_op="Split Data" to_port="example set"/>
    <connect from_op="Split Data" from_port="partition 1" to_op="Multiply" to_port="input"/>
    <connect from_op="Split Data" from_port="partition 2" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_port="result 1"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    I hope this helps!

     

    Cheers

    Jan
