How to do a custom split by indices? (R script is too slow)

MaggiD Member Posts: 3 Contributor I
edited November 2018 in Help
I need to split my data into a training set and a test set. This should happen automatically so that I can loop it for cross-validation.

I don't want to do stratified sampling. Instead, I wrote an R script that chooses instances by their group membership, which is computed with a regex. The main objective of the script is to ensure that if one instance of a group is selected for the test set, all remaining instances of that group are selected too.
However, I can't use this script with the R extension, because it takes forever to execute there.
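
The grouping rule described above could be sketched outside RapidMiner roughly as follows. This is only an illustration in Python, not the original R script: the ID format, the regex, and the names `group_key` and `group_split` are hypothetical, and the split is made over whole groups so that no group is divided between training and test set.

```python
import random
import re
from collections import defaultdict


def group_key(instance_id, pattern):
    """Return the group of an ID, taken from the first capture group of the regex."""
    match = re.search(pattern, instance_id)
    if match is None:
        raise ValueError(f"no group found in {instance_id!r}")
    return match.group(1)


def group_split(ids, pattern, test_fraction=0.3, seed=42):
    """Split ids so that every group ends up entirely in train or entirely in test."""
    groups = defaultdict(list)
    for instance_id in ids:
        groups[group_key(instance_id, pattern)].append(instance_id)

    # Shuffle the group names (not the instances) with a fixed seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # Take at least one whole group for the test set.
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [i for i in ids if group_key(i, pattern) not in test_keys]
    test = [i for i in ids if group_key(i, pattern) in test_keys]
    return train, test


# Hypothetical IDs of the form "<group>_<number>".
example_ids = ["patA_1", "patA_2", "patB_1", "patC_1", "patC_2", "patD_1"]
train_ids, test_ids = group_split(example_ids, r"^(pat[A-Z])_")
print("train:", train_ids)
print("test:", test_ids)
```

Calling `group_split` with a different `seed` per iteration would give a different group-wise split each time, which is one way to drive a cross-validation loop.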

I came up with a quick workaround: I use R (outside of RM) to precompute an example set that lists every ID in my known set with a 0 or 1 next to it, indicating whether it belongs to the test set. I can then join this example set with my known set and use the inTestSet? attribute to filter the rows for the test set.

Now I wonder if there is a better way. Is there an operator that can filter rows by a given list of indices?
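
To make the workaround concrete, here is a minimal sketch of the same idea in plain Python rather than RapidMiner operators: a precomputed table of IDs with a 0/1 flag is turned into a set of test IDs, and the data rows are then filtered by membership in that set. The row layout (lists of dicts), the `id` field name, and the function names are assumptions made for illustration; only the `inTestSet?` attribute name comes from the post.

```python
def flags_to_test_ids(flag_rows, id_field="id", flag_field="inTestSet?"):
    """Collect the IDs whose flag is 1 in the precomputed table."""
    return {row[id_field] for row in flag_rows if row[flag_field] == 1}


def split_by_test_ids(rows, test_ids, id_field="id"):
    """Return (train, test): rows whose ID is in test_ids go to the test set."""
    test_ids = set(test_ids)
    train = [row for row in rows if row[id_field] not in test_ids]
    test = [row for row in rows if row[id_field] in test_ids]
    return train, test


# Hypothetical data: the "known set" and the precomputed flag table.
known_set = [{"id": 1, "x": 0.5}, {"id": 2, "x": 1.5}, {"id": 3, "x": 2.5}]
flag_table = [
    {"id": 1, "inTestSet?": 0},
    {"id": 2, "inTestSet?": 1},
    {"id": 3, "inTestSet?": 1},
]

train_rows, test_rows = split_by_test_ids(known_set, flags_to_test_ids(flag_table))
print("train:", train_rows)
print("test:", test_rows)
```

Filtering against a set of IDs gives the same result as joining on the ID and then filtering on the flag, without materialising the joined table.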


  • homburg Moderator, Employee, Member Posts: 114 RM Data Scientist
    Hi Maggi,

    As far as I understand, you want to split your data set into several partitions and select a subset of them for the training process. I attached a process that selects a subset of partitions in two different ways. The first filters examples against a fixed list of values that you set yourself (a, b, c); the second delivers a specified number of partitions chosen at random. For demonstration purposes I set the global random seed to -1, which makes the seed time-dependent. If you use the whole thing inside a loop, please set the global random seed to a fixed value so that you can reproduce your process results.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.008">
     <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
       <parameter key="random_seed" value="-1"/>
       <process expanded="true">
         <operator activated="true" class="retrieve" compatibility="6.0.008" expanded="true" height="60" name="Retrieve Iris" width="90" x="112" y="30">
           <parameter key="repository_entry" value="//Samples/data/Iris"/>
         </operator>
         <operator activated="true" class="filter_examples" compatibility="6.0.008" expanded="true" height="94" name="Fixed Partitions" width="90" x="313" y="30">
           <parameter key="parameter_string" value="85"/>
           <parameter key="parameter_expression" value="contains(&quot;Iris-setosa,Iris-versicolor&quot;,label)"/>
           <parameter key="condition_class" value="expression"/>
           <list key="filters_list"/>
         </operator>
         <operator activated="true" class="retrieve" compatibility="6.0.008" expanded="true" height="60" name="Retrieve Iris (2)" width="90" x="45" y="255">
           <parameter key="repository_entry" value="//Samples/data/Iris"/>
         </operator>
         <operator activated="true" class="set_macros" compatibility="6.0.008" expanded="true" height="76" name="Parameters" width="90" x="179" y="255">
           <list key="macros">
             <parameter key="target" value="label"/>
             <parameter key="partitions" value="2"/>
           </list>
         </operator>
         <operator activated="true" class="subprocess" compatibility="6.0.008" expanded="true" height="76" name="Random Partitions" width="90" x="313" y="255">
           <process expanded="true">
             <operator activated="true" class="multiply" compatibility="6.0.008" expanded="true" height="94" name="Multiply" width="90" x="45" y="30"/>
             <operator activated="true" class="select_attributes" compatibility="6.0.008" expanded="true" height="76" name="Select Attributes" width="90" x="112" y="165">
               <parameter key="attribute_filter_type" value="single"/>
               <parameter key="attribute" value="%{target}"/>
               <parameter key="include_special_attributes" value="true"/>
             </operator>
             <operator activated="true" class="remove_duplicates" compatibility="6.0.008" expanded="true" height="76" name="Remove Duplicates" width="90" x="179" y="300">
               <parameter key="include_special_attributes" value="true"/>
             </operator>
             <operator activated="true" class="transpose" compatibility="6.0.008" expanded="true" height="76" name="Transpose" width="90" x="246" y="165"/>
             <operator activated="true" class="select_by_random" compatibility="6.0.008" expanded="true" height="76" name="Select by Random" width="90" x="313" y="300">
               <parameter key="use_fixed_number_of_attributes" value="true"/>
               <parameter key="number_of_attributes" value="%{partitions}"/>
             </operator>
             <operator activated="true" class="transpose" compatibility="6.0.008" expanded="true" height="76" name="Transpose (2)" width="90" x="380" y="165"/>
             <operator activated="true" class="select_attributes" compatibility="6.0.008" expanded="true" height="76" name="Select Attributes (2)" width="90" x="447" y="300">
               <parameter key="attribute_filter_type" value="single"/>
               <parameter key="attribute" value="label"/>
               <parameter key="include_special_attributes" value="true"/>
             </operator>
             <operator activated="true" class="join" compatibility="6.0.008" expanded="true" height="76" name="Join" width="90" x="581" y="30">
               <parameter key="use_id_attribute_as_key" value="false"/>
               <list key="key_attributes">
                 <parameter key="%{target}" value="%{target}"/>
               </list>
             </operator>
             <connect from_port="in 1" to_op="Multiply" to_port="input"/>
             <connect from_op="Multiply" from_port="output 1" to_op="Join" to_port="left"/>
             <connect from_op="Multiply" from_port="output 2" to_op="Select Attributes" to_port="example set input"/>
             <connect from_op="Select Attributes" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
             <connect from_op="Remove Duplicates" from_port="example set output" to_op="Transpose" to_port="example set input"/>
             <connect from_op="Transpose" from_port="example set output" to_op="Select by Random" to_port="example set input"/>
             <connect from_op="Select by Random" from_port="example set output" to_op="Transpose (2)" to_port="example set input"/>
             <connect from_op="Transpose (2)" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
             <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Join" to_port="right"/>
             <connect from_op="Join" from_port="join" to_port="out 1"/>
             <portSpacing port="source_in 1" spacing="0"/>
             <portSpacing port="source_in 2" spacing="0"/>
             <portSpacing port="sink_out 1" spacing="0"/>
             <portSpacing port="sink_out 2" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Retrieve Iris" from_port="output" to_op="Fixed Partitions" to_port="example set input"/>
         <connect from_op="Fixed Partitions" from_port="example set output" to_port="result 1"/>
         <connect from_op="Retrieve Iris (2)" from_port="output" to_op="Parameters" to_port="through 1"/>
         <connect from_op="Parameters" from_port="through 1" to_op="Random Partitions" to_port="in 1"/>
         <connect from_op="Random Partitions" from_port="out 1" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="90"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>
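
For readers who want to see the logic of the two branches without opening the process, the following is a rough Python sketch of what they appear to do. It is not a translation of the RapidMiner operators: rows are assumed to be dicts with a `label` field, the function names are hypothetical, and the fixed-list branch uses exact matching against a set, whereas the process uses a `contains` check on a comma-separated string.

```python
import random


def filter_by_values(rows, allowed, field="label"):
    """Fixed partitions: keep rows whose field value is in a user-supplied list."""
    allowed = set(allowed)
    return [row for row in rows if row[field] in allowed]


def random_partitions(rows, n_partitions, field="label", seed=1992):
    """Random partitions: pick n distinct field values at random, keep their rows."""
    # Distinct values play the role of Select Attributes + Remove Duplicates.
    distinct = sorted({row[field] for row in rows})
    # A fixed seed makes the selection reproducible inside a loop.
    chosen = set(random.Random(seed).sample(distinct, n_partitions))
    # Keeping only rows with a chosen value mirrors the final join.
    return [row for row in rows if row[field] in chosen]


# Hypothetical miniature stand-in for the Iris sample data.
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
data = [{"label": label, "n": i} for i, label in enumerate(labels * 3)]

print(filter_by_values(data, ["Iris-setosa", "Iris-versicolor"]))
print(random_partitions(data, n_partitions=2))
```

Passing the loop index as `seed` would give a different but repeatable selection per iteration, which matches the advice to use a fixed seed when looping.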