Options

Batch ID Generation

varunm1varunm1 Moderator, Member Posts: 1,207 Unicorn
edited December 2019 in Help
Hello,

Is there a simple way to generate batch id (Divide samples into 5 groups) based on ID column in a dataset. For example, I have a dataset with 400 samples related to 30 subjects (Multiple samples per subject). I would like to divide the data set into 5 (this can be any value) batches based on the Subject (not samples), so each batch will have 6 subjects related data. 

I attached a sample dataset where "Subject ID" is the ID column.
Regards,
Varun
https://www.varunmandalapu.com/

Be Safe. Follow precautions and Maintain Social Distancing

Tagged:

Best Answer

Answers

  • Options
    varunm1varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hello @sgenzer

    Thanks for your response. Is this scalable to ID with characters like (A, B,C D,... etc)? 

    I just replaced Subject ID column values to characters (Instead of 1 now its A).

    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • Options
    MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,509 RM Data Scientist
    Just do Nominal to numerical before to get unique integers?

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Options
    lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi Varun, 

    I'm not sure to understand, but is it what you are looking for ..?

    <?xml version="1.0" encoding="UTF-8"?><process version="9.5.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.5.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="9.5.001" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
            <parameter key="excel_file" value="C:\Users\Lionel\Downloads\Batch_Sample_Data.xlsx"/>
            <parameter key="sheet_selection" value="sheet number"/>
            <parameter key="sheet_number" value="1"/>
            <parameter key="imported_cell_range" value="A1"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="date_format" value=""/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="Subject ID.true.integer.attribute"/>
              <parameter key="1" value="Data.true.real.attribute"/>
              <parameter key="2" value="Data_1.true.real.attribute"/>
              <parameter key="3" value="Data_2.true.integer.attribute"/>
              <parameter key="4" value="Data_3.true.real.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="operator_toolbox:group_into_collection" compatibility="2.2.000" expanded="true" height="82" name="Group Into Collection" width="90" x="313" y="34">
            <parameter key="group_by_attribute" value="Subject ID"/>
            <parameter key="group_by_attribute (numerical)" value=""/>
            <parameter key="sorting_order" value="none"/>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Group Into Collection" to_port="exa"/>
          <connect from_op="Group Into Collection" from_port="col" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    
    Regards,

    Lionel



  • Options
    varunm1varunm1 Moderator, Member Posts: 1,207 Unicorn
    edited December 2019
    @mschmitz yep it works, I need to duplicate the column and then convert it to nominal to numerical. The reason for duplication is to trace back the ID's 
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

Sign In or Register to comment.