RapidMiner 9.7 is Now Available

Lots of amazing new improvements including true version control! Learn more about what's new here.

CLICK HERE TO DOWNLOAD

Batch ID Generation

varunm1varunm1 Moderator, Member Posts: 1,197   Unicorn
edited December 2019 in Help
Hello,

Is there a simple way to generate batch id (Divide samples into 5 groups) based on ID column in a dataset. For example, I have a dataset with 400 samples related to 30 subjects (Multiple samples per subject). I would like to divide the data set into 5 (this can be any value) batches based on the Subject (not samples), so each batch will have 6 subjects related data. 

I attached a sample dataset where "Subject ID" is the ID column.
Regards,
Varun
https://www.varunmandalapu.com/

Be Safe. Follow precautions and Maintain Social Distancing

Tagged:

Best Answer

Answers

  • varunm1varunm1 Moderator, Member Posts: 1,197   Unicorn
    Hello @sgenzer

    Thanks for your response. Is this scalable to ID with characters like (A, B,C D,... etc)? 

    I just replaced Subject ID column values to characters (Instead of 1 now its A).

    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,462  RM Data Scientist
    Just do Nominal to numerical before to get unique integers?

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
    varunm1lionelderkrikorsgenzerIngoRM
  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,068   Unicorn
    Hi Varun, 

    I'm not sure to understand, but is it what you are looking for ..?

    <?xml version="1.0" encoding="UTF-8"?><process version="9.5.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.5.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="9.5.001" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
            <parameter key="excel_file" value="C:\Users\Lionel\Downloads\Batch_Sample_Data.xlsx"/>
            <parameter key="sheet_selection" value="sheet number"/>
            <parameter key="sheet_number" value="1"/>
            <parameter key="imported_cell_range" value="A1"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="date_format" value=""/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="Subject ID.true.integer.attribute"/>
              <parameter key="1" value="Data.true.real.attribute"/>
              <parameter key="2" value="Data_1.true.real.attribute"/>
              <parameter key="3" value="Data_2.true.integer.attribute"/>
              <parameter key="4" value="Data_3.true.real.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="operator_toolbox:group_into_collection" compatibility="2.2.000" expanded="true" height="82" name="Group Into Collection" width="90" x="313" y="34">
            <parameter key="group_by_attribute" value="Subject ID"/>
            <parameter key="group_by_attribute (numerical)" value=""/>
            <parameter key="sorting_order" value="none"/>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Group Into Collection" to_port="exa"/>
          <connect from_op="Group Into Collection" from_port="col" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    
    Regards,

    Lionel



    sgenzer
  • varunm1varunm1 Moderator, Member Posts: 1,197   Unicorn
    edited December 2019
    @mschmitz yep it works, I need to duplicate the column and then convert it to nominal to numerical. The reason for duplication is to trace back the ID's 
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

    mschmitzsgenzerIngoRM
Sign In or Register to comment.