"[SOLVED] sampling a number of examples from different groups"

jan87 Member Posts: 14 Contributor II
edited June 2019 in Help
Dear community,

Is it possible to draw a sample of, let's say, 50 examples from each of several groups that are defined by combinations of attribute values?

For example, I have attribute a with values 1, 2 and 3 and attribute b with values 1, 2 and 3. The groups formed by the different combinations of a and b contain different numbers of examples. How can I get a sample with the same number of examples from every group?

I already tried using the Multiply operator followed by several Filter Examples operators, but I have so many groups that this would take days to build...

Thanks for your help

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    Your basic idea is good and you can stick with it: filter the example set by group with the help of Filter Examples, apply the sampling, and then append the data from all groups.
    A chain of nested Loop Values operators saves you from creating the filter for each group manually. This process is still not trivial, but once it is set up, you can even add new groups to your data without having to update the process.

    Best, Marius
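
For readers who find the loop, filter, sample, append idea easier to follow in code, here is a minimal sketch of the same logic in plain Python. It is not RapidMiner code: the function name `sample_per_group`, the dictionary-based rows, and the toy data are assumptions made for illustration only.

```python
import random

def sample_per_group(rows, group_size, seed=0):
    """Draw group_size rows from every (att1, att2) combination.

    Hypothetical helper, not a RapidMiner API. Raises ValueError if a
    combination holds fewer than group_size rows (including none at all).
    """
    rng = random.Random(seed)
    values1 = sorted({row["att1"] for row in rows})
    values2 = sorted({row["att2"] for row in rows})
    result = []
    for v1 in values1:        # outer loop over the values of att1
        for v2 in values2:    # inner loop over the values of att2
            # Filter: keep only the rows of the current combination.
            group = [r for r in rows if r["att1"] == v1 and r["att2"] == v2]
            # Sample a fixed number of rows, then append them to the result.
            result.extend(rng.sample(group, group_size))
    return result

# Toy data (assumed): nine groups of unequal size, each with at least 6 rows.
rows = []
for a in ("1", "2", "3"):
    for b in ("1", "2", "3"):
        n = 5 + int(a) * int(b)
        rows.extend({"att1": a, "att2": b} for _ in range(n))

sampled = sample_per_group(rows, group_size=3)
```

With this toy data, `sampled` holds 3 rows for each of the 9 combinations, 27 rows in total. New attribute values are picked up automatically because the loops iterate over whatever values occur in the data.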
  • jan87 Member Posts: 14 Contributor II
    Hi,

    Could you perhaps give me a small example of how to use the Loop Values operator for this problem? I do not understand how to use it...

    thanks
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Here you go! Please note the use of the iteration macros (%{v1} and %{v2}) in the Filter Examples operators.
    The Aggregate operator at the end is only there to prove that you get 3 examples for each combination of att1 and att2.
    You will run into problems if a group contains fewer than (in this case) 3 examples. You could use the Branch operator to check that a group has at least group_size examples and only apply the sampling in that case.

    You'll find the code below.

    All the best,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
        <process expanded="true" height="352" width="718">
          <operator activated="true" class="generate_nominal_data" compatibility="5.3.000" expanded="true" height="60" name="Generate Nominal Data" width="90" x="112" y="30">
            <parameter key="number_examples" value="1000"/>
          </operator>
          <operator activated="true" class="loop_values" compatibility="5.3.000" expanded="true" height="76" name="Loop Values" width="90" x="313" y="30">
            <parameter key="attribute" value="att1"/>
            <parameter key="iteration_macro" value="v1"/>
            <process expanded="true" height="370" width="736">
              <operator activated="true" class="loop_values" compatibility="5.3.000" expanded="true" height="76" name="Loop Values (2)" width="90" x="246" y="30">
                <parameter key="attribute" value="att2"/>
                <parameter key="iteration_macro" value="v2"/>
                <process expanded="true" height="370" width="736">
                  <operator activated="true" class="filter_examples" compatibility="5.3.000" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="30">
                    <parameter key="condition_class" value="attribute_value_filter"/>
                    <parameter key="parameter_string" value="att1=%{v1}"/>
                  </operator>
                  <operator activated="true" class="filter_examples" compatibility="5.3.000" expanded="true" height="76" name="Filter Examples (2)" width="90" x="313" y="30">
                    <parameter key="condition_class" value="attribute_value_filter"/>
                    <parameter key="parameter_string" value="att2=%{v2}"/>
                  </operator>
                  <operator activated="true" class="sample" compatibility="5.3.000" expanded="true" height="76" name="Sample" width="90" x="447" y="30">
                    <parameter key="sample_size" value="3"/>
                    <list key="sample_size_per_class"/>
                    <list key="sample_ratio_per_class"/>
                    <list key="sample_probability_per_class"/>
                  </operator>
                  <connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
                  <connect from_op="Filter Examples" from_port="example set output" to_op="Filter Examples (2)" to_port="example set input"/>
                  <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample" to_port="example set input"/>
                  <connect from_op="Sample" from_port="example set output" to_port="out 1"/>
                  <portSpacing port="source_example set" spacing="0"/>
                  <portSpacing port="sink_out 1" spacing="0"/>
                  <portSpacing port="sink_out 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="example set" to_op="Loop Values (2)" to_port="example set"/>
              <connect from_op="Loop Values (2)" from_port="out 1" to_port="out 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="5.3.000" expanded="true" height="76" name="Append" width="90" x="447" y="30"/>
          <operator activated="true" class="aggregate" compatibility="5.3.000" expanded="true" height="76" name="Aggregate" width="90" x="581" y="30">
            <list key="aggregation_attributes">
              <parameter key="label" value="count"/>
            </list>
            <parameter key="group_by_attributes" value="|att1|att2"/>
          </operator>
          <connect from_op="Generate Nominal Data" from_port="output" to_op="Loop Values" to_port="example set"/>
          <connect from_op="Loop Values" from_port="out 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
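
The Branch check for small groups and the final count per group can be sketched in plain Python as well. This is an illustration under stated assumptions, not part of the process above: `sample_groups_guarded` is a hypothetical name, and keeping undersized groups whole is one possible choice (dropping them instead would be just as valid).

```python
import random
from collections import Counter, defaultdict

def sample_groups_guarded(rows, group_size, seed=0):
    """Sample group_size rows per (att1, att2) group where possible.

    Hypothetical helper. Assumption: groups with fewer than group_size
    rows are kept whole rather than sampled or dropped.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[(row["att1"], row["att2"])].append(row)

    result = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) >= group_size:   # the "Branch" condition
            result.extend(rng.sample(members, group_size))
        else:
            result.extend(members)       # too small: pass through unchanged
    return result

# Toy data (assumed): one large group, one undersized group, one exact fit.
rows = (
    [{"att1": "1", "att2": "1"}] * 10
    + [{"att1": "1", "att2": "2"}] * 2
    + [{"att1": "2", "att2": "1"}] * 3
)

sampled = sample_groups_guarded(rows, group_size=3)

# Count rows per group, playing the role of the Aggregate operator.
counts = Counter((r["att1"], r["att2"]) for r in sampled)
```

Here `counts` reports 3 rows for the groups ("1", "1") and ("2", "1") and 2 rows for the undersized group ("1", "2"), so the sampling no longer fails when a group is too small.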
  • jan87 Member Posts: 14 Contributor II
    Hi Marius,

    thank you very much for your very helpful example!

    It's great that you can solve this problem with RapidMiner, when even SPSS does not seem to have a solution for it...
