[Solved] UNBALANCED DATA - Newbie Question

dynera Member Posts: 14 Contributor II
edited November 2018 in Help
Hello All,

I am new to this forum and I have read through previous posts but I'm not understanding the basic steps needed to set up a process to balance data.

I have a label with the following split: 97% = Y, 3% = N. I have used WEKA's "resample" filter in the past, which does what I would like to do in RapidMiner: essentially, it expands the under-represented value to match the over-represented one. My question is: which operator(s) should I use, and with which settings?

Sorry for the rookie question,

Paul

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hey Paul,

    If you can live with the fact that both classes are sampled with replacement, you can use the Sample (Bootstrapping) operator with weighted sampling: assign a higher weight to the minority class so that its examples are more likely to be drawn. Create the weight attribute beforehand with the Generate Attributes operator, then use Set Role to assign that attribute the role "weight". Please have a look at the attached process for the details and come back here if you have any questions left.

    For alternatives, please have a look at this thread, which discusses the topic at some length: http://rapid-i.com/rapidforum/index.php/topic,2190.0.html

    All the best,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
        <process expanded="true" height="386" width="748">
          <operator activated="true" class="subprocess" compatibility="5.3.000" expanded="true" height="76" name="Create imbalanced data" width="90" x="45" y="30">
            <process expanded="true" height="506" width="821">
              <operator activated="true" class="generate_data" compatibility="5.3.000" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
                <parameter key="number_examples" value="1000"/>
                <parameter key="number_of_attributes" value="1"/>
                <parameter key="attributes_lower_bound" value="0.0"/>
                <parameter key="attributes_upper_bound" value="1.0"/>
              </operator>
              <operator activated="true" class="generate_attributes" compatibility="5.3.000" expanded="true" height="76" name="Generate Attributes" width="90" x="246" y="30">
                <list key="function_descriptions">
                  <parameter key="label" value="if(att1&gt;0.9,1,0)"/>
                </list>
              </operator>
              <connect from_op="Generate Data" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="5.3.000" expanded="true" height="76" name="Generate Attributes (2)" width="90" x="246" y="30">
            <list key="function_descriptions">
              <parameter key="weight" value="if(label==1,10,1)"/>
            </list>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.3.000" expanded="true" height="76" name="Set Role" width="90" x="380" y="30">
            <parameter key="name" value="weight"/>
            <parameter key="target_role" value="weight"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="sample_bootstrapping" compatibility="5.3.000" expanded="true" height="76" name="Sample (Bootstrapping)" width="90" x="514" y="30">
            <parameter key="sample" value="absolute"/>
            <parameter key="sample_size" value="1000"/>
          </operator>
          <connect from_op="Create imbalanced data" from_port="out 1" to_op="Generate Attributes (2)" to_port="example set input"/>
          <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Sample (Bootstrapping)" to_port="example set input"/>
          <connect from_op="Sample (Bootstrapping)" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
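
For readers who want to see what weighted bootstrapping does outside RapidMiner, here is a minimal Python sketch. It is not RapidMiner code and not part of the attached process. It assumes 1,000 examples with the 97/3 split from the question, and the helper names `balancing_weight` and `weighted_bootstrap` are made up for illustration. The process above uses a fixed weight of 10 for the minority class; for a 97/3 split, a weight of about 97/3 ≈ 32 would be needed to draw both classes roughly equally often.

```python
import random
from collections import Counter


def balancing_weight(n_majority, n_minority):
    """Weight per minority example so both classes carry equal total weight."""
    return n_majority / n_minority


def weighted_bootstrap(weights, sample_size, seed=0):
    """Draw example indices with replacement, proportional to their weights."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=sample_size)


# Hypothetical data: 970 majority ("Y") and 30 minority ("N") examples.
labels = ["Y"] * 970 + ["N"] * 30

# Majority examples keep weight 1; minority examples get the balancing weight.
minority_weight = balancing_weight(970, 30)
weights = [minority_weight if lab == "N" else 1.0 for lab in labels]

# Resample 1,000 examples; both classes are sampled with replacement.
indices = weighted_bootstrap(weights, sample_size=1000, seed=42)
sample = [labels[i] for i in indices]
counts = Counter(sample)
print(round(minority_weight, 2), counts)
```

Because the draw is random, the two class counts will be close to 500 each rather than exactly equal. Every minority example is also repeated many times in the result, which is the trade-off of sampling with replacement mentioned above.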
  • dynera Member Posts: 14 Contributor II
    Thanks Marius - Much appreciated!  ;D