RapidMiner

Stratification: How to get the same number of examples for each class?

Regular Contributor

I have a data set with two labels: label A (6000 items) and label B (500 items).
I want to run a 10-fold cross-validation, but with sampling. For example, the 1st fold has 600 examples of label A and 50 of label B. We want to sample 50 label A examples from it and create a new 1st fold with 50 label A and 50 label B. The same is done for the other 8 training folds, so that the 9 sampled folds together are used for training and 1 fold of non-sampled data is used for testing. The process loops through the entire data set and collects the performance.

So far I can only do this one fold at a time, which is time-consuming. I was hoping to set up a process that does it automatically.
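To make the intended procedure concrete, here is a minimal sketch of its logic in plain Python. It is not RapidMiner code, and the function names (stratified_folds, balance, balanced_cv_splits) are made up for illustration. It assumes every fold contains at least one example of each label.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, rng):
    """Assign example indices to k folds so each fold keeps the class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of one class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def balance(indices, labels, rng):
    """Downsample every class in one fold to the size of its smallest class."""
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    n = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n))
    return balanced


def balanced_cv_splits(labels, k=10, seed=0):
    """Yield (train, test) index lists: train folds balanced, test fold untouched."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, rng)
    for test_fold in range(k):
        train = []
        for f, fold in enumerate(folds):
            if f != test_fold:
                train.extend(balance(fold, labels, rng))
        yield train, folds[test_fold]


labels = ["A"] * 6000 + ["B"] * 500
splits = list(balanced_cv_splits(labels))
train, test = splits[0]
train_counts = Counter(labels[i] for i in train)  # 450 "A" and 450 "B"
test_counts = Counter(labels[i] for i in test)    # 600 "A" and 50 "B"
```

Each of the 10 iterations trains on 9 balanced folds (50 + 50 examples each) and tests on the one fold that keeps the original 600/50 distribution.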

Thanks in advance for your support :)

John Quest
14 REPLIES
Regular Contributor

Re: Stratification: How to get the same number of examples for each class?

Hi,

There is no need to repeat your question. What is the difference between doing what you describe and using standard XValidation with stratified sampling, applied on an example set with 50% label A and 50% label B? If you post your XML people will take more interest.

Regular Contributor

Re: Stratification: How to get the same number of examples for each class?

My setup is as follows. I am wondering how to make the "Sample" operator set its sample size automatically to the number of examples coming out of the "Filter Examples" operator, the one with the parameter setting correctness=correct.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="386" width="681">
      <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="38" y="77">
        <parameter key="repository_entry" value="../data talbe/157000_85"/>
      </operator>
      <operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="75">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="back_freq|back_avg_distance|candidate_len|freq_keyword|snippets|suppE|suppC|keyword_id_ch|correctness|roverd|ranking|dis|lift|front_freq"/>
      </operator>
      <operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="313" y="75">
        <process expanded="true" height="431" width="373">
          <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples (2)" width="90" x="112" y="30">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="correctness=wrong"/>
          </operator>
          <operator activated="true" class="sample_stratified" expanded="true" height="76" name="Sample (Stratified)" width="90" x="246" y="30">
            <parameter key="sample_size" value="5661"/>
          </operator>
          <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="112" y="165">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="correctness=correct"/>
          </operator>
          <operator activated="true" class="append" expanded="true" height="94" name="Append" width="90" x="246" y="165"/>
          <operator activated="true" class="naive_bayes" expanded="true" height="76" name="Naive Bayes" width="90" x="246" y="300"/>
          <connect from_port="training" to_op="Filter Examples (2)" to_port="example set input"/>
          <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample (Stratified)" to_port="example set input"/>
          <connect from_op="Filter Examples (2)" from_port="original" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Sample (Stratified)" from_port="example set output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Naive Bayes" to_port="training set"/>
          <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true" height="414" width="373">
          <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="51" y="43">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance_classification" expanded="true" height="76" name="Performance" width="90" x="227" y="44">
            <list key="class_weights"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="model" to_port="result 2"/>
      <connect from_op="Validation" from_port="training" to_port="result 1"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
RMStaff

Re: Stratification: How to get the same number of examples for each class?

Hi,

This clearly goes far beyond the scope of this board (and actually also of this forum). A process like this isn't built within a minute.

However, I have created a process for the desired task and uploaded it with the Community Extension of RapidMiner under the name "Same Number of Examples per Class (Stratification; Loops and Macros)". Just download and install the Community Extension and search for the process (search this forum for more information; some can also be found in my signature below).

Cheers,
Ingo
Regular Contributor

Re: Stratification: How to get the same number of examples for each class?

Greetings O Pointy One,

You beat me to it! Drat! Can we not have a badge/smiley pointing folks there, lest we have to repeat ourselves (this exact question of balancing data comes up repeatedly)?

RMStaff

Re: Stratification: How to get the same number of examples for each class?

I might have been faster, but the solution can still be optimized ;D A good idea would be to extract the label automatically without having the user define it via a macro. The second thing is that I lose one example in the minority class :)

Anyway, I moved the discussion into this board and also made it sticky so that we can easily link to it in the future.

Cheers,
Ingo
Regular Contributor

Re: Stratification: How to get the same number of examples for each class?

Hi,

I think this covers the points you made. I must say I found the placement of the 'Append' operator a challenge; still, it does show the world of collections at work.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="335" width="791">
      <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="120">
        <parameter key="repository_entry" value="//Samples/data/Sonar"/>
      </operator>
      <operator activated="true" class="extract_macro" expanded="true" height="60" name="Extract Macro" width="90" x="179" y="120">
        <parameter key="macro" value="exs"/>
      </operator>
      <operator activated="true" class="loop_values" expanded="true" height="76" name="Loop Values" width="90" x="313" y="120">
        <parameter key="attribute" value="class"/>
        <process expanded="true" height="453" width="809">
          <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="141" y="94">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="class=%{loop_value}"/>
          </operator>
          <operator activated="true" class="extract_macro" expanded="true" height="60" name="Extract Macro (2)" width="90" x="313" y="75">
            <parameter key="macro" value="subexs"/>
          </operator>
          <operator activated="true" class="generate_macro" expanded="true" height="76" name="Generate Macro" width="90" x="447" y="75">
            <list key="function_descriptions">
              <parameter key="exs" value="min(%{subexs},%{exs})"/>
            </list>
          </operator>
          <connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Extract Macro (2)" to_port="example set"/>
          <connect from_op="Extract Macro (2)" from_port="example set" to_op="Generate Macro" to_port="through 1"/>
          <connect from_op="Generate Macro" from_port="through 1" to_port="out 1"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="loop_collection" expanded="true" height="76" name="Loop Collection" width="90" x="447" y="120">
        <parameter key="unfold" value="true"/>
        <parameter key="parallelize_iteration" value="true"/>
        <process expanded="true" height="353" width="809">
          <operator activated="true" class="sample" expanded="true" height="76" name="Sample" width="90" x="269" y="53">
            <parameter key="sample_size" value="%{exs}"/>
          </operator>
          <connect from_port="single" to_op="Sample" to_port="example set input"/>
          <connect from_op="Sample" from_port="example set output" to_port="output 1"/>
          <portSpacing port="source_single" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="append" expanded="true" height="76" name="Append" width="90" x="581" y="120"/>
      <connect from_op="Retrieve" from_port="output" to_op="Extract Macro" to_port="example set"/>
      <connect from_op="Extract Macro" from_port="example set" to_op="Loop Values" to_port="example set"/>
      <connect from_op="Loop Values" from_port="out 1" to_op="Loop Collection" to_port="collection"/>
      <connect from_op="Loop Collection" from_port="output 1" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
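
As I read the process above, it first stores the total number of examples in the macro exs, then loops over the values of the label, lowering exs to the size of the smallest class via min(%{subexs},%{exs}), and finally samples exs examples from each class subset and appends the results. The same logic as a minimal plain-Python sketch (not RapidMiner code; the function name and the toy data are hypothetical):

```python
import random
from collections import Counter, defaultdict


def balance_by_smallest_class(rows, label_key, seed=0):
    """Sample the same number of rows from every class and append the samples."""
    rng = random.Random(seed)

    # "Loop Values" + "Filter Examples": one subset per label value.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    # "Extract Macro" + "Generate Macro": start from the total count and
    # keep the minimum over all class sizes.
    exs = len(rows)
    for subset in by_class.values():
        subexs = len(subset)
        exs = min(subexs, exs)

    # "Loop Collection" + "Sample" + "Append": draw exs rows per class.
    balanced = []
    for subset in by_class.values():
        balanced.extend(rng.sample(subset, exs))
    return balanced


# Toy data: 12 "wrong" and 4 "correct" examples (made up for illustration).
rows = [{"id": i, "correctness": "wrong"} for i in range(12)]
rows += [{"id": i, "correctness": "correct"} for i in range(12, 16)]

balanced = balance_by_smallest_class(rows, "correctness")
counts = Counter(row["correctness"] for row in balanced)  # 4 of each label
```

Because the sample size is derived from the data, nothing has to be typed in by hand when the class sizes change.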



Regular Contributor

Re: Stratification: How to get the same number of examples for each class?

Thanks, I will try it out

John
Regular Contributor

Re: Stratification: How to get the same number of examples for each class?

Dear All,
I am still having some problems understanding the last XML posted by haddock; I cannot connect the macros to two outputs.
My question is still about the XML I posted on 10 June. I have made it simpler and am only looking at the problem this time; please see the attached XML code.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="396" width="779">
      <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="75">
        <parameter key="repository_entry" value="//Project CE/cep8/data talbe/157000_85"/>
      </operator>
      <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples (2)" width="90" x="179" y="30">
        <parameter key="condition_class" value="attribute_value_filter"/>
        <parameter key="parameter_string" value="correctness=wrong"/>
      </operator>
      <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="165">
        <parameter key="condition_class" value="attribute_value_filter"/>
        <parameter key="parameter_string" value="correctness=correct"/>
      </operator>
      <operator activated="true" class="sample_stratified" expanded="true" height="76" name="Sample (Stratified)" width="90" x="380" y="30">
        <parameter key="sample_size" value="1662"/>
      </operator>
      <operator activated="true" class="append" expanded="true" height="94" name="Append" width="90" x="514" y="120"/>
      <connect from_op="Retrieve" from_port="output" to_op="Filter Examples (2)" to_port="example set input"/>
      <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample (Stratified)" to_port="example set input"/>
      <connect from_op="Filter Examples (2)" from_port="original" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Append" to_port="example set 2"/>
      <connect from_op="Sample (Stratified)" from_port="example set output" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
We want the "sample_stratified" operator to take exactly as many examples as come out of the "filter_examples" operator with value="correctness=correct". Any ideas? Thanks in advance for your support.


John
RMStaff

Re: Stratification: How to get the same number of examples for each class?

Did you try the process I have uploaded with the Community Extension? Could help here...

Cheers,
Ingo