undersampling the large class

PPPP Member Posts: 9 Contributor II
edited November 2018 in Help
How can I undersample the large class in my data?

Answers

  • earmijo Member Posts: 270 Unicorn
    Split the original dataset by class (positive vs. negative). Keep all cases of the rare class and sample from the frequent class.

    Here's some code that undersamples the frequent class (Play=yes):
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.5.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="6.5.002" expanded="true" height="60" name="Retrieve Golf" width="90" x="45" y="75">
            <parameter key="repository_entry" value="//Samples/data/Golf"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="6.5.002" expanded="true" height="94" name="Multiply" width="90" x="179" y="75"/>
          <operator activated="true" class="filter_examples" compatibility="6.5.002" expanded="true" height="94" name="Filter Examples (2)" width="90" x="313" y="210">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="Play.equals.no"/>
            </list>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="6.5.002" expanded="true" height="94" name="Filter Examples" width="90" x="313" y="30">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="Play.equals.yes"/>
            </list>
          </operator>
          <operator activated="true" class="sample" compatibility="6.5.002" expanded="true" height="76" name="Sample" width="90" x="447" y="30">
            <parameter key="sample_size" value="5"/>
            <list key="sample_size_per_class"/>
            <list key="sample_ratio_per_class"/>
            <list key="sample_probability_per_class"/>
          </operator>
          <operator activated="true" class="append" compatibility="6.5.002" expanded="true" height="94" name="Append" width="90" x="514" y="120"/>
          <connect from_op="Retrieve Golf" from_port="output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Filter Examples (2)" to_port="example set input"/>
          <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Sample" to_port="example set input"/>
          <connect from_op="Sample" from_port="example set output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
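For readers who want to see the same idea outside RapidMiner, the split, sample, and append steps can be sketched in plain Python. This is only a minimal illustration, not a translation of the process above: the function name `undersample`, the row layout, and the fixed seed are assumptions made for the example.

```python
import random
from collections import Counter


def undersample(rows, label_key, frequent_value, sample_size, seed=0):
    """Keep every row of the other classes and draw `sample_size` rows
    (without replacement) from the frequent class."""
    frequent = [r for r in rows if r[label_key] == frequent_value]
    rest = [r for r in rows if r[label_key] != frequent_value]
    if sample_size > len(frequent):
        raise ValueError("sample_size exceeds the size of the frequent class")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sampled = rng.sample(frequent, sample_size)
    # Equivalent of the Append step: sampled frequent rows plus all rare rows.
    return sampled + rest


# Toy rows with the same class counts as the Golf sample set: 9 yes, 5 no.
rows = [{"id": i, "Play": "yes"} for i in range(9)]
rows += [{"id": 9 + i, "Play": "no"} for i in range(5)]

balanced = undersample(rows, "Play", "yes", sample_size=5)
counts = Counter(r["Play"] for r in balanced)
print(counts)
```

Sampling is done without replacement, so the sample size cannot exceed the number of rows in the frequent class.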
  • PPPP Member Posts: 9 Contributor II
    Many thanks for your help. I tried this technique, but my yes prediction in the confusion matrix is only 30%. I need a better technique for over/undersampling.
    Paulo Praca
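A single percentage read off a confusion matrix can be either the class precision or the class recall, and the two can differ a lot when the test data is imbalanced. The sketch below computes both from counts. The counts are hypothetical and chosen only for illustration; they are not the results from this thread.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for one class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for the "yes" class:
# tp = predicted yes and truly yes, fp = predicted yes but truly no,
# fn = predicted no but truly yes.
tp, fp, fn = 30, 70, 10
precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these made-up counts the precision is 30% while the recall is 75%, so it helps to state which of the two a reported figure refers to.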
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Are you sure that this is because of class balance?
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • PPPP Member Posts: 9 Contributor II
    No, I'm not sure. I'm a civil engineer and I have little knowledge of data mining. In my municipality, the water and sewer services produce information about the networks on a daily basis, and I think data mining techniques could be key to better understanding why and where the failures occur.

    I could send you my example data if you are interested.

    Thanks for your answer,
    Paulo Praça
  • PPPP Member Posts: 9 Contributor II
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.0.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve Desentupimentos" width="90" x="45" y="187">
            <parameter key="repository_entry" value="//Local Repository/data/Desentupimentos"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="7.0.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="187">
            <parameter key="attribute_name" value="Obstrucoes"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="7.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="187">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="Obstrucoes|ANO_INSTALACAO|COD_MATERIAL|COMP|SECCAO|SISTEMA|TIPO_REDE"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="7.0.001" expanded="true" height="103" name="Filter Examples" width="90" x="447" y="187">
            <parameter key="parameter_expression" value="!  ((missing([ANO_INSTALACAO])))"/>
            <parameter key="condition_class" value="expression"/>
            <list key="filters_list">
              <parameter key="filters_entry_key" value="COD_MATERIAL.does_not_equal.NC"/>
            </list>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="7.0.001" expanded="true" height="103" name="Filter Examples (2)" width="90" x="581" y="187">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="SECCAO.ne.0\.0"/>
            </list>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="7.0.001" expanded="true" height="103" name="Filter Examples (3)" width="90" x="45" y="289">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="COD_MATERIAL.does_not_equal.NC"/>
            </list>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="7.0.001" expanded="true" height="103" name="Filter Examples (4)" width="90" x="179" y="289">
            <parameter key="parameter_expression" value="COMP&gt;=10"/>
            <parameter key="condition_class" value="expression"/>
            <list key="filters_list"/>
          </operator>
          <operator activated="true" class="split_data" compatibility="7.0.001" expanded="true" height="103" name="Split Data" width="90" x="313" y="289">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.7"/>
              <parameter key="ratio" value="0.3"/>
            </enumeration>
            <parameter key="sampling_type" value="stratified sampling"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="7.0.001" expanded="true" height="103" name="Multiply" width="90" x="45" y="442"/>
          <operator activated="true" class="filter_examples" compatibility="7.0.001" expanded="true" height="103" name="Filter Examples (6)" width="90" x="313" y="646">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="Obstrucoes.equals.sim"/>
            </list>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="7.0.001" expanded="true" height="103" name="Filter Examples (5)" width="90" x="313" y="442">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="Obstrucoes.equals.nao"/>
            </list>
          </operator>
          <operator activated="true" class="sample" compatibility="7.0.001" expanded="true" height="82" name="Sample" width="90" x="514" y="442">
            <parameter key="sample_size" value="1800"/>
            <list key="sample_size_per_class"/>
            <list key="sample_ratio_per_class"/>
            <list key="sample_probability_per_class"/>
          </operator>
          <operator activated="true" class="append" compatibility="7.0.001" expanded="true" height="103" name="Append" width="90" x="648" y="544"/>
          <operator activated="true" class="parallel_decision_tree" compatibility="7.0.001" expanded="true" height="82" name="Decision Tree" width="90" x="849" y="544">
            <parameter key="criterion" value="gini_index"/>
            <parameter key="maximal_depth" value="8"/>
            <parameter key="minimal_gain" value="0.001"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="7.0.001" expanded="true" height="82" name="Apply Model" width="90" x="983" y="238">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="7.0.001" expanded="true" height="82" name="Performance" width="90" x="1050" y="544"/>
          <connect from_op="Retrieve Desentupimentos" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Filter Examples (2)" to_port="example set input"/>
          <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Filter Examples (3)" to_port="example set input"/>
          <connect from_op="Filter Examples (3)" from_port="example set output" to_op="Filter Examples (4)" to_port="example set input"/>
          <connect from_op="Filter Examples (4)" from_port="example set output" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_op="Multiply" to_port="input"/>
          <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Filter Examples (5)" to_port="example set input"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Filter Examples (6)" to_port="example set input"/>
          <connect from_op="Filter Examples (6)" from_port="example set output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Filter Examples (5)" from_port="example set output" to_op="Sample" to_port="example set input"/>
          <connect from_op="Sample" from_port="example set output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Apply Model" from_port="model" to_port="result 3"/>
          <connect from_op="Performance" from_port="performance" to_port="result 1"/>
          <connect from_op="Performance" from_port="example set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
          <background height="219" location="//Samples/Tutorials/Basics/06/tutorial6" width="2000" x="12" y="12"/>
        </process>
      </operator>
    </process>
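The overall shape of the process above (stratified 70/30 split, undersampling applied only to the training partition, evaluation on the untouched test partition) can be sketched in plain Python as follows. This is a hedged illustration under stated assumptions: the helper names, the toy row counts, and the sample size of 70 are hypothetical, and no model is trained here.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, label_key, train_ratio, seed=0):
    """Split rows into train/test so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's data is not shuffled
        rng.shuffle(members)
        cut = round(len(members) * train_ratio)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def undersample_class(rows, label_key, value, sample_size, seed=0):
    """Draw at most `sample_size` rows of class `value`; keep all other rows."""
    rng = random.Random(seed)
    target = [r for r in rows if r[label_key] == value]
    others = [r for r in rows if r[label_key] != value]
    return rng.sample(target, min(sample_size, len(target))) + others


# Hypothetical data: 1000 "nao" rows and 100 "sim" rows.
rows = [{"id": i, "Obstrucoes": "nao"} for i in range(1000)]
rows += [{"id": 1000 + i, "Obstrucoes": "sim"} for i in range(100)]

train, test = stratified_split(rows, "Obstrucoes", train_ratio=0.7)
# Only the training partition is rebalanced; the test partition is left alone.
train_balanced = undersample_class(train, "Obstrucoes", "nao", sample_size=70)

train_counts = Counter(r["Obstrucoes"] for r in train_balanced)
test_counts = Counter(r["Obstrucoes"] for r in test)
print(train_counts, test_counts)
```

Because the test partition keeps the original class proportions, performance measured on it reflects the imbalance of the real data rather than the balanced training set.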
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi Paulo,

    It is cool to see that another civil engineer is working with RM. One of our Sales Engineers is actually a civil engineer as well: Thomas Ott, aka neuralmarkettrends (on Twitter or YouTube).

    I think that this problem is indeed a good use case for data mining. The usual points to look at are the algorithm, feature selection, and feature generation. Of course, you can post data here and we as the community will have a look at it.

    ~Martin