Problem with overfitting

SimonK Member Posts: 20 Contributor I
Hello,

I have a problem with overfitting.
It is a classification problem with 8 label values and 6 attributes, with about 5.5 million values each.
With 10-fold cross validation, my decision tree reaches an accuracy of about 93%. Unfortunately, when I apply the model to new data, the test accuracy is only 33%.
Can anyone tell me how to prevent overfitting on the training data?

I have chosen the following parameters for the decision tree:

criterion: information gain
maximum depth: 30
apply pruning: yes
confidence: 0.24
apply prepruning: yes
minimum gain: 0.0
minimum leaf size: 1
minimum size for split: 1
number of prepruning alternatives: 0

Greetings

Simon

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,128  RM Data Scientist
    Hi,
    are there duplicates or pseudo duplicates in your data?

    Let's say you have production data for items, and the items are created in batches. Then two items from the same machine are virtually the same. Cross validation may separate them into the training and test sets, and you 'fool' your validation.

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
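
To make the point about duplicates concrete, here is a minimal Python sketch (not RapidMiner code). It builds a toy dataset in which every row appears twice, then compares a plain shuffled k-fold split with a split that keeps identical rows in the same fold. The function names and the toy data are hypothetical and only serve as an illustration; real "pseudo duplicates" would need a looser grouping key than exact equality.

```python
import random
from collections import defaultdict


def shuffled_folds(rows, k, seed=0):
    """Assign row indices to k folds after shuffling (plain k-fold)."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grouped_folds(rows, k):
    """Assign row indices to k folds so that identical rows share a fold."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row].append(i)
    folds = [[] for _ in range(k)]
    for g, members in enumerate(groups.values()):
        folds[g % k].extend(members)
    return folds


def leaked_fraction(rows, folds):
    """Fraction of test rows whose exact copy also sits in the training part."""
    leaked = total = 0
    for test_idx in folds:
        test_set = set(test_idx)
        train_rows = {rows[i] for i in range(len(rows)) if i not in test_set}
        for i in test_idx:
            total += 1
            leaked += rows[i] in train_rows
    return leaked / total


# Toy data (assumption): 500 distinct operating points, each recorded twice.
rng = random.Random(42)
unique = [(rng.random(), rng.random()) for _ in range(500)]
rows = unique + unique

print("shuffled:", leaked_fraction(rows, shuffled_folds(rows, 10)))
print("grouped: ", leaked_fraction(rows, grouped_folds(rows, 10)))
```

In the shuffled split most test rows have a twin in the training part, so a tree that memorises its training data looks far better than it is. In the grouped split no test row has a twin in training.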
  • SimonK Member Posts: 20 Contributor I
    Hello @mschmitz,

    My project is about combustion. The model is supposed to predict emissions. It may well be that some operating conditions occur more than once.
    Does the Remove Duplicates operator help here?

    Regards

    Simon
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,128  RM Data Scientist
    Hard to say. Do you have more than one combustion engine/device, and does your test set come from a different engine? That would totally explain it, because your model may have overfitted to the engine.
    Best,
    Martin
  • SimonK Member Posts: 20 Contributor I
    @mschmitz

    No, it is waste combustion.
    I use the data from 2010 to 2020 as training data and the data from 2021 as test data.
    I have also tried training the model on only 2/3 of the training data and testing it on the remaining 1/3 (to rule out that something in the process has changed since 2021), but the result is the same: a low test accuracy.

    Regards

    Simon
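
The thread does not say whether the 2/3 vs. 1/3 split was random or chronological. As a rough Python sketch of the two time-ordered variants described above (train on earlier years and test on a later one, or train on the oldest part and test on the newest part): the `date` field and the helper names are assumptions for illustration, since the attached data is only described as having attributes a1-a6 and a label.

```python
from datetime import date


def split_by_year(records, first_test_year):
    """Train on all records before first_test_year, test on the rest."""
    train = [r for r in records if r["date"].year < first_test_year]
    test = [r for r in records if r["date"].year >= first_test_year]
    return train, test


def chronological_holdout(records, train_fraction=2 / 3):
    """Train on the oldest part of the data and test on the newest part."""
    ordered = sorted(records, key=lambda r: r["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Toy records (assumption): one measurement per year from 2010 to 2021, twice.
records = [{"date": date(2010 + i % 12, 1, 1), "label": i % 8} for i in range(24)]

train, test = split_by_year(records, 2021)
print(len(train), "training rows,", len(test), "test rows")

old, new = chronological_holdout(records)
print(max(r["date"] for r in old), "<=", min(r["date"] for r in new))
```

Unlike a shuffled split, neither variant lets rows recorded close together in time end up on both sides of the split.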
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,128  RM Data Scientist
    Hi,
    maybe have a look at this older blog post of mine: https://towardsdatascience.com/when-cross-validation-fails-9bd5a57f07b5. That could be it.

    Best,
    Martin

  • SimonK Member Posts: 20 Contributor I
    Hi @mschmitz

    I have now carried out the cross validation with a batch attribute, but with the same result.
    I have attached my training dataset (1), my test dataset (2), and the XML of my process. The 6 attributes (a1-a6) are used to build a model (decision tree) that predicts the label. I get a validation accuracy of 92.33% but a test accuracy of only 37%.
    Is there another way to avoid overfitting?

    Regards

    Simon

    <?xml version="1.0" encoding="UTF-8"?><process version="9.9.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.9.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.9.000" expanded="true" height="68" name="Retrieve 1" width="90" x="45" y="85">
            <parameter key="repository_entry" value="../data/1"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="9.9.000" expanded="true" height="82" name="Set Role" width="90" x="179" y="85">
            <parameter key="attribute_name" value="label"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="model_simulator:generate_batch" compatibility="9.9.000" expanded="true" height="68" name="Generate Batch" width="90" x="313" y="85">
            <parameter key="batch attribute name" value="batch"/>
            <parameter key="number of batches" value="5"/>
          </operator>
          <operator activated="true" class="concurrency:cross_validation" compatibility="9.9.000" expanded="true" height="145" name="Cross Validation" width="90" x="447" y="85">
            <parameter key="split_on_batch_attribute" value="true"/>
            <parameter key="leave_one_out" value="false"/>
            <parameter key="number_of_folds" value="10"/>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.9.000" expanded="true" height="103" name="Decision Tree" width="90" x="179" y="34">
                <parameter key="criterion" value="information_gain"/>
                <parameter key="maximal_depth" value="30"/>
                <parameter key="apply_pruning" value="true"/>
                <parameter key="confidence" value="1.0E-7"/>
                <parameter key="apply_prepruning" value="true"/>
                <parameter key="minimal_gain" value="0.0"/>
                <parameter key="minimal_leaf_size" value="1"/>
                <parameter key="minimal_size_for_split" value="1"/>
                <parameter key="number_of_prepruning_alternatives" value="0"/>
              </operator>
              <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
              <connect from_op="Decision Tree" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.9.000" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance" compatibility="9.9.000" expanded="true" height="82" name="Performance" width="90" x="246" y="34">
                <parameter key="use_example_weights" value="true"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="retrieve" compatibility="9.9.000" expanded="true" height="68" name="Retrieve 2" width="90" x="45" y="238">
            <parameter key="repository_entry" value="../data/2"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="9.9.000" expanded="true" height="82" name="Set Role (2)" width="90" x="447" y="238">
            <parameter key="attribute_name" value="label"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="model_simulator:model_simulator" compatibility="9.9.000" expanded="true" height="103" name="Model Simulator" width="90" x="648" y="34"/>
          <connect from_op="Retrieve 1" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Generate Batch" to_port="example set"/>
          <connect from_op="Generate Batch" from_port="example set" to_op="Cross Validation" to_port="example set"/>
          <connect from_op="Cross Validation" from_port="model" to_op="Model Simulator" to_port="model"/>
          <connect from_op="Cross Validation" from_port="example set" to_op="Model Simulator" to_port="training data"/>
          <connect from_op="Cross Validation" from_port="performance 1" to_port="result 2"/>
          <connect from_op="Retrieve 2" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_op="Model Simulator" to_port="test data"/>
          <connect from_op="Model Simulator" from_port="simulator output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

    2.csv 791.2K
    1.csv 10.8M
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,128  RM Data Scientist
    Before we go deeper: are you sure that your test and training sets come from the same distribution?

    Best,
    Martin
  • SimonK Member Posts: 20 Contributor I
    Yes, they definitely come from the same distribution.

    Regards 

    Simon
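
One way to check the "same distribution" question on the label side is to compare the class shares of the two files. The following Python sketch is only an illustration. It assumes the attached `1.csv` and `2.csv` are comma-separated with a header row containing a `label` column, which the thread does not confirm. The helper names are hypothetical, and the toy labels at the end stand in for the real data.

```python
import csv
from collections import Counter


def read_labels(path, column="label"):
    """Read one column from a CSV file with a header row (assumed layout)."""
    with open(path, newline="") as handle:
        return [row[column] for row in csv.DictReader(handle)]


def label_shares(labels):
    """Relative frequency of each label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0 means identical shares, 1 means disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def majority_baseline(train_labels, test_labels):
    """Test accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


# Toy labels (assumption); with the real files one would call
# read_labels("1.csv") and read_labels("2.csv") instead.
train_labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
test_labels = ["A"] * 2 + ["B"] * 2 + ["C"] * 6

distance = total_variation(label_shares(train_labels), label_shares(test_labels))
baseline = majority_baseline(train_labels, test_labels)
print("label shift:", distance, "majority baseline on test:", baseline)
```

A large distance would indicate that the class mix differs between the two sets. The majority baseline gives a reference point for judging a test accuracy in the 33-37% range. The same comparison could be repeated per attribute (a1-a6) on binned values.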