Hi! How good is a Decision Tree for regression?

Jerwuney
Jerwuney New Altair Community Member
edited November 2024 in Community Q&A
I have tried Decision Tree regression and several other regression models (SVR, LR, ANN, GBT, RFR, etc.) on my data, and the Decision Tree performs better than all of the others.

I also tested on a new set of data, and the Decision Tree still performed better.

But I have read that Decision Trees are prone to overfitting. Can I trust my results, or could the good performance really be a sign of overfitting?
Thank you.


<?xml version="1.0" encoding="UTF-8"?><process version="9.10.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="false" class="shuffle" compatibility="9.10.001" expanded="true" height="82" name="Shuffle" width="90" x="45" y="136">
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="9.10.001" expanded="true" height="68" name="Retrieve None_Updated.xlsx" width="90" x="45" y="34">
        <parameter key="repository_entry" value="../Data/None_Updated.xlsx"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.10.001" expanded="true" height="103" name="Train-test-set" width="90" x="246" y="34">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="train.lt.1"/>
          <parameter key="filters_entry_key" value="Material.eq.1"/>
          <parameter key="filters_entry_key" value="interval.eq.1"/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.10.001" expanded="true" height="103" name="Validate-set" width="90" x="112" y="289">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="train.eq.1"/>
          <parameter key="filters_entry_key" value="Material.eq.1"/>
          <parameter key="filters_entry_key" value="interval.eq.1"/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.10.001" expanded="true" height="82" name="Set Role (2)" width="90" x="246" y="238">
        <parameter key="attribute_name" value="LeakDist"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.10.001" expanded="true" height="82" name="Set Role" width="90" x="447" y="34">
        <parameter key="attribute_name" value="LeakDist"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="split_validation" compatibility="9.10.001" expanded="true" height="124" name="Validation" width="90" x="581" y="34">
        <parameter key="create_complete_model" value="false"/>
        <parameter key="split" value="relative"/>
        <parameter key="split_ratio" value="0.8"/>
        <parameter key="training_set_size" value="100"/>
        <parameter key="test_set_size" value="-1"/>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="true"/>
        <parameter key="local_random_seed" value="1992"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.10.001" expanded="true" height="103" name="Decision Tree" width="90" x="112" y="34">
            <parameter key="criterion" value="least_square"/>
            <parameter key="maximal_depth" value="10"/>
            <parameter key="apply_pruning" value="true"/>
            <parameter key="confidence" value="0.1"/>
            <parameter key="apply_prepruning" value="true"/>
            <parameter key="minimal_gain" value="0.01"/>
            <parameter key="minimal_leaf_size" value="2"/>
            <parameter key="minimal_size_for_split" value="4"/>
            <parameter key="number_of_prepruning_alternatives" value="3"/>
          </operator>
          <connect from_port="training" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="9.10.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="136">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_regression" compatibility="9.10.001" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
            <parameter key="main_criterion" value="first"/>
            <parameter key="root_mean_squared_error" value="true"/>
            <parameter key="absolute_error" value="false"/>
            <parameter key="relative_error" value="false"/>
            <parameter key="relative_error_lenient" value="false"/>
            <parameter key="relative_error_strict" value="false"/>
            <parameter key="normalized_absolute_error" value="false"/>
            <parameter key="root_relative_squared_error" value="false"/>
            <parameter key="squared_error" value="false"/>
            <parameter key="correlation" value="false"/>
            <parameter key="squared_correlation" value="true"/>
            <parameter key="prediction_average" value="false"/>
            <parameter key="spearman_rho" value="false"/>
            <parameter key="kendall_tau" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" compatibility="9.10.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="380" y="340">
        <list key="application_parameters"/>
        <parameter key="create_view" value="false"/>
      </operator>
      <operator activated="true" class="blending:rename" compatibility="9.10.001" expanded="true" height="82" name="Rename" width="90" x="514" y="238">
        <list key="rename attributes">
          <parameter key="prediction(LeakDist)" value="predictedLeakDist"/>
        </list>
        <parameter key="from_attribute" value=""/>
        <parameter key="to_attribute" value=""/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.10.001" expanded="true" height="82" name="Generate Attributes" width="90" x="648" y="187">
        <list key="function_descriptions">
          <parameter key="Residuals" value="predictedLeakDist-LeakDist"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <connect from_op="Retrieve None_Updated.xlsx" from_port="output" to_op="Train-test-set" to_port="example set input"/>
      <connect from_op="Train-test-set" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Train-test-set" from_port="original" to_op="Validate-set" to_port="example set input"/>
      <connect from_op="Validate-set" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Validation" from_port="training" to_port="result 1"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 2"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Rename" to_port="example set input"/>
      <connect from_op="Apply Model (2)" from_port="model" to_port="result 4"/>
      <connect from_op="Rename" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="231"/>
      <portSpacing port="sink_result 5" spacing="42"/>
    </process>
  </operator>
</process>

Best Answers

  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    Answer ✓
    Hi!

    I can't test your process because it refers to a local data set. 

    But the setup looks OK. You are doing a split validation; if the data size is not too big, you could change that to a cross validation. That would test multiple models on *all* examples you put into the validation. 

    With the Cross Validation operator the process would also be simpler: you can take the test set output directly from the operator. Those results are the predicted values from the validation process.

    Decision trees are prone to overfitting. Prepruning and (post)pruning are meant to counterbalance this problem, and they often work well. If you are doing a clean validation, you will get a fair estimate of the model quality. By comparing models with different parameters you will be able to find some that get good validation results and don't look too complex. (Only very complex decision trees that have leaves for very small groups of the incoming example set are overfitted. "Very complex" is of course hard to judge without experience.)

    I always use Optimize Parameters on decision trees in order to find the best parameters for a balanced model (neither too simple nor too complex). There's an example building block in the Community Samples repository, Community Building Blocks/Optimize Decision Tree, that you could use as a template.

    You might want to try Random Forest in addition to Decision Trees. It is slower and the model is much more complex, but if decision trees work well for you, the random forest might improve your results or make the models more robust.

    Regards,
    Balázs
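
The cross-validation idea in the reply above can be sketched outside RapidMiner. This is a minimal, standard-library Python sketch and not the Cross Validation operator itself: the helper names (`k_fold_indices`, `fit_line`, `cross_validate`), the synthetic data, and the straight-line stand-in learner are all assumptions made for illustration. It shows the point made in the reply: every example is held out exactly once, so the out-of-fold predictions cover all examples.

```python
import math
import random


def k_fold_indices(n, k, seed=1992):
    """Shuffle 0..n-1 once and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Stand-in learner (an assumption): ordinary least squares on one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def cross_validate(xs, ys, fit, k=10, seed=1992):
    """Return one out-of-fold prediction per example and the overall RMSE."""
    preds = [None] * len(xs)
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            # Each example is predicted by a model that never saw it.
            preds[i] = model(xs[i])
    rmse = math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys))
    return preds, rmse


# Synthetic data, invented for illustration only.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

preds, rmse = cross_validate(xs, ys, fit_line, k=10)
print(f"out-of-fold RMSE over {len(preds)} examples: {rmse:.3f}")
```

Any learner that returns a prediction function can be passed as `fit`. The seed 1992 merely mirrors the local random seed in the posted process.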

  • Jerwuney
    Jerwuney New Altair Community Member
    Answer ✓
    Hi @BalazsBarany

    Thank you.

    My dataset has close to 5000 examples, though when running the process I have to split it into categories using the Filter Examples operator.

    With the Decision Tree, I applied both prepruning and (post)pruning to make sure I didn't have a problem with overfitting. Also, its RMSE is bigger than for some of the other models, yet they didn't perform better. I used a maximal tree depth of 10.

    Attached is the data I'm working with. You have to filter this way: 
    Material type 1, interval 1
    Material type 2, interval 1
    Material type 1, interval 2
    Material type 1, interval 3

    In the 'train' column, 0 marks examples for training and testing, and 1 marks examples for validation.

    As for trying Random Forest in addition to Decision Tree, do you mean that I should combine them like an ensemble?

    I hope this helps. Thank you.
  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    Answer ✓
    Hi!

    You can simply replace your Decision Tree with a Random Forest operator and check if the results are getting better or not. If not, then you simply go back to the decision tree.

    Regards,
    Balázs

  • Jerwuney
    Jerwuney New Altair Community Member
    Answer ✓
    Hi @BalazsBarany

    Yes, I did that. Decision Tree is still performing better. I used a new dataset from somewhere to test it and Decision Tree is still the favourite. 

    My only fear was overfitting. I took the necessary precautions, but I don't have much experience, so I wanted to hear from more experienced users.

    Regards,
    Jerwuney
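
One way to look into the overfitting concern raised in this thread is to compare training error with held-out error while varying tree complexity, which is roughly what a parameter sweep over `maximal_depth` does. The following is a small sketch under loud assumptions: it uses a toy one-feature least-squares regression tree written for illustration, not RapidMiner's Decision Tree; it has no pruning; the data is synthetic; and `build_tree`, `predict`, and `rmse` are hypothetical helpers. Only the 0.8 split ratio and the minimal leaf size of 2 are borrowed from the posted process.

```python
import math
import random


def build_tree(points, depth, max_depth, min_leaf=2):
    """Toy least-squares regression tree; points = [(x, y), ...] sorted by x."""
    ys = [y for _, y in points]
    mean = sum(ys) / len(ys)
    if depth >= max_depth or len(points) < 2 * min_leaf:
        return mean  # a leaf predicts the mean label of its examples
    best = None
    for i in range(min_leaf, len(points) - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between equal x values
        left, right = ys[:i], ys[i:]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, i)
    if best is None:
        return mean
    i = best[1]
    threshold = (points[i - 1][0] + points[i][0]) / 2
    return (threshold,
            build_tree(points[:i], depth + 1, max_depth, min_leaf),
            build_tree(points[i:], depth + 1, max_depth, min_leaf))


def predict(tree, x):
    """Walk from the root to a leaf and return that leaf's mean."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree


def rmse(tree, points):
    return math.sqrt(sum((predict(tree, x) - y) ** 2 for x, y in points) / len(points))


# Synthetic data, invented for illustration only: a noisy sine curve.
rng = random.Random(2001)
data = []
for _ in range(300):
    x = rng.uniform(0, 6)
    data.append((x, math.sin(x) + rng.gauss(0, 0.3)))

train = sorted(data[:240])  # 80 % for training, as in the posted split ratio
valid = data[240:]          # 20 % held out

results = []
for max_depth in (1, 2, 3, 5, 8, 12):
    tree = build_tree(train, 0, max_depth)
    results.append((max_depth, rmse(tree, train), rmse(tree, valid)))
    print(f"depth {max_depth:2d}: train RMSE {results[-1][1]:.3f}, "
          f"held-out RMSE {results[-1][2]:.3f}")

best_depth = min(results, key=lambda r: r[2])[0]
print("depth with lowest held-out RMSE:", best_depth)
```

If the training RMSE keeps falling while the held-out RMSE stalls or rises, the extra depth is fitting noise. A depth near the held-out minimum is then a reasonable choice.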
