Hi! How good is Decision Tree for regression?

Jerwuney · December 2021

I have used Decision Tree regression and other regression models (SVR, LR, ANN, GBT, RFR, etc.) on my data, and the Decision Tree performs better than all of the others.

I also tested on a new set of data, and the Decision Tree still performed better.

But I have read that Decision Trees have overfitting problems. Can I trust my results, or could the problem really be overfitting?
Thank you.

[Attached RapidMiner process XML, version 9.10.001; contents not recoverable]

BalazsBaranyRM · December 2021

Hi!

I can't test your process because it refers to a local data set.

But the setup looks OK. You are doing a split validation; if the data size is not too big, you could change that to a cross validation. That would test multiple models on *all* examples you put into the validation.

With the Cross Validation operator the process would be simpler if you grab the test set output from that. Those results are the predicted values in the validation process.
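As a rough illustration of the idea outside RapidMiner, here is a minimal Python sketch of k-fold cross validation that collects one out-of-fold prediction per example. The function names (`k_fold_indices`, `cross_val_predict`, `fit_mean`) and the toy data are made up for this example, and the mean predictor is only a stand-in for a real learner.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Stand-in learner (hypothetical): always predicts the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def cross_val_predict(xs, ys, fit, k=5, seed=0):
    """Train k models; each example is predicted by the one model that never saw it."""
    preds = [None] * len(xs)
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            preds[i] = model(xs[i])
    return preds

# Toy data, invented for the example.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

folds = k_fold_indices(len(xs), 5)
preds = cross_val_predict(xs, ys, fit_mean, k=5)
print(len(preds), "out-of-fold predictions for", len(xs), "examples")
```

Unlike a single split, every example ends up in the test role exactly once, which is what makes the resulting error estimate use all of the data.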

Decision trees are prone to overfitting. Prepruning and (post)pruning are meant to counterbalance this problem and they often work well. If you are doing a clean validation, you will get a fair estimation of the model quality. By comparing models with different parameters you will be able to find some that get good validation results and don't look too complex. (Only very complex decision trees that have leaves for very small groups of the incoming example set are overfitted. "Very complex" is of course hard to tell without experience.)
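To make the overfitting point concrete, here is a small Python sketch that is not RapidMiner's implementation: a toy one-feature regression tree with two prepruning knobs, compared on training data and on fresh data. The names `build_tree`, `max_depth`, and `min_leaf`, the synthetic sine data, and the three parameter settings are all assumptions chosen for illustration.

```python
import math
import random

def sse(ys):
    """Sum of squared deviations from the mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def build_tree(points, max_depth, min_leaf):
    """Greedy regression tree on (x, y) pairs with a single numeric feature."""
    points = sorted(points)
    ys = [y for _, y in points]
    leaf = {"value": sum(ys) / len(ys)}
    if max_depth == 0:
        return leaf
    best = None
    # Try every split that leaves at least min_leaf points on each side.
    for i in range(min_leaf, len(points) - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between equal x values
        cost = sse(ys[:i]) + sse(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return leaf
    i = best[1]
    return {"threshold": points[i - 1][0],
            "left": build_tree(points[:i], max_depth - 1, min_leaf),
            "right": build_tree(points[i:], max_depth - 1, min_leaf)}

def predict(tree, x):
    while "value" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]

def rmse(tree, points):
    return math.sqrt(sum((predict(tree, x) - y) ** 2 for x, y in points) / len(points))

# Synthetic data (an assumption for the demo): noisy sine curve.
rng = random.Random(0)
def sample(n):
    return [(x, math.sin(x) + rng.gauss(0, 0.3))
            for x in (rng.uniform(0, 10) for _ in range(n))]

train, test = sample(150), sample(150)

results = {}
for name, depth, min_leaf in [("shallow", 2, 1), ("prepruned", 6, 10), ("deep", 30, 1)]:
    tree = build_tree(train, depth, min_leaf)
    results[name] = (rmse(tree, train), rmse(tree, test))
    print(f"{name:10s} train RMSE={results[name][0]:.3f}  test RMSE={results[name][1]:.3f}")
```

The thing to look at is the gap between the two columns: a tree that is allowed to grow leaves for single examples fits the training data almost perfectly, and only the held-out error shows whether that fit generalizes.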

I always use Optimize Parameters on decision trees in order to find the best parameters for a balanced model (not too simple or complex). There's an example building block in the Community Samples repository: Community Building Blocks/Optimize Decision Tree that you could use as a template.

You might want to try Random Forest in addition to Decision Trees. It is slower and the model is much more complex, but if decision trees work well for you, the random forest might improve your results or make the models more robust.

Regards,
Balázs

Jerwuney · December 2021

Hi @BalazsBarany

Thank you.

My dataset has close to 5000 examples, though when running the process I have to split it into categories using the Filter Examples operator.

With the Decision Tree, I applied both prepruning and postpruning to make sure I didn't have a problem with overfitting. Also, its RMSE is bigger than that of some of the other models, yet they didn't perform better. I used a maximal tree depth of 10.

Attached is the data I'm working with. You have to filter this way:
Material type 1, interval 1
Material type 2, interval 1
Material type 1, interval 2
Material type 1, interval 3

'train': 0 is for training and testing, and 1 is for validation.

As for trying Random Forest with Decision Tree, do you mean I should combine them like an ensemble?

I hope this will help. Thank you.

BalazsBaranyRM · December 2021

Hi!

You can simply replace your Decision Tree with a Random Forest operator and check if the results are getting better or not. If not, then you simply go back to the decision tree.
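For background on why this is a drop-in swap: a random forest is itself an ensemble of decision trees, each trained on a bootstrap sample, with the predictions averaged. The Python sketch below illustrates only that bagging part, using a toy one-feature tree; a real Random Forest also samples feature subsets at each split, which has no effect with a single feature. All names, the synthetic data, and the parameter values are assumptions for the example, not RapidMiner's implementation.

```python
import math
import random

def sse(ys):
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def build_tree(points, max_depth, min_leaf):
    """Greedy regression tree on (x, y) pairs with a single numeric feature."""
    points = sorted(points)
    ys = [y for _, y in points]
    leaf = {"value": sum(ys) / len(ys)}
    if max_depth == 0:
        return leaf
    best = None
    for i in range(min_leaf, len(points) - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between equal x values
        cost = sse(ys[:i]) + sse(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return leaf
    i = best[1]
    return {"threshold": points[i - 1][0],
            "left": build_tree(points[:i], max_depth - 1, min_leaf),
            "right": build_tree(points[i:], max_depth - 1, min_leaf)}

def predict(tree, x):
    while "value" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]

def build_forest(points, n_trees, max_depth, min_leaf, seed=0):
    """Bagging: each tree is trained on a bootstrap resample of the data."""
    rng = random.Random(seed)
    return [build_tree([rng.choice(points) for _ in points], max_depth, min_leaf)
            for _ in range(n_trees)]

def predict_forest(forest, x):
    """The ensemble prediction is the average of the individual tree predictions."""
    return sum(predict(tree, x) for tree in forest) / len(forest)

# Synthetic data (an assumption for the demo): noisy sine curve.
rng = random.Random(1)
train = [(x, math.sin(x) + rng.gauss(0, 0.3))
         for x in (rng.uniform(0, 10) for _ in range(100))]

single_tree = build_tree(train, 8, 1)
forest = build_forest(train, n_trees=25, max_depth=8, min_leaf=1)
print("single tree at x=5:", predict(single_tree, 5.0))
print("forest at x=5:     ", predict_forest(forest, 5.0))
```

Averaging many trees tends to smooth out the quirks of any individual tree, which is the robustness argument; whether it actually beats a single well-pruned tree on a given data set is something only validation can tell.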

Regards,
Balázs

Jerwuney · December 2021

Hi @BalazsBarany

Yes, I did that. Decision Tree is still performing better. I used a new dataset from somewhere to test it and Decision Tree is still the favourite.

My fear was just the overfitting and I don’t have much experience even though I took the necessary precautions. So I wanted to hear from more experienced users.

Regards,
Jerwuney
