How can I compare a decision tree and linear regression using Cross Validation (X-Validation)?
Attached are some relevant pictures of my setup and stats on the target variable:
1) The setup of my .rmp, 2) a histogram of the target variable, 3) a plot of CO2 (the target variable) vs. the primary principal component
I would like to use Cross Validation to compare models, especially categorical models like decision trees against numerical models like linear regression. I have been learning about cross validation in my RapidMiner class, but I am not 100% sure what exactly accuracy, precision, and recall are for classification prediction and regression prediction operators. For example, I would like to use class precision and class recall to compare models, but I don't know what they would be for regression, because the confusion matrix is based on a nominal label, not a numerical label.
So how can I compare a decision tree and linear regression using Cross Validation (X-Validation)? What statistic or metric could I use?
Below are the stats output from my results:
Target Variable Stats
CO2 Emissions Average: 87,405.93 Deviation: 628,363.80
Performance of Linear Regression:
root_mean_squared_error: 34,017.261 +/- 5548.473 (mikro: 34467.846 +/- 0.000)
normalized_absolute_error: 0.151 +/- 0.040 (mikro: 0.140)
Performance of Decision Tree:
accuracy: 90.65% +/- 4.13% (mikro: 90.64%)
root_mean_squared_error: 0.282 +/- 0.063 (mikro: 0.289 +/- 0.000)
normalized_absolute_error: 3.143 +/- 5.800 (mikro: 1.209)
Avg. Class Precision: 62.1%
Avg. Class Recall: 68%
Performance of Random Forest:
accuracy: 82.14% +/- 1.92% (mikro: 82.14%)
root_mean_squared_error: 0.392 +/- 0.025 (mikro: 0.393 +/- 0.000)
normalized_absolute_error: 1.017 +/- 0.117 (mikro: 0.993)
Avg. Class Precision: 86.3%
Avg. Class Recall: 40%
Performance of Neural Network:
root_mean_squared_error: 23,815.976 +/- 4305.543 (mikro: 24211.353 +/- 0.00)
normalized_absolute_error: 0.126 +/- 0.037 (mikro: 0.122)
Performance of Generalized Linear Model (Default values):
root_mean_squared_error: 219,027.497 +/- 45537.878
normalized_absolute_error: 1.017 +/- 0.117 (mikro: 0.993)
Best Answer

JEdward Posts: 564 Unicorn
What you might want to do is transform your regression performance into a classification result by discretizing the label and prediction variables with the same rules you applied for your domain-expert-defined bins.
Then you are comparing like for like.
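To make the idea concrete outside of RapidMiner, here is a minimal Python sketch of it. The bin edges, the sample values, and the function names (discretize, binned_accuracy) are all made up for illustration; in a real process you would plug in the same rules the domain expert defined for the classification label.

```python
from bisect import bisect_right

# Hypothetical bin edges standing in for the domain-expert rules:
# value < 4 -> bin 0, 4 <= value < 7 -> bin 1, 7 <= value < 10 -> bin 2, value >= 10 -> bin 3
BIN_EDGES = [4, 7, 10]

def discretize(value, edges=BIN_EDGES):
    """Return the index of the bin that a numeric value falls into."""
    return bisect_right(edges, value)

def binned_accuracy(labels, predictions, edges=BIN_EDGES):
    """Share of examples whose predicted bin equals the bin of the true label."""
    hits = sum(
        discretize(y, edges) == discretize(p, edges)
        for y, p in zip(labels, predictions)
    )
    return hits / len(labels)

# Made-up numeric labels and regression predictions, for illustration only
labels = [1, 3, 5, 8, 10, 6]
predictions = [2.1, 4.2, 5.5, 6.9, 9.7, 6.0]

regression_accuracy = binned_accuracy(labels, predictions)
print(regression_accuracy)
```

Because the label and the prediction go through the same discretize function, the resulting accuracy sits on the same scale as the accuracy reported for the decision tree.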
However, one caution I would give on your classification prediction is to think about how your classification model is penalized for misclassifications.
Imagine you have a numerical label with values 1 to 10.
After binning, your label has the following nominal values:
Value 1: 1-3
Value 2: 4-6
Value 3: 7-9
Value 4: 10
Now if an example has an original numeric value of 3 and your classification model predicts that it is in group 'Value 2: 4-6', then although this is a misclassification, it is actually more accurate than if it had predicted 'Value 4: 10'. However, looking purely at Accuracy, Precision & Recall won't reflect this. Both misclassifications, as 'Value 4: 10' and as 'Value 2: 4-6', get the same performance value of 0, which is just not correct.
I would recommend that you use the Performance (Costs) operator and create a misclassification cost matrix. That way you can reflect that misclassifications in nearby groups are 'less costly' than those in more distant groups.
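As a rough illustration of what such a cost matrix expresses, here is a small Python sketch. It is not how the Performance (Costs) operator is implemented; the choice of "cost = number of bins between true and predicted group" and the names distance_cost_matrix and average_cost are assumptions made for this example only.

```python
def distance_cost_matrix(n_bins):
    """costs[true][pred] = |true - pred|: 0 on the diagonal, larger for bins further apart."""
    return [[abs(t - p) for p in range(n_bins)] for t in range(n_bins)]

def average_cost(true_bins, predicted_bins, costs):
    """Mean misclassification cost over all examples."""
    total = sum(costs[t][p] for t, p in zip(true_bins, predicted_bins))
    return total / len(true_bins)

def accuracy(true_bins, predicted_bins):
    """Plain accuracy: every miss counts the same, however far off it is."""
    hits = sum(t == p for t, p in zip(true_bins, predicted_bins))
    return hits / len(true_bins)

costs = distance_cost_matrix(4)  # four ordered bins, indexed 0..3

# An example whose true bin is the first one (index 0), misclassified two ways
true_bins = [0]
near_miss = [1]  # predicted the neighbouring bin
far_miss = [3]   # predicted the furthest bin

print(accuracy(true_bins, near_miss), accuracy(true_bins, far_miss))
print(average_cost(true_bins, near_miss, costs), average_cost(true_bins, far_miss, costs))
```

Accuracy scores both misses identically, while the average cost separates the near miss from the far one. Any other cost scheme that grows with the distance between groups would make the same point.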
Answers
Thanks, I eventually realized that it really is like trying to compare fruit and vegetables.