- Re: How can compare decision tree and linear regre...


12-09-2016
09:42 PM

12-09-2016
09:49 PM
Attached are some relevant pictures of my setup and stats on the target variable:

1) The setup of my .rmp, 2) a histogram of the target variable, 3) a plot of CO2 (the target variable) vs. the primary principal component

I would like to compare models with cross-validation, especially categorical models like decision trees against numerical models like linear regression. I have been learning about cross-validation in my RapidMiner class, but I am not 100% sure what accuracy, precision, and recall mean for classification operators versus regression operators. For example, I would like to use class precision and class recall to compare models, but I don't know what they would be for regression, because the confusion matrix is based on a nominal label, not a numerical one.

So how can I compare a decision tree and linear regression using Cross Validation (X-Validation)? What statistic or metric could I use?

Below are the stats output from my results:

Target Variable Stats (CO2 Emissions):
- Average: 87,405.93
- Deviation: 628 363.80

Performance of Linear Regression:
- root_mean_squared_error: 34,017.261 +/- 5548.473 (mikro: 34467.846 +/- 0.000)
- normalized_absolute_error: 0.151 +/- 0.040 (mikro: 0.140)

Performance of Decision Tree:
- accuracy: 90.65% +/- 4.13% (mikro: 90.64%)
- root_mean_squared_error: 0.282 +/- 0.063 (mikro: 0.289 +/- 0.000)
- normalized_absolute_error: 3.143 +/- 5.800 (mikro: 1.209)
- Avg. Class Precision: 62.1%
- Avg. Class Recall: 68%

Performance of Decision Random Forest:
- accuracy: 82.14% +/- 1.92% (mikro: 82.14%)
- root_mean_squared_error: 0.392 +/- 0.025 (mikro: 0.393 +/- 0.000)
- normalized_absolute_error: 1.017 +/- 0.117 (mikro: 0.993)
- Avg. Class Precision: 86.3%
- Avg. Class Recall: 40%

Performance of Neural Network:
- root_mean_squared_error: 23,815.976 +/- 4305.543 (mikro: 24211.353 +/- 0.00)
- normalized_absolute_error: 0.126 +/- 0.037 (mikro: 0.122)

Performance of General Linearized Model (Default values):
- root_mean_squared_error: 21,9027.497 +/- 45537.878
- normalized_absolute_error: 1.017 +/- 0.117 (mikro: 0.993)


2 REPLIES

Solution


12-11-2016
12:24 AM

12-11-2016
12:26 AM
What you might want to do is transform the regression performance into a classification result by discretizing the label and prediction variables with the same rules you applied for your domain-expert-defined bins.

Then you are comparing like for like.
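To illustrate the idea outside RapidMiner, here is a minimal Python sketch. The bin edges, bin names, and sample values are hypothetical placeholders, not the actual domain-expert bins or data from this thread. It discretizes both the true label and the numeric prediction with the same rule and then computes accuracy on the resulting classes.

```python
from bisect import bisect_right

# Hypothetical bin edges and names; substitute the domain-expert rules.
BIN_EDGES = [10_000, 50_000, 100_000]
BIN_NAMES = ["low", "medium", "high", "very high"]


def discretize(value, edges=BIN_EDGES, names=BIN_NAMES):
    """Map a numeric value to a bin name; a value equal to an edge goes to the upper bin."""
    return names[bisect_right(edges, value)]


def binned_accuracy(actual, predicted):
    """Share of examples whose true value and prediction land in the same bin."""
    hits = sum(discretize(a) == discretize(p) for a, p in zip(actual, predicted))
    return hits / len(actual)


# Made-up true labels and regression predictions, for illustration only.
actual = [5_000, 42_000, 87_000, 250_000]
predicted = [7_500, 55_000, 91_000, 180_000]

regression_accuracy = binned_accuracy(actual, predicted)
print(regression_accuracy)
```

Because the same `discretize` rule is applied to both columns, the resulting accuracy can sit next to the decision tree's accuracy, as long as the tree was trained on labels binned with that same rule.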

**However**

One caution I would give on your classification prediction is to think about how your classification model is measured against misclassifications.

Imagine you have a numerical label with values 1 to 10.

After binning your label has the following nominal values.

Value 1: 1-3

Value 2: 4-6

Value 3: 7-9

Value 4: 10

Now suppose an example has an original numeric value of 3 and your classification model puts it in group 'Value 2: 4-6'. Although this is a misclassification, it is closer to the truth than a prediction of 'Value 4: 10'. However, Accuracy, Precision & Recall won't reflect this. Both misclassifications, as 'Value 4: 10' and as 'Value 2: 4-6', get the same performance value of 0, which hides the difference between a near miss and a far miss.

I would recommend that you use the **Performance (Costs)** operator and create a misclassification cost matrix. That way you can reflect that misclassifications into nearby groups are 'less costly' than those into more distant groups.
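The effect of such a cost matrix can be sketched in plain Python using the four example groups above. This is not the RapidMiner operator itself. The choice of cost (the absolute distance between bin positions) and the sample labels are assumptions made for illustration, and you may prefer different costs for your domain.

```python
CLASSES = ["Value 1", "Value 2", "Value 3", "Value 4"]

# Assumed cost rule: COST[i][j] is the cost of predicting class j when the
# true class is i, set here to the distance between bin positions.
COST = [[abs(i - j) for j in range(len(CLASSES))] for i in range(len(CLASSES))]
INDEX = {name: k for k, name in enumerate(CLASSES)}


def accuracy(true_labels, predicted_labels):
    """Plain accuracy: every misclassification counts the same."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def average_cost(true_labels, predicted_labels):
    """Mean misclassification cost looked up in the cost matrix."""
    total = sum(COST[INDEX[t]][INDEX[p]] for t, p in zip(true_labels, predicted_labels))
    return total / len(true_labels)


# Made-up labels: both models miss the first example, one nearby and one far away.
truth = ["Value 1", "Value 1", "Value 3"]
near_miss = ["Value 2", "Value 1", "Value 3"]
far_miss = ["Value 4", "Value 1", "Value 3"]

print(accuracy(truth, near_miss), accuracy(truth, far_miss))
print(average_cost(truth, near_miss), average_cost(truth, far_miss))
```

Both prediction sets have identical accuracy, but the average cost is lower for the near miss, which is the distinction the cost matrix is meant to capture.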

-- Training, Consulting, Sales in China, Hong Kong & Taiwan --

www.RapidMinerChina.com



08-16-2017
02:15 PM
