Validation of the model and adjusted R Squared

masterand Member Posts: 1 Newbie
Hey Community,

I have a question regarding the validation of my model (I used the cross validation operator). 
I created a prediction model (label: numeric) and therefore used the algorithms "Linear Regression", "Neural Net" and "Deep Learning". For validation I chose the RMSE, the relative error and the squared correlation (R squared).

I read that R squared improves as more attributes are added, and that the adjusted R squared should be used to correct for this. Is the adjusted R squared available in RapidMiner Studio, or is the reported value already adjusted?

To improve my model I also tested with the "Select attributes" operator and noticed the following:

When I selected all attributes I had this performance:

Case 1
Linear Regression (RMSE 0,8 | Relative Error 14,2 | R squared 0,63)
Neural Net (RMSE 0,86 | Relative Error 15,91 | R squared 0,65)
Deep Learning (RMSE 0,78 | Relative Error 11,72 | R squared 0,68)

In this case Deep Learning appears to be the best model.
I then removed some attributes before modeling and got the following results:

Case 2
Linear Regression (RMSE 0,79 | Relative Error 14,0 | R squared 0,68)
Neural Net (RMSE 0,89 | Relative Error 17,29 | R squared 0,66)
Deep Learning (RMSE 0,79 | Relative Error 13,19 | R squared 0,65)

I do not really know whether the Linear Regression in the second case is a better model than the Deep Learning in the first case (R squared got better but Relative Error got worse). Can somebody help?

Thanks a lot!

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,630 Unicorn
    The Rsq here is not adjusted, but there are several parameters for feature selection in the LR modeling operator that can be used to prevent overfitting.
    As for which model is best, there is no simple way to answer this question based purely on these performance metrics. You have to know the use case to understand the tradeoff between a slightly higher relative error and a higher Rsq. It would also probably help to look at the underlying data to determine whether the relationship looks like it should be linear or not. Additional feature engineering may help improve all of the model fits.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • balmerhevi Member Posts: 2 Contributor I
    edited June 6
    R-squared describes the goodness of fit of a particular model with no regard for the number of independent variables, whereas adjusted R-squared takes the number of independent variables into account.
    So if you have a regression equation such as
    y = m*x + n*x1 + o*x2 + b
    the R-squared tells you how well that equation describes your data. If you add more independent variables (p, q, r, s ...), the R-squared on the sample data will not decrease and will usually improve, because you are in essence fitting your sample data more specifically. The adjusted R-squared instead takes into account that you have added more independent variables and 'penalizes' the result for added variables that contribute little to the fit. This is a good way to test the variables, either by adding them one at a time and checking when the adj-R2 starts to deteriorate, or by starting with all the variables and removing one at a time until the adj-R2 stops improving.
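
To make the adjustment described above concrete, here is a minimal Python sketch (not RapidMiner code) of the standard formula adj-R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of examples and p the number of independent variables, together with the "add one variable at a time" procedure. The function names, the plain least-squares fit via NumPy, and the stopping rule are assumptions chosen for illustration; they are not taken from the thread or from any RapidMiner operator.

```python
import numpy as np

def r_squared(y, y_pred):
    """Plain R^2: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R^2; requires n_samples > n_predictors + 1."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

def fit_predict(X, y):
    """Ordinary least squares with an intercept; returns in-sample predictions."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

def forward_select(X, y):
    """Hypothetical helper: add one column at a time while adj-R^2 improves."""
    n, p = X.shape
    chosen, best = [], -np.inf
    remaining = list(range(p))
    while remaining:
        # Score every candidate column when added to the current selection.
        scores = {}
        for j in remaining:
            cols = chosen + [j]
            r2 = r_squared(y, fit_predict(X[:, cols], y))
            scores[j] = adjusted_r_squared(r2, n, len(cols))
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break  # adj-R^2 stopped improving, so stop adding variables
        chosen.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return chosen, best
```

If the number of examples and attributes is known, `adjusted_r_squared` can be applied by hand to the squared correlation reported by the performance operator. The thread does not state those counts, so no adjusted values are given here. Note also that the adjustment is a correction for in-sample fit, whereas metrics from cross-validation are already measured on held-out data.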

