question marks in linear regression output

AD2019AD2019 Member, University Professor Posts: 13 University Professor
I ran a linear regression model with 18 independent variables and feature selection turned off.  For some of the independent variables there were question marks for the standard error of the estimate, and therefore for the t-statistic and p-value for the coefficient.  I ran the model again with feature selection turned on and got the same question marks.  What do these question marks mean?  They cannot have anything to do with missing values, as the regression would not have run to completion in that case.  I am baffled about what these "?" symbols might mean.  Help.....

Answers

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Can you post your process xml?  Do you have the bias parameter checked in the LR operator or the exclude collinear features?  There are several options that can affect the output.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • AD2019AD2019 Member, University Professor Posts: 13 University Professor
    Hi, I have attached my process .rmp file.  'Exclude collinear features' is unchecked, and you are correct about the bias: if 'use bias' is checked, I do not get question marks; if it is unchecked, I do.  I did all this with 'feature selection' turned off.  Something else is also strange.  I then turned on feature selection and used T_Test as the selection method with alpha set to 0.05.  I got a solution that included independent variables with p-values much higher than 0.05.  I am confused about why these IVs were not trimmed from the output.  Thanks in advance for your help.
  • AD2019AD2019 Member, University Professor Posts: 13 University Professor
    By the way, regardless of the cause, I would like to know what the question mark in the regression output is trying to communicate to the user.  Does it mean a computational underflow or overflow, a computational error, or something else?
  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hi @AD2019 I'm picking up this thread here. I have your process (thank you) but not the data set - hence I cannot run the process. Can you pls post?
  • AD2019AD2019 Member, University Professor Posts: 13 University Professor
    My apologies for the delay in posting the data file.  Please see attached.  When I run the regression without bias, I get question marks in the regression model.  What does that mean?  The process file was posted earlier (RM-houseprice-process.rmp).
  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hi @AD2019 do you mean these ? marks?



    So the simple answer is that ? marks are used in RapidMiner when values are missing. The better question is why are they missing...my educated guess here (pls correct me @varunm1 @mschmitz if my stats are wrong here) is that there can be no std coefficient or tolerance for an intercept of a LinReg model as it's a computed value. All of your actual data (the other attributes) have std coefficients which make sense. But my stats are a wee bit rusty so I look to these other smart folks to correct me. :wink:

    Scott

  • AD2019AD2019 Member, University Professor Posts: 13 University Professor
    Hi Scott:
    If you run the process with bias turned off, you will get question marks for some of the independent variables as well, not just the intercept.  Since there is a question mark on the standard error for these variables, the t-statistic and p-value also show question marks.  So it is not just an issue of the intercept.  The data set does not have missing values, so I could not figure out what the question marks were trying to say.  The only thing I could think of was numerical overflow or underflow when calculating the standard error of the associated variable, but then I could not see how the coefficients would have been computed.
    Amit
  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hi Amit -

    Ah I understand. Good point. It's been a while since I've played with all of this (we normally use the GLM modeler instead of LinReg as it is far more versatile and robust). Let me investigate.

    Scott

  • AD2019AD2019 Member, University Professor Posts: 13 University Professor
    Thanks Scott.  Let me play around with GLM and see if I can get rid of the ? marks.
  • AD2019AD2019 Member, University Professor Posts: 13 University Professor
    thank you Varun.
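
On Amit's question of how coefficients could be computed when the standard errors could not: one general situation where this happens is a rank-deficient design matrix. A least-squares solution still exists, but the usual standard-error formula needs the inverse of X'X, which does not. The thread does not establish that this is what happened with this data set, and nothing below reflects RapidMiner's actual implementation. It is a minimal NumPy sketch with made-up data, and the function name `ols_with_standard_errors` and its choice to return NaN for undefined standard errors are assumptions made for illustration.

```python
import numpy as np

def ols_with_standard_errors(X, y):
    """Least-squares fit; standard errors are NaN when X is rank deficient.

    Hypothetical helper for illustration only, not RapidMiner's algorithm.
    """
    n, p = X.shape
    # Minimum-norm least-squares solution; it exists even if X'X is singular.
    coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    if rank < p:
        # (X'X)^-1 does not exist, so the usual formula is undefined.
        return coef, np.full(p, np.nan)
    residuals = y - X @ coef
    sigma2 = residuals @ residuals / (n - p)   # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)      # coefficient covariance matrix
    return coef, np.sqrt(np.diag(cov))

# Synthetic data (assumption): two predictors, no intercept term.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=50)

X_ok = np.column_stack([x1, x2])
X_collinear = np.column_stack([x1, x2, x1 + x2])  # third column is redundant

coef_ok, se_ok = ols_with_standard_errors(X_ok, y)
coef_bad, se_bad = ols_with_standard_errors(X_collinear, y)

print("full rank  :", coef_ok, se_ok)
print("collinear  :", coef_bad, se_bad)
```

In the collinear case the coefficients come back as finite numbers while every standard error is NaN, which a tool could reasonably display as a missing value. Checking the rank of the attribute matrix (with and without the bias column) is one way to test whether this explanation applies to a given data set.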