Linear Regression: error in calculation of tolerance

dhampton Member Posts: 14 Contributor II
edited November 2018 in Help

I am writing training materials for multiple regression.  The Linear Regression Operator is giving what seems to be incorrect calculations for tolerance.

 

 To illustrate, see attached toy dataset. My process reads this data and uses Linear Regression to do y=f(x1, x2, x3, x4). The model is then applied to the training data (just to keep things simple) and finally I use Performance to get R-squared. The result is:

 

Attribute    Coefficient             Standard Error         Std. Coefficient       Tolerance            t-stat               p-value                code
X1           0.6099442233747938      0.097076731571145      0.8324180612316422     0.4913830335394965   6.283114537367604    1.4384283423596322E-4  ****
X2           -2.8474043342377822E-8  1.9598479705266512E-7  -0.028568714232080603  0.40108726248304105  0.0                  1.0
X3           0.178312419929975       0.0821213306746008     0.7990271382036194     0.4534020133333492   2.1713289161925995   0.05798784094691456    *
X4           -0.0010830494516547503  7.82512989580685E-4    -0.49206399607097406   0.262094151203384    -1.3840657804736376  0.19969313341637596
(Intercept)  -0.3277299280807463     0.161204140113176      NaN                    NaN                  -2.033011855965102   0.07258034063737584    *

 

I cross-checked the results with Minitab: RapidMiner and Minitab agree on everything except tolerance. Minitab reports VIFs rather than tolerances, but a VIF is simply the reciprocal of the tolerance. Here is the Minitab output:

Term      Coef       SE Coef   T-Value  P-Value  VIF
Constant  -0.328     0.161     -2.03    0.073
x1        0.6099     0.0971    6.28     0.000    2.53
x2        -0.000000  0.000000  -0.15    0.888    5.58
x3        0.1783     0.0821    2.17     0.058    19.54
x4        -0.001083  0.000783  -1.38    0.200    18.24

 

The VIFs are a long way from the reciprocals of the tolerances.
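
To make the size of the gap concrete, here is a minimal Java sketch that takes the reciprocal of each Minitab VIF and sets it beside the tolerance RapidMiner reported. The numbers are copied from the two outputs above (RapidMiner's rounded to four places); the class and method names are made up for illustration.

```java
import java.util.Locale;

/** Compares the tolerances implied by Minitab's VIFs with those RapidMiner reported. */
class VifToleranceCheck {

    /** Tolerance is the reciprocal of the variance inflation factor. */
    static double toleranceFromVif(double vif) {
        return 1.0 / vif;
    }

    public static void main(String[] args) {
        String[] names = {"x1", "x2", "x3", "x4"};
        // VIFs copied from the Minitab output in this post.
        double[] minitabVif = {2.53, 5.58, 19.54, 18.24};
        // Tolerances copied from the RapidMiner output in this post, rounded to four places.
        double[] rapidMinerTolerance = {0.4914, 0.4011, 0.4534, 0.2621};

        for (int i = 0; i < names.length; i++) {
            double implied = toleranceFromVif(minitabVif[i]);
            System.out.printf(Locale.ROOT,
                    "%s: 1/VIF = %.4f, RapidMiner tolerance = %.4f, ratio = %.1f%n",
                    names[i], implied, rapidMinerTolerance[i], rapidMinerTolerance[i] / implied);
        }
    }
}
```

In every row the RapidMiner figure is larger than the reciprocal of the VIF, and the gap is widest for x3 and x4.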

 

I calculated the values directly: tolerance = 1 - R-sq, where R-sq is obtained by regressing that x on all the other xs. For example, if I drop y, make x4 the label and re-run the process, I get an R-sq of 94.5%, so the tolerance for x4 should be 0.055, not 0.262.
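
The direct calculation described above can be sketched in plain Java as an ordinary least-squares fit with an intercept, solved through the normal equations. This is only an illustration under stated assumptions: the dataset is attached rather than quoted in the thread, so the example uses small made-up data, and the names `ToleranceSketch`, `solve` and `tolerance` are hypothetical rather than anything from RapidMiner.

```java
import java.util.Locale;

/** Sketch: tolerance of one column = 1 - R^2 from an OLS fit (with intercept) on the others. */
class ToleranceSketch {

    /** Solves the square system a * x = b by Gaussian elimination with partial pivoting. */
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            // Use the row with the largest entry in this column as the pivot.
            int pivot = col;
            for (int row = col + 1; row < n; row++) {
                if (Math.abs(a[row][col]) > Math.abs(a[pivot][col])) {
                    pivot = row;
                }
            }
            double[] swapRow = a[col];
            a[col] = a[pivot];
            a[pivot] = swapRow;
            double swapValue = b[col];
            b[col] = b[pivot];
            b[pivot] = swapValue;
            // Eliminate this column from the rows below.
            for (int row = col + 1; row < n; row++) {
                double factor = a[row][col] / a[col][col];
                b[row] -= factor * b[col];
                for (int k = col; k < n; k++) {
                    a[row][k] -= factor * a[col][k];
                }
            }
        }
        // Back substitution.
        double[] x = new double[n];
        for (int row = n - 1; row >= 0; row--) {
            double sum = b[row];
            for (int k = row + 1; k < n; k++) {
                sum -= a[row][k] * x[k];
            }
            x[row] = sum / a[row][row];
        }
        return x;
    }

    /** Returns 1 - R^2 for column `target` regressed on every other column of `data`. */
    static double tolerance(double[][] data, int target) {
        int n = data.length;
        int p = data[0].length; // (p - 1) predictors plus one intercept = p coefficients
        double[][] design = new double[n][p];
        double[] y = new double[n];
        double mean = 0.0;
        for (int i = 0; i < n; i++) {
            design[i][0] = 1.0; // intercept column
            int k = 1;
            for (int j = 0; j < p; j++) {
                if (j != target) {
                    design[i][k++] = data[i][j];
                }
            }
            y[i] = data[i][target];
            mean += y[i] / n;
        }
        // Normal equations: (X'X) beta = X'y.
        double[][] xtx = new double[p][p];
        double[] xty = new double[p];
        for (int i = 0; i < n; i++) {
            for (int r = 0; r < p; r++) {
                xty[r] += design[i][r] * y[i];
                for (int c = 0; c < p; c++) {
                    xtx[r][c] += design[i][r] * design[i][c];
                }
            }
        }
        double[] beta = solve(xtx, xty);
        double residualSumSq = 0.0;
        double totalSumSq = 0.0;
        for (int i = 0; i < n; i++) {
            double fitted = 0.0;
            for (int r = 0; r < p; r++) {
                fitted += beta[r] * design[i][r];
            }
            residualSumSq += (y[i] - fitted) * (y[i] - fitted);
            totalSumSq += (y[i] - mean) * (y[i] - mean);
        }
        // residualSumSq / totalSumSq equals 1 - R^2, i.e. the tolerance.
        return residualSumSq / totalSumSq;
    }

    public static void main(String[] args) {
        // Hypothetical data (NOT the dataset from this thread): the third column is
        // close to the sum of the first two, so the columns are strongly collinear.
        double[][] data = {
            {1.0, 2.0, 3.1}, {2.0, 1.0, 2.9}, {3.0, 4.0, 7.2},
            {4.0, 3.0, 6.8}, {5.0, 6.0, 11.1}, {6.0, 5.0, 10.9}
        };
        for (int j = 0; j < 3; j++) {
            double tol = tolerance(data, j);
            System.out.printf(Locale.ROOT, "column %d: tolerance = %.4f, VIF = %.2f%n", j, tol, 1.0 / tol);
        }
    }
}
```

Swapping the hypothetical array for the four x columns of the real dataset should reproduce the hand calculation, provided the columns are not exactly collinear.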

 

Am I going wrong, or is it an error?

 

Many thanks

 

David Hampton

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hey David,

     

    I've dived into the code and saw no real issue except for possible numeric instabilities. Did you try normalizing first and comparing the results?

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Or were there any other parameters modified (e.g. ridge regression value) that might be affecting the calculation?  

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • dhampton Member Posts: 14 Contributor II

    Many thanks for your prompt reply Martin.  I have checked this... normalizing changes all the coefficients and their standard errors, as you would expect, but does not affect tolerances (or p-values for that matter) so it's not being caused by that.

    Attribute    Coefficient           Standard Error       Std. Coefficient       Tolerance            t-stat                p-value                code
    X1           1.8298325346724833    0.29123019471343986  0.8324179996125617     0.4913830514888129   6.283114072264977     1.438429138445052E-4   ****
    X2           -0.04666048376439675  0.321161266854193    -0.028568669272726246  0.40108725714143556  -0.14528677203649398  0.8876862107223876
    X3           1.2481866266339015    0.5748493147222091   0.7990269379160296     0.4534020060669054   2.1713283719179115    0.05798789235186819    *
    X4           -0.9021798318881504   0.6518333203207141   -0.4920637989900229    0.2620941472452639   -1.3840652261290682   0.1996932973423048
    (Intercept)  0.3846904324227236    0.1089314893504385   NaN                    NaN                  3.531489697943573     0.0063989680350855505  ***

     

    A simple check of whether something is wrong is to calculate the tolerance directly: I re-ran the regression model without y and made x4 the label instead. This gives the R-sq of x4 against all the other attributes directly. I get an R-squared of 0.945, so the tolerance of X4 should be 1 - 0.945 = 0.055, a long way from the 0.262 that RapidMiner reports.

     

    Thanks for your patience  with this...

     

    David

  • dhampton Member Posts: 14 Contributor II

    Thanks Brian

    For training purposes I begin with no feature selection, no elimination of collinear features and no regularisation.  Adding in either feature selection or removal of collinear features sweeps away some of the xs and so masks the problem with the tolerance calculations (but doesn't solve it!)... adding in regularisation makes only a very small difference - even with a ridge of 0.1 the tolerances reduce by only about 15-20% and they are several times too big... so it's not that.

    cheers

    David

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    David,

     

    I've checked the code, which I attach here. It looks very good. I know that our LinReg was benchmarked extensively against other tools, e.g. R, and held up well. Did you compare it to some other tool, and are you sure about your VIF interpretation? Maybe @DArnu can help. He has some background here.

     

    ~Martin

     

    double getTolerance(ExampleSet exampleSet, boolean[] isUsedAttribute, int testAttributeIndex, double ridge,
            boolean useIntercept) throws UndefinedParameterError, ProcessStoppedException {
        // Separate the attribute under test from the other used attributes.
        List<Attribute> attributeList = new LinkedList<>();
        Attribute currentAttribute = null;
        int resultAIndex = 0;
        for (Attribute a : exampleSet.getAttributes()) {
            if (isUsedAttribute[resultAIndex]) {
                if (resultAIndex != testAttributeIndex) {
                    attributeList.add(a);
                } else {
                    currentAttribute = a;
                }
            }
            resultAIndex++;
        }

        Attribute[] usedAttributes = new Attribute[attributeList.size()];
        attributeList.toArray(usedAttributes);

        // Regress the tested attribute on the remaining attributes.
        double[] localCoefficients = performRegression(exampleSet, usedAttributes, currentAttribute, ridge);
        double[] attributeValues = new double[exampleSet.size()];
        double[] predictedValues = new double[exampleSet.size()];
        int eIndex = 0;
        for (Example e : exampleSet) {
            attributeValues[eIndex] = e.getValue(currentAttribute);
            int aIndex = 0;
            double prediction = 0.0d;
            for (Attribute a : usedAttributes) {
                prediction += localCoefficients[aIndex] * e.getValue(a);
                aIndex++;
            }
            if (useIntercept) {
                // The intercept is stored as the last coefficient.
                prediction += localCoefficients[localCoefficients.length - 1];
            }
            predictedValues[eIndex] = prediction;
            eIndex++;
        }

        // Tolerance = 1 - squared correlation between actual and predicted values.
        double correlation = MathFunctions.correlation(attributeValues, predictedValues);
        double tolerance = 1.0d - correlation * correlation;
        return tolerance;
    }
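
    One property of the formula in the last lines of this method may be worth keeping in mind when cross-checking. Among all linear combinations of the predictors, the least-squares fit that includes an intercept has the highest squared correlation with the target, so 1 - corr^2 matches 1 - R-sq only when the coefficients come from that fit; coefficients from any other fit can only make the result larger. The thread does not show `performRegression`, so whether this applies here is an open question, not a finding. The Java sketch below illustrates the property with two predictors and made-up data; all class and method names are hypothetical.

    ```java
    import java.util.Locale;

    /** Sketch: 1 - corr(actual, fitted)^2 for fits with and without an intercept. */
    class CorrelationToleranceSketch {

        static double mean(double[] v) {
            double sum = 0.0;
            for (double value : v) {
                sum += value;
            }
            return sum / v.length;
        }

        static double[] centered(double[] v) {
            double m = mean(v);
            double[] out = new double[v.length];
            for (int i = 0; i < v.length; i++) {
                out[i] = v[i] - m;
            }
            return out;
        }

        /** Pearson correlation between two equally long arrays. */
        static double correlation(double[] a, double[] b) {
            double[] ca = centered(a);
            double[] cb = centered(b);
            double sab = 0.0, saa = 0.0, sbb = 0.0;
            for (int i = 0; i < a.length; i++) {
                sab += ca[i] * cb[i];
                saa += ca[i] * ca[i];
                sbb += cb[i] * cb[i];
            }
            return sab / Math.sqrt(saa * sbb);
        }

        /** Least-squares slopes for x ~ b1*z1 + b2*z2 with NO intercept (2x2 normal equations). */
        static double[] slopesWithoutIntercept(double[] z1, double[] z2, double[] x) {
            double s11 = 0.0, s12 = 0.0, s22 = 0.0, s1x = 0.0, s2x = 0.0;
            for (int i = 0; i < x.length; i++) {
                s11 += z1[i] * z1[i];
                s12 += z1[i] * z2[i];
                s22 += z2[i] * z2[i];
                s1x += z1[i] * x[i];
                s2x += z2[i] * x[i];
            }
            double det = s11 * s22 - s12 * s12;
            return new double[] {(s1x * s22 - s2x * s12) / det, (s2x * s11 - s1x * s12) / det};
        }

        /** Slopes of the fit WITH an intercept: the same solve on mean-centered data. */
        static double[] slopesWithIntercept(double[] z1, double[] z2, double[] x) {
            return slopesWithoutIntercept(centered(z1), centered(z2), centered(x));
        }

        /** Same formula as the end of the posted method: 1 - corr(actual, fitted)^2. */
        static double toleranceFromCorrelation(double[] z1, double[] z2, double[] x, double[] slopes) {
            double[] fitted = new double[x.length];
            for (int i = 0; i < x.length; i++) {
                // Adding a constant intercept here would not change the correlation.
                fitted[i] = slopes[0] * z1[i] + slopes[1] * z2[i];
            }
            double r = correlation(x, fitted);
            return 1.0 - r * r;
        }

        public static void main(String[] args) {
            // Hypothetical data (NOT the dataset from this thread): x is roughly 3 + z1 + z2.
            double[] z1 = {1.0, 2.0, 3.0, 4.0, 5.0};
            double[] z2 = {2.0, 1.0, 4.0, 3.0, 6.0};
            double[] x = {6.1, 5.9, 10.2, 9.8, 14.0};
            double withIntercept = toleranceFromCorrelation(z1, z2, x, slopesWithIntercept(z1, z2, x));
            double withoutIntercept = toleranceFromCorrelation(z1, z2, x, slopesWithoutIntercept(z1, z2, x));
            System.out.printf(Locale.ROOT, "with intercept: %.6f, without intercept: %.6f%n",
                    withIntercept, withoutIntercept);
        }
    }
    ```

    Running the sketch on the real dataset, one predictor at a time, would show whether the intercept-included value matches 1/VIF from R and Minitab.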
  • dhampton Member Posts: 14 Contributor II

    Many thanks Martin.

     

    I have checked using R, with the car package to get VIFs. The coefficients match RapidMiner's exactly, and R gives the same VIFs as Minitab (i.e. it contradicts RapidMiner).

     

    Here's my R output:

     

    > summary(book1Model)

    Call:
    lm(formula = Y ~ ., data = trial)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.18858 -0.03629 -0.01287  0.02995  0.38796

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -3.277e-01  1.612e-01  -2.033 0.072580 .
    X1           6.099e-01  9.708e-02   6.283 0.000144 ***
    X2          -2.847e-08  1.960e-07  -0.145 0.887686
    X3           1.783e-01  8.212e-02   2.171 0.057988 .
    X4          -1.083e-03  7.825e-04  -1.384 0.199693
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1671 on 9 degrees of freedom
    Multiple R-squared: 0.9376, Adjusted R-squared: 0.9099
    F-statistic: 33.82 on 4 and 9 DF, p-value: 1.973e-05

    > vif(book1Model)
           X1        X2        X3        X4
     2.532610  5.579088 19.539216 18.237488

     

    So assuming that RapidMiner's code is OK, something must be wrong with my Linear Regression operator. I deleted and replaced it, no change.

    For clarity, the parameter settings I am using are:

    Feature selection: none

    Do not eliminate collinear features

    Use bias

    Ridge 0

     

    I believe this should get me equivalent output to R and Minitab.  Still the same error.  I must be doing something wrong but feel that I have pretty much exhausted the possibilities!

     

    thanks

     

    David
