RapidMiner

Linear Regression: error in calculation of tolerance

Regular Contributor

I am writing training materials for multiple regression. The Linear Regression operator is producing what appear to be incorrect values for tolerance.

 

To illustrate, see the attached toy dataset. My process reads this data and uses Linear Regression to fit y = f(x1, x2, x3, x4). The model is then applied to the training data (just to keep things simple) and finally I use Performance to get R-squared. The result is:

 

Attribute     Coefficient   Std. Error   Std. Coefficient   Tolerance   t-stat    p-value    Code
X1            0.6099        0.0971       0.8324             0.4914      6.283     0.0001     ****
X2            -2.847e-8     1.960e-7     -0.0286            0.4011      0.000     1.0000
X3            0.1783        0.0821       0.7990             0.4534      2.171     0.0580     *
X4            -0.001083     0.000783     -0.4921            0.2621      -1.384    0.1997
(Intercept)   -0.3277       0.1612       NaN                NaN         -2.033    0.0726     *

 

I cross-checked the results against Minitab. RapidMiner and Minitab agree on everything except tolerance. Minitab reports VIFs, but these are simply the reciprocals of the tolerances. Here is the Minitab output:

Term        Coef        SE Coef     T-Value   P-Value   VIF
Constant    -0.328      0.161       -2.03     0.073
x1          0.6099      0.0971      6.28      0.000     2.53
x2          -0.000000   0.000000    -0.15     0.888     5.58
x3          0.1783      0.0821      2.17      0.058     19.54
x4          -0.001083   0.000783    -1.38     0.200     18.24

 

The VIFs are a long way from the reciprocals of the tolerances: for x4, 1/0.262 ≈ 3.8, nowhere near Minitab's 18.24.

 

I calculated the values directly: tolerance = 1 − R², where R² is obtained by regressing each x against all the other xs. For example, if I drop y, make x4 the label, and re-run the process, I get an R² of 94.5%; the tolerance for x4 should therefore be 0.055, not 0.262.
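The same check can be done outside RapidMiner. Here is a minimal NumPy sketch on synthetic data (the attached toy dataset is not reproduced here, so the variables and numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors with deliberate collinearity (illustrative only --
# not the attached toy dataset).
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.8 * x1 + 0.2 * rng.normal(size=n)   # x3 nearly collinear with x1
X = np.column_stack([x1, x2, x3])

def tolerance(X, j):
    """Tolerance of predictor j: 1 - R^2 from regressing column j
    on the remaining predictors (with an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([others, np.ones(len(y))])  # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return 1.0 - r_squared

tol3 = tolerance(X, 2)
vif3 = 1.0 / tol3   # VIF is, by definition, the reciprocal of tolerance
print(f"tolerance(x3) = {tol3:.4f}, VIF(x3) = {vif3:.2f}")
```

Because x3 is built mostly from x1, its tolerance comes out small (and its VIF large), which is the pattern Minitab reports for x3 and x4 above.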

 

Am I going wrong, or is it an error?

 

Many thanks

 

David Hampton

Moderator

Re: Linear Regression: error in calculation of tolerance

Hey David,

 

I've dived into the code and saw no real issue except for possible numeric instabilities. Did you try normalizing first and comparing the results?

 

~Martin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Elite III

Re: Linear Regression: error in calculation of tolerance

Or were there any other parameters modified (e.g. ridge regression value) that might be affecting the calculation?  

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
Regular Contributor

Re: Linear Regression: error in calculation of tolerance

Many thanks for your prompt reply, Martin. I have checked this: normalizing changes all the coefficients and their standard errors, as you would expect, but it does not affect the tolerances (or the p-values, for that matter), so that's not the cause.

Attribute     Coefficient   Std. Error   Std. Coefficient   Tolerance   t-stat    p-value    Code
X1            1.8298        0.2912       0.8324             0.4914      6.283     0.0001     ****
X2            -0.0467       0.3212       -0.0286            0.4011      -0.1453   0.8877
X3            1.2482        0.5748       0.7990             0.4534      2.171     0.0580     *
X4            -0.9022       0.6518       -0.4921            0.2621      -1.384    0.1997
(Intercept)   0.3847        0.1089       NaN                NaN         3.531     0.0064     ***

 

A simple check for whether something is indeed wrong is to calculate the tolerance directly: I re-ran the regression without y, making x4 the label. This directly computes the R² of x4 regressed on all the other attributes. I get an R² of 0.954, so the tolerance of x4 should be 1 − 0.954 = 0.046 ... a long way from the figure RapidMiner gives, 0.262.

 

Thanks for your patience with this...

 

David

Regular Contributor

Re: Linear Regression: error in calculation of tolerance

Thanks Brian

For training purposes I begin with no feature selection, no elimination of collinear features, and no regularisation. Adding either feature selection or removal of collinear features sweeps away some of the xs and so masks the problem with the tolerance calculations (but doesn't solve it!). Adding regularisation makes only a very small difference: even with a ridge of 0.1 the tolerances drop by only about 15-20%, and they remain several times too big. So it's not that either.

cheers

David

Moderator

Re: Linear Regression: error in calculation of tolerance

David,

 

I've checked the code, which I attach here. It looks fine to me. I know our LinReg has been benchmarked a lot against e.g. R and did well. Did you compare it to some other tool, and are you sure about your VIF interpretation? Maybe @DArnu can help; he has some background here.

 

~Martin

 

	double getTolerance(ExampleSet exampleSet, boolean[] isUsedAttribute, int testAttributeIndex, double ridge,
			boolean useIntercept) throws UndefinedParameterError, ProcessStoppedException {
		// Separate the attribute under test from the remaining used attributes.
		List<Attribute> attributeList = new LinkedList<>();
		Attribute currentAttribute = null;
		int resultAIndex = 0;
		for (Attribute a : exampleSet.getAttributes()) {
			if (isUsedAttribute[resultAIndex]) {
				if (resultAIndex != testAttributeIndex) {
					attributeList.add(a);
				} else {
					currentAttribute = a;
				}
			}
			resultAIndex++;
		}

		Attribute[] usedAttributes = new Attribute[attributeList.size()];
		attributeList.toArray(usedAttributes);

		// Auxiliary regression: predict the test attribute from all the others.
		double[] localCoefficients = performRegression(exampleSet, usedAttributes, currentAttribute, ridge);
		double[] attributeValues = new double[exampleSet.size()];
		double[] predictedValues = new double[exampleSet.size()];
		int eIndex = 0;
		for (Example e : exampleSet) {
			attributeValues[eIndex] = e.getValue(currentAttribute);
			int aIndex = 0;
			double prediction = 0.0d;
			for (Attribute a : usedAttributes) {
				prediction += localCoefficients[aIndex] * e.getValue(a);
				aIndex++;
			}
			if (useIntercept) {
				prediction += localCoefficients[localCoefficients.length - 1];
			}
			predictedValues[eIndex] = prediction;
			eIndex++;
		}

		// Tolerance = 1 - R^2 of the auxiliary regression, computed here as
		// one minus the squared correlation between actual and fitted values.
		double correlation = MathFunctions.correlation(attributeValues, predictedValues);
		double tolerance = 1.0d - correlation * correlation;
		return tolerance;
	}
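For what it's worth, the 1 − corr² formula used here is mathematically equivalent to 1 − R² when the auxiliary regression is a plain OLS fit with an intercept: R² equals the squared correlation between the response and the fitted values. A quick sketch in Python (illustrative only, using synthetic data, not the RapidMiner internals):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression problem.
n = 40
X = rng.normal(size=(n, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=n)

# OLS fit with an intercept column.
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef

# R^2 from sums of squares...
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# ...equals the squared correlation between actual and fitted values.
corr = np.corrcoef(y, fitted)[0, 1]
print(r_squared, corr ** 2)
```

So if the formula itself is sound, any discrepancy would have to come from the auxiliary regression (e.g. the ridge value passed to performRegression, or which attributes end up in usedAttributes).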
--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Regular Contributor

Re: Linear Regression: error in calculation of tolerance


Many thanks Martin.

 

I have checked using R with the car package to get VIFs. The coefficients match RapidMiner's exactly, and R gives the same VIFs as Minitab (i.e., contradicting RapidMiner).

 

Here's my R output:

 

> summary(book1Model)

Call:
lm(formula = Y ~ ., data = trial)

Residuals:
Min 1Q Median 3Q Max
-0.18858 -0.03629 -0.01287 0.02995 0.38796

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.277e-01 1.612e-01 -2.033 0.072580 .
X1 6.099e-01 9.708e-02 6.283 0.000144 ***
X2 -2.847e-08 1.960e-07 -0.145 0.887686
X3 1.783e-01 8.212e-02 2.171 0.057988 .
X4 -1.083e-03 7.825e-04 -1.384 0.199693
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1671 on 9 degrees of freedom
Multiple R-squared: 0.9376, Adjusted R-squared: 0.9099
F-statistic: 33.82 on 4 and 9 DF, p-value: 1.973e-05

> vif(book1Model)
X1 X2 X3 X4
2.532610 5.579088 19.539216 18.237488

 

So, assuming that RapidMiner's code is OK, something must be wrong with my Linear Regression operator. I deleted and replaced it; no change.

For clarity, the parameter settings I am using are:

Feature selection: none

Do not eliminate collinear features

Use bias

Ridge 0

 

I believe this should give me output equivalent to R and Minitab. Still the same error. I must be doing something wrong, but I feel I have pretty much exhausted the possibilities!

 

thanks

 

David