6.2.3 Polynomial Regression - estimation of polynomial models relating Y to powers of X


This statlet fits models relating a dependent variable Y to a single independent variable X. It fits a polynomial model of the form

Y = a + bX + cX^2 + dX^3 + ... + e

The method of least squares is used to estimate the model coefficients. The deviations e around the fitted curve are assumed to be normally and independently distributed with a mean of 0 and a standard deviation sigma which does not depend on X.
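As a rough illustration of the estimation step (not the statlet's own algorithm), the following Python sketch fits a second-order polynomial by least squares with numpy; the x and y arrays are placeholders standing in for the weeks and chlorine columns, not the actual example data:

    import numpy as np

    # Placeholder data standing in for the weeks (X) and chlorine (Y) columns.
    x = np.array([8.0, 12.0, 16.0, 20.0, 24.0, 28.0])
    y = np.array([0.49, 0.46, 0.45, 0.44, 0.43, 0.42])

    # Fit Y = a + b*X + c*X^2 by least squares.
    # np.polyfit returns the coefficients from the highest power down.
    c, b, a = np.polyfit(x, y, deg=2)
    print(a, b, c)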

The tabs are:

Input

Summary

Fitted Model Plot

ANOVA

R-Plots

Residuals

Predictions

Models


Example

The example data shows the results of studies on the amount of chlorine available in a product as a function of the length of time since the product was produced. A total of 44 cartons were tested at ages ranging from 8 weeks to 42 weeks after production:


Input

Enter the names of the columns containing the values of the dependent variable (Y) and the independent variable (X):

You may use the spinners to transform either or both of the variables.

Back to Top


Summary

This tab shows the results of fitting the polynomial model:

By default, a second-order polynomial model is fit, although this may be changed by selecting the Models tab and pressing the Options button.

Of particular interest in the table are:

Estimate - this column shows the estimates of each of the coefficients in the model. The fitted model is

Chlorine = 0.540367 - 0.00826602*weeks + 0.000117085*weeks^2

Std. Error - this column shows the estimated standard errors of each coefficient. The standard errors measure the estimation error in the coefficients and are used to test hypotheses concerning their true values.

t-value - this column shows t statistics, computed as

t = estimate / standard error

which can be used to test whether or not the coefficient is significantly different from zero.

P-value - the results of hypothesis tests of the form H0: coefficient = 0 versus HA: coefficient ≠ 0.

The P-value gives the probability of getting a value of t with absolute value greater than that observed if the null hypothesis H0 is true. If the P-value falls below 0.05, then we reject the null hypothesis at the 5% significance level. We could then assert that we are 95% confident that the true coefficient is not zero.

In the case of the chlorine data, note that all P-values are well below 0.05 (even below 0.01). Of particular interest is the P-value for weeks^2. Since that coefficient is significant, it shows that a second-order model is significantly better than a straight line.
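A minimal sketch of this coefficient table, assuming the weeks and chlorine columns are available as the x and y arrays from the earlier sketch, can be produced with statsmodels:

    import numpy as np
    import statsmodels.api as sm

    # Design matrix with an intercept, weeks, and weeks^2.
    X = sm.add_constant(np.column_stack([x, x ** 2]))
    fit = sm.OLS(y, X).fit()

    print(fit.params)     # coefficient estimates
    print(fit.bse)        # standard errors
    print(fit.tvalues)    # t = estimate / standard error
    print(fit.pvalues)    # two-sided P-values for H0: coefficient = 0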

Several other statistics have also been calculated:

R-squared, also called the coefficient of determination, which measures the percentage of the variability in Y that is explained by the model. R-squared is defined by

R-squared = 100*(Model SSQ / Total Corrected SSQ)%

where SSQ represents the sum of squares attributable to a factor.

The adjusted R-squared, which adjusts the above statistic for the number of independent variables in the model according to

Adj. R-squared = 100*(1-[(n-1)/(n-p)]*Error SSQ / Total Corrected SSQ)%

where n is the number of observations and p is the number of estimated coefficients in the model. This is a better statistic to use when comparing models with different numbers of coefficients.

The standard error of the estimate, which estimates the standard deviation of the noise or deviations around the fitted model.

The mean absolute error (MAE), which is the average of the absolute values of the residuals (the average error in fitting the data points).

The Durbin-Watson statistic, which looks for autocorrelation in the residuals based upon their sequential order. If the data were recorded over time, this statistic could detect time trends which, if accounted for, could improve the fitted model. Unfortunately, there is no P-value associated with the Durbin-Watson statistic, and it must be compared to tables to determine how significant it is. As a rule of thumb, when the Durbin-Watson statistic falls below 1.4 or 1.5, one should plot the residuals versus row number to see if there is any noticeable correlation between residuals close together in time. For a perfectly random sequence, the Durbin-Watson statistic would equal 2.0.
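The following sketch computes these summary statistics directly from their definitions; y holds the observed values, fitted the model predictions, and p the number of estimated coefficients (3 for the second-order model). It illustrates the formulas above rather than the statlet's internal code:

    import numpy as np

    def summary_statistics(y, fitted, p):
        n = len(y)
        resid = y - fitted                                      # ordinary residuals, in data order
        total_ssq = np.sum((y - y.mean()) ** 2)                 # total corrected SSQ
        error_ssq = np.sum(resid ** 2)                          # residual (error) SSQ
        model_ssq = total_ssq - error_ssq
        r2 = 100.0 * model_ssq / total_ssq                      # R-squared, in percent
        adj_r2 = 100.0 * (1.0 - (n - 1) / (n - p) * error_ssq / total_ssq)
        see = np.sqrt(error_ssq / (n - p))                      # standard error of the estimate
        mae = np.mean(np.abs(resid))                            # mean absolute error
        dw = np.sum(np.diff(resid) ** 2) / error_ssq            # Durbin-Watson statistic
        return r2, adj_r2, see, mae, dw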

Back to Top


Fitted Model Plot

This tab shows a plot of the fitted model:

Also displayed are two sets of limits:

Prediction limits for the average of k new observations taken at a specific value of X (outer limits). By default, k=1, so the limits apply to single additional observations. When drawn at the default level of 95% confidence, they indicate the region in which we expect 95% of any additional observations to lie.

Confidence limits for the mean of many new observations taken at a specific value of X (inner limits). These limits display how well the location of the line has been estimated given the current sample size.
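Assuming the statsmodels fit from the Summary sketch, both sets of limits can be reproduced as shown below; the values in x_new are hypothetical, and the outer limits correspond to the default case k=1:

    import numpy as np
    import statsmodels.api as sm

    x_new = np.array([10.0, 20.0, 30.0])                          # hypothetical values of weeks
    X_new = sm.add_constant(np.column_stack([x_new, x_new ** 2]))

    frame = fit.get_prediction(X_new).summary_frame(alpha=0.05)   # 95% limits
    print(frame[['mean', 'mean_ci_lower', 'mean_ci_upper']])      # confidence limits (inner)
    print(frame[['obs_ci_lower', 'obs_ci_upper']])                # prediction limits for k = 1 (outer)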

Options button

Use the options button to specify the type of limits to be displayed and the confidence level for those limits:

Back to Top


ANOVA

This tab displays an analysis of variance table:

The ANOVA table divides the total variability in Y into two pieces: one piece due to the model, and one piece left in the residuals, such that

Total Corrected SSQ = Model SSQ + Residual SSQ

where SSQ stands for "Sum of Squares". The R-squared statistic displayed by the Summary tab is the ratio

R-squared = Model SSQ / Total Corrected SSQ

If the data contains replicate measurements (more than one data value at the same X), the table will also display the result of a lack-of-fit test run to determine whether the fitted polynomial adequately describes the relationship between Y and X. It decomposes the residual sum of squares into two pieces:

Residual SSQ = Lack-of-fit SSQ + Pure Error SSQ

In essence, the variability of the residuals around the fitted curve is being divided into two pieces: a lack-of-fit component, which measures how far the average value of Y at each X falls from the fitted curve, and a pure error component, which measures the variability among replicate measurements taken at the same value of X.

The table shows the results of an F test comparing the estimated lack-of-fit to pure error through

F = Lack-of-Fit Mean Square / Pure Error Mean Square

Of primary interest is the P-value associated with the test. Small values of P (less than 0.05) indicate significant lack-of-fit at the 5% significance level. In such a case, you should select the Models tab which allows you to increase the order of the polynomial.
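The decomposition and F test can be sketched as follows, assuming the data contain replicates; x, y, and fitted are arrays in data order and p is the number of estimated coefficients. This is an illustration of the calculation, not the statlet's implementation:

    import numpy as np
    from scipy import stats

    def lack_of_fit_test(x, y, fitted, p):
        resid_ssq = np.sum((y - fitted) ** 2)                 # residual SSQ
        # Pure error: variability of replicate Y values around their group mean.
        pure_error_ssq = 0.0
        groups = np.unique(x)
        for xv in groups:
            g = y[x == xv]
            pure_error_ssq += np.sum((g - g.mean()) ** 2)
        lof_ssq = resid_ssq - pure_error_ssq                  # lack-of-fit SSQ
        df_lof = len(groups) - p                              # distinct X values minus coefficients
        df_pe = len(y) - len(groups)                          # requires replicate measurements
        f = (lof_ssq / df_lof) / (pure_error_ssq / df_pe)
        p_value = stats.f.sf(f, df_lof, df_pe)                # upper-tail P-value
        return f, p_value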

Back to Top


R-Plots

This tab plots the residuals from the fitted model versus values of X:

By definition, the residuals are equal to the observed data values minus the values predicted by the fitted model. You may plot either:

Ordinary residuals - in units of the dependent variable.

Studentized residuals - in units of standard deviations.

The Studentized residual equals

Studentized residual = residual / s_residual

where s_residual is the estimated standard error of the residual when the line is fit using all data values except the one for which the residual is being computed. These "Studentized deleted residuals" therefore measure how many standard deviations each point is away from the line when the line is fit without that point. In moderately sized data sets, Studentized residuals of 3.0 or greater in absolute value may well indicate outliers which should be treated separately.
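Assuming the statsmodels fit from the Summary sketch, the Studentized deleted residuals can be obtained as shown below; values exceeding 2.0 in absolute value are the ones listed on the Residuals tab:

    from statsmodels.stats.outliers_influence import OLSInfluence

    influence = OLSInfluence(fit)
    studentized = influence.resid_studentized_external    # deleted ("external") form
    print(studentized[abs(studentized) > 2.0])             # candidates for the Residuals table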

Options button

Enter the type of residuals to be plotted and what they should be plotted against:

Back to Top


Residuals

This table shows any Studentized residual in excess of 2.0 in absolute value:

The Studentized residual equals

Studentized residual = residual / s_residual

where s_residual is the estimated standard error of the residual when the line is fit using all data values except the one for which the residual is being computed. These "Studentized deleted residuals" therefore measure how many standard deviations each point is away from the line when the line is fit without that point. In moderately sized data sets, Studentized residuals of 3.0 or greater in absolute value may well indicate outliers which should be treated separately.

Back to Top


Predictions

This tab creates a table of predicted values for the dependent variable:

Each prediction is shown with two sets of limits, corresponding to the type of limits displayed on the Fitted Model Plot:

Prediction limits for the average of k new observations taken at a specific value of X. By default, k=1, so the limits apply to single additional observations.

Confidence limits for the mean of many new observations taken at a specific value of X (inner limits).

Options button

Use the Options button on the Fitted Model Plot tab to modify the default selections.

Back to Top


Models

This tab shows the significance of each term in the current polynomial model as it was added to the fit:

Also shown is the Adjusted R-Squared which would be achieved using a first-order model, a second-order model, etc. If the P-value for the highest order term is greater than 0.05, and/or its adjusted R-squared is less than that of the model listed directly above it, you might elect to reduce the order of the polynomial by 1.
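A sketch of this comparison, assuming the x and y arrays from the earlier sketches, fits models of increasing order and reports the P-value of the highest-order term and the adjusted R-squared for each; a non-significant highest-order term or a drop in adjusted R-squared suggests reducing the order:

    import numpy as np
    import statsmodels.api as sm

    for order in range(1, 5):
        X = sm.add_constant(np.column_stack([x ** k for k in range(1, order + 1)]))
        fit_k = sm.OLS(y, X).fit()
        print(order,
              fit_k.pvalues[-1],              # P-value for the highest-order term
              100.0 * fit_k.rsquared_adj)     # adjusted R-squared, in percent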

Options button

Enter the order of the polynomial to be fit (0-8):

The settings here affect the output of all of the other tabs.

Back to Top