Regression Diagnostic Plots
Kirk Steinhorst
There are a variety of plots used in determining if you have built a statistically sound regression model. We will consider
- Residual plots for testing the basic underlying assumptions.
- Partial regression plots for isolating the effect of a single X variable.
- Component plus residual plots (also called partial residual plots) for assessing nonlinearity in X.
- Augmented component plus residual plots (also called augmented partial residual plots) used as an extension of the component plus residual plots.
We will illustrate the various plots with an agricultural economics example. We start with basic statistics including correlations. An iterative process (SAS, out1, out2, out3) leads us to a three-variable model (see the reduced correlation matrix).
- Let's first check the underlying assumptions of this model. We start by plotting the residuals (or PRESS residuals) versus y (or yhat). Then we plot the residuals versus X2 = PPO (or any other independent variable) and versus the sample order. See the associated SAS code. Since these residual plots show no major problem, we next check normality with a box plot and a histogram (sometimes a dot plot is best). We can also look at a normal probability plot of the residuals, although the sample size here is small.
- The difficulty with multiple regression is that the independent variables are usually NOT independent in the statistical sense. There is overlapping information about y in the various x's. The partial regression plot is the best tool available for isolating the effect of a particular x on y (see slide 1 and slide 2). It plots y adjusted for the other independent variables against Xj adjusted for the other independent variables. There is one partial regression plot for each independent variable in the model. Here we illustrate the effect of PPO on PBE (SAS code). The slope of the line is bj, the coefficient of Xj in the full model.
- Another plot that helps us look for nonlinearity in Xj is a plot of the residual + bj*Xj versus Xj. The slope is again bj, but the horizontal axis is Xj itself rather than Xj adjusted for the other independent variables. Again we use Xj = PPO for illustration (SAS code attached).
- The augmented partial residual plot looks for nonlinearities in an X variable by adding a second order term in that variable to the model. We then plot residual + bj*Xj + b?*Xj^2 versus Xj, where b? denotes the fitted coefficient of the squared term and the residual comes from the augmented model. Experience shows that this plot is sometimes more sensitive to nonlinearities in Xj. (Plot and SAS code)
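The quantities behind the first two items of the list above can be sketched in a few lines. This is a minimal illustration in Python with NumPy, not the SAS programs linked from this page. The data are synthetic stand-ins for the agricultural economics data, and the helper `fit_ols` and all variable names are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept column prepended to X."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A, coef, y - A @ coef

# Synthetic stand-in data (assumption: three predictors, two of them overlapping).
rng = np.random.default_rng(0)
n = 30
X = rng.normal(size=(n, 3))
X[:, 1] += 0.6 * X[:, 0]
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=n)

# Item 1: ordinary and PRESS residuals, to be plotted against yhat,
# against one of the X's, or against the sample order.
A, b, resid = fit_ols(X, y)
yhat = A @ b
leverage = np.diag(A @ np.linalg.pinv(A))   # diagonal of the hat matrix
press_resid = resid / (1.0 - leverage)      # leave-one-out prediction errors

# Item 2: partial regression plot for column j.
# Vertical axis: y adjusted for the other X's.
# Horizontal axis: Xj adjusted for the other X's.
j = 1
others = np.delete(X, j, axis=1)
_, _, y_adj = fit_ols(others, y)
_, _, xj_adj = fit_ols(others, X[:, j])

# The least-squares slope through these points equals bj from the full model.
slope = (xj_adj @ y_adj) / (xj_adj @ xj_adj)
print(slope, b[j + 1])
```

Plotting `resid` or `press_resid` against `yhat` gives the basic residual plot, and plotting `y_adj` against `xj_adj` gives the partial regression plot. The two printed numbers should agree up to rounding.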
We have illustrated THREE fundamental sets of diagnostic plots: residual plots for checking assumptions (item 1 above), partial regression plots (item 2), and partial residual plots (items 3 and 4). See ageconda.sas for a complete listing of the program and the accompanying output.
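The partial residual plots of items 3 and 4 can be sketched the same way. Again this is a hedged illustration in Python with NumPy rather than the linked SAS code. The data are synthetic, with y deliberately curved in one predictor, and `fit_ols`, `b_lin`, and `b_sq` are hypothetical names.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept column prepended to X."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A, coef, y - A @ coef

# Synthetic stand-in data (assumption): y is quadratic in the second column.
rng = np.random.default_rng(1)
n = 40
X = rng.uniform(0.0, 4.0, size=(n, 3))
y = 1.0 + 0.7 * X[:, 0] + 0.5 * X[:, 1] ** 2 - 0.4 * X[:, 2] + rng.normal(scale=0.2, size=n)

j = 1
xj = X[:, j]

# Item 3: component plus residual (partial residual) values, plotted against Xj.
_, b, resid = fit_ols(X, y)
partial_resid = resid + b[j + 1] * xj

# Item 4: augmented version. Refit with Xj^2 added, then add back both terms.
X_aug = np.column_stack([X, xj ** 2])
_, b_aug, resid_aug = fit_ols(X_aug, y)
b_lin, b_sq = b_aug[j + 1], b_aug[-1]
aug_partial_resid = resid_aug + b_lin * xj + b_sq * xj ** 2

# A straight line fitted to the item 3 points has slope bj.
_, line, _ = fit_ols(xj[:, None], partial_resid)
print(line[1], b[j + 1])
```

Plotting `partial_resid` and `aug_partial_resid` against `xj` gives the two plots. With data curved like this, both should bend visibly away from a straight line.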
This does not exhaust the plots one can do in regression. Some people start with a CASEMENT plot, that is, a plot of every variable against every other variable. This can be useful, but if y is related to two other variables jointly and not particularly to either one singly, then such plots will do no good. An example of this is y = soil respiration, x1 = soil temperature, and x2 = soil moisture.
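The limitation of pairwise plots can be shown with a small synthetic case. This is only a sketch in Python with NumPy. The numbers are invented for illustration and are not soil respiration data. Here y depends on the difference between two nearly identical predictors, so neither pairwise plot would show much, yet the two predictors together explain almost all of y.

```python
import numpy as np

# Synthetic illustration (assumption): x2 nearly duplicates x1,
# and y depends on the small difference between them.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
y = x1 - x2 + rng.normal(scale=0.01, size=n)

# Pairwise correlations, which are what a plot of every variable
# against every other variable can reveal.
r_y_x1 = np.corrcoef(y, x1)[0, 1]
r_y_x2 = np.corrcoef(y, x2)[0, 1]

# Joint fit on both predictors.
A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef
r_squared = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print(r_y_x1, r_y_x2, r_squared)
```

The two pairwise correlations should come out weak while the joint R-squared is close to 1.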
Another plot is the RIDGE TRACE plot. One solution to multicollinearity is the addition of a small value (k) to the diagonal of X'X. To determine k, one can plot the values of each bj versus k. The smallest k for which each ridge trace plot shows stability in the coefficient is the k to adopt. Norm Draper, who did a lot of the early work on ridge regression, no longer believes in it. I agree with Norm. Ridge regression makes statisticians feel warm and fuzzy, but how does one interpret the results in the real world?
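For readers who want to see how a ridge trace is computed, here is a minimal sketch in Python with NumPy. The data are synthetic, and the grid of k values is arbitrary. The X's are centered and scaled so that X'X is in correlation form, which is a common convention assumed here rather than something taken from this page.

```python
import numpy as np

# Synthetic stand-in data (assumption): x2 is nearly collinear with x1.
rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + x1 + x2 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

# Center and scale so that Z'Z is the correlation matrix of the X's.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Z = Z / np.sqrt(n - 1)
yc = y - y.mean()

# Ridge coefficients b(k) = (Z'Z + kI)^(-1) Z'y on a grid of k values.
ks = np.linspace(0.0, 0.5, 26)
p = Z.shape[1]
trace = np.array([np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ yc) for k in ks])

# Each column of `trace` is one coefficient's path; plot it against ks.
print(trace[0])    # k = 0, ordinary least squares on the scaled X's
print(trace[-1])   # largest k on the grid
```

Plotting each column of `trace` against `ks` gives one ridge trace per coefficient. The coefficients are on the standardized scale, which is part of why interpreting them takes care.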
For a second example of diagnostic plotting in regression, see the heat transfer example (SAS code and output). This example is interesting because the X's are somewhat uncorrelated.
A third example is given by the hospital personnel prediction problem. This problem has collinearity and nonlinearity. See the output.