Regression Diagnostic Plots
Kirk Steinhorst

A variety of plots can help determine whether you have built a statistically sound regression model. We will consider residual plots for checking assumptions, partial regression plots, and partial residual plots.


We will illustrate the various plots with an agricultural economics example. We start with basic statistics, including correlations. An iterative process (SAS, out1, out2, out3) leads us to a three-variable model (see the reduced correlation matrix).

  1. Let's first check the underlying assumptions of this model by plotting the residuals (or PRESS residuals) versus y (or yhat). Then we plot the residuals versus X2 = PPO (or any other independent variable) and versus the sample order. See the associated SAS code. Since these residual plots show no major problem, we look at normality with a box plot and a histogram (sometimes a dot plot is best). We can also look at a normal probability plot of the residuals, although the sample size here is small.

  2. The difficulty with multiple regression is that the independent variables are usually NOT independent in the statistical sense. There is overlapping information about y in the various x's. The partial regression plot is the best tool available for isolating the effect of a particular x on y (see slide 1 and slide 2). There is one partial regression plot for each independent variable in the model. Here we illustrate the effect of PPO on PBE (SAS code). The slope of the line is bj, the coefficient of Xj in the full model.

  3. Another plot that helps us look for nonlinearity in Xj is the partial residual plot, a plot of the residual + bj*Xj versus Xj. The slope is again bj, but the horizontal axis is Xj itself rather than Xj adjusted for the other independent variables. Again we use Xj = PPO for illustration (SAS code attached).

  4. The augmented partial residual plot looks for nonlinearities in an X variable by adding a second-order term in that variable to the model. We then plot residual + bj*Xj + bjj*Xj^2 versus Xj, where bjj is the coefficient of the squared term. Experience shows that this plot is sometimes more sensitive to nonlinearities in Xj. (Plot and SAS code)
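The constructions in items 2 through 4 can be sketched in a few lines. This is only an illustration, not the SAS program linked above: it uses Python with NumPy, the data are simulated stand-ins rather than the agricultural economics data, and the names x1, x2, x3, and fit are hypothetical. Here x2 plays the role of Xj.

```python
import numpy as np

def fit(A, v):
    """Least squares coefficients and residuals for regressing v on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return coef, v - A @ coef

# Simulated stand-in data (an assumption); x2 is deliberately correlated with x1
rng = np.random.default_rng(1)
n = 60
x1 = rng.normal(size=n)
x2 = 2.0 + 0.7 * x1 + rng.normal(scale=0.7, size=n)
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + 0.5 * x3 + rng.normal(scale=0.5, size=n)
ones = np.ones(n)

# Full three-variable model: columns are intercept, x1, x2, x3
b, resid = fit(np.column_stack([ones, x1, x2, x3]), y)
bj = b[2]

# Item 2, partial regression plot: adjust both y and x2 for the other predictors
others = np.column_stack([ones, x1, x3])
_, ey = fit(others, y)     # vertical axis
_, ex = fit(others, x2)    # horizontal axis
slope_partial_regression = (ex @ ey) / (ex @ ex)

# Item 3, partial residual plot: residual + bj*Xj against Xj itself
partial_resid = resid + bj * x2
slope_partial_residual = np.polyfit(x2, partial_resid, 1)[0]

# Item 4, augmented partial residual plot: refit with a squared term in x2
ba, resid_a = fit(np.column_stack([ones, x1, x2, x3, x2**2]), y)
aug_partial_resid = resid_a + ba[2] * x2 + ba[4] * x2**2   # plot against x2

print(bj, slope_partial_regression, slope_partial_residual)
```

Both fitted slopes agree with bj, as stated in items 2 and 3. Because the simulated relationship is linear in x2, the augmented plot should show little curvature in this particular example.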

We have illustrated THREE fundamental sets of diagnostic plots--residual plots for assumptions (1. above), partial regression plots (2. above), and partial residual plots (3. and 4. above). Look at ageconda.sas for a complete listing of the program and the accompanying output.
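For the assumption checks in item 1, the quantities being plotted can be computed as follows. This is a minimal sketch in Python with NumPy on simulated data (an assumption, not the course data set); the plotting positions used for the normal probability plot are one common choice among several.

```python
import numpy as np
from statistics import NormalDist

# Simulated stand-in data (an assumption): three predictors, n = 30
rng = np.random.default_rng(0)
n = 30
X = rng.normal(size=(n, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), X])        # design matrix with intercept
b, *_ = np.linalg.lstsq(A, y, rcond=None)   # least squares coefficients
yhat = A @ b
resid = y - yhat

# Leverages h_ii are the diagonal of the hat matrix A (A'A)^-1 A'
h = np.diag(A @ np.linalg.solve(A.T @ A, A.T))
press = resid / (1.0 - h)                   # PRESS (leave-one-out) residuals

# Points for the residual plots:
#   (yhat, resid), (X[:, 1], resid), (np.arange(1, n + 1), resid)

# Normal probability plot: sorted residuals against normal quantiles
p = (np.arange(1, n + 1) - 0.375) / (n + 0.25)   # plotting positions (one convention)
q = np.array([NormalDist().inv_cdf(pi) for pi in p])
sorted_resid = np.sort(resid)               # plot (q, sorted_resid)
```

Each PRESS residual equals the prediction error for that observation when the model is refit without it, which is why it can replace the ordinary residual in these plots.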

This does not exhaust the plots one can do in regression. Some people start with a CASEMENT plot--a plot of every variable against every other variable. This can be useful, but if y is related to two other variables jointly and not particularly to either one singly, then such plots will do no good. An example of this is y = soil respiration, x1 = soil temperature, and x2 = soil moisture.
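A small simulation shows how pairwise plots can miss a joint relationship. The data below are constructed for the purpose (an assumption; they are not soil measurements, and the mechanism in the soil example may differ): y depends almost exactly on x1 and x2 together, yet its correlation with each one alone is weak.

```python
import numpy as np

# Constructed data (an assumption): x2 nearly cancels x1, and y = x1 + x2 + small noise
rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = -x1 + rng.normal(scale=0.1, size=n)
y = x1 + x2 + rng.normal(scale=0.01, size=n)

# Pairwise correlations, which are what a casement plot displays
r_y1 = np.corrcoef(y, x1)[0, 1]
r_y2 = np.corrcoef(y, x2)[0, 1]

# Joint fit of y on both variables
A = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ b
r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

print(r_y1, r_y2, r2)
```

The two scatterplots of y against x1 and against x2 would look like noise, while the two-variable regression explains nearly all of the variation in y.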

Another plot is the RIDGE TRACE plot. One solution to multicollinearity is the addition of a small value (k) to the diagonal of X'X. To determine k, one can plot the values of each bj versus k. The smallest k for which each ridge trace plot shows stability in the coefficient is the k to adopt. Norm Draper, who did a lot of the early work on ridge regression, no longer believes in it. I agree with Norm. Ridge regression makes statisticians feel warm and fuzzy, but how does one interpret the results in the real world?
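The ridge trace itself is simple to compute. The sketch below, in Python with NumPy, uses simulated collinear data (an assumption) and first centers and scales the predictors so that X'X is their correlation matrix, which is a common convention but not something the text above specifies. The grid of k values is likewise arbitrary.

```python
import numpy as np

# Simulated stand-in data (an assumption); x2 is nearly collinear with x1
rng = np.random.default_rng(4)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
y = 1.0 + x1 + x2 + 0.5 * x3 + rng.normal(size=n)

# Center and scale so that Z'Z is the correlation matrix of the predictors
X = np.column_stack([x1, x2, x3])
Z = (X - X.mean(axis=0)) / (X.std(axis=0, ddof=1) * np.sqrt(n - 1))
yc = y - y.mean()

# Ridge coefficients b(k) = (Z'Z + kI)^-1 Z'y over a grid of k
ks = np.linspace(0.0, 0.5, 26)
p = Z.shape[1]
trace = np.array([np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ yc) for k in ks])

# trace[i, j] is the coefficient of predictor j at k = ks[i];
# the ridge trace plots each column of trace against ks
```

At k = 0 the coefficients are the ordinary least squares values, and the overall length of the coefficient vector shrinks as k grows.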


For a second example of diagnostic plotting in regression, see the heat transfer example (SAS code and output). This example is interesting because the X's are somewhat uncorrelated.

A third example is given by the hospital personnel prediction problem. This problem has collinearity and nonlinearity. See the output.