When carrying out linear regression a number of assumptions are made. These assumptions are often not verified, yet it is very important to check that they are met (and to do something about it if they aren't!). The topic of this post is residual diagnostic plots and how to generate them in R.
There are a number of assumptions made when fitting a linear regression by unweighted linear least squares, and it is these assumptions that can be checked with diagnostic plots:
- There is a linear relationship
- All the y values are known with equal precision (homoscedastic data) – this is important for unweighted linear least squares (Section 2D I), but is not assumed for weighted linear least squares (Section 2D II).
- The errors in the y values are normally distributed
- The data points are independent
Through diagnostic plots, we can check that:
- There is no trend to the residuals (i.e. that there is a linear relationship)
- The errors have constant variance
- The errors are normally distributed
- There are no influential values (outliers or high-leverage points).
To generate diagnostic plots once you have created a linear model, the code is:
par(mfrow = c(2,2))
plot(lmExample.lm)
The term "par(mfrow = c(2,2))
" asks R to plot the multiple graphs in the plot window, in a 2 x 2 grid. This command will produce four plots in the plot window when the plots are generated.
The term "plot(...)
" asks R to plot the four diagnostic residual plots for the linear model (in this case the linear model (in this case lmexample.lm) that has been calculated previously – see Part I or Part II of this section. This command will produce four plots in the plot window.
For the example data set din32645 from the envalysis package the following output is produced:
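As a minimal sketch of how such a model might be fitted and the plots produced (the column names Conc and Meas are assumed here for illustration – check str(din32645) for the actual names in the data set):
library(envalysis)                                  # provides the din32645 calibration data
data(din32645)
lmExample.lm <- lm(Meas ~ Conc, data = din32645)    # assumed column names
par(mfrow = c(2,2))
plot(lmExample.lm)                                  # the four diagnostic plots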
The Residuals vs Leverage plot (bottom right in the default layout) can help to find influential observations, if any. Influential and outlying values are generally located in the upper right or lower right corner – look for points that lie beyond the Cook's distance threshold (dashed red line). These are the points that can exert a strong influence on the regression line; they should be investigated and, where justified, removed from the data set and the model refitted.
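To back up the visual check, the Cook's distances can also be inspected numerically; a common rule of thumb (a convention rather than a hard rule) flags points with a Cook's distance greater than 4/n:
cd <- cooks.distance(lmExample.lm)   # one Cook's distance per data point
which(cd > 4 / length(cd))           # indices of potentially influential points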
The Residuals vs Fitted plot (top left in the default layout) can identify a number of problems with the model and data.
- If the model is appropriate and fits the data well, the residuals will remain roughly uniform in magnitude as the fitted values increase, and will be approximately normally distributed about 0.
- If the residuals increase in magnitude as the fitted value grows, this suggests that a weighted regression model would be more appropriate (see the sketch after this list).
- If the residuals show a trend, this suggests that the relationship between x and y is non-linear and that a different function (e.g. quadratic) or a non-linear model is appropriate.
- Residual plots can also indicate potential outliers.
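As a sketch of the weighted regression mentioned above (assuming the din32645 column names used earlier, and that none of the concentrations are zero), 1/x^2 weighting is one common choice when the scatter grows with the fitted values:
wls.lm <- lm(Meas ~ Conc, data = din32645, weights = 1 / Conc^2)   # assumed columns
par(mfrow = c(2,2))
plot(wls.lm)                         # re-check the diagnostics for the weighted fit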
The Normal Q-Q plot of residuals (top right in the default layout) can be used to visually check the normality assumption. The points should approximately follow a straight line. If they deviate strongly from this line, the errors in y are not normally distributed (they are often skewed, or have long or short tails). Skewed residuals may also suggest that the relationship between x and y is non-linear and that a different function or a non-linear model is more appropriate.
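The visual Q-Q check can be supplemented with a formal normality test on the residuals, for example the Shapiro-Wilk test from base R (a small p-value suggests the errors are not normally distributed):
shapiro.test(residuals(lmExample.lm))   # test the residuals for normality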
The "Scale-Location" plot (bottom left in the default layout; the square root of the absolute standardised residuals against the fitted values) is similar to the Residuals vs Fitted plot, and is particularly useful for identifying non-constant variance of the residuals.
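Similarly, non-constant variance can be checked formally, for example with the Breusch-Pagan test from the lmtest package (install it first if necessary); a small p-value suggests heteroscedastic errors:
library(lmtest)
bptest(lmExample.lm)   # Breusch-Pagan test for non-constant variance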
If you do identify an issue concerning one or more of the assumptions, make sure you take corrective action – violating the linear regression assumptions means that your model is not appropriate for your data, and any use of it could lead to incorrect conclusions.
There is a lot of information in the diagnostic plots you can produce, and they live up to their name by diagnosing whether there is a problem with your linear regression!