Ordinary Least Squares regression, or OLS for short, is the workhorse quantitative tool in most social science fields. We like using OLS for several reasons. It minimizes the sum of squared errors, and (when the model includes an intercept) its residuals sum to zero – a cute little curiosity to most people, but one with some rather useful statistical properties. It’s unbiased as an estimator: if the assumptions are met, then the expected value of the sample coefficients equals the true value of the underlying population parameters. (In other words, it gets the answer right on average.) Perhaps even better, if those same assumptions are met, then OLS is the most efficient linear unbiased estimator; the standard errors of its coefficients will be no larger than those of any other linear unbiased estimator. (In other words, the answer will not only be right, but it will be as certain as we can possibly be: the sampling distribution of the estimated coefficients will be the skinniest we can get it.) And best of all, as the sample size gets larger and larger, the OLS estimates converge on the true underlying coefficients and the variance of the estimates approaches zero.
For OLS to have these wonderful properties, though, it has to satisfy six important assumptions. Nearly all of these assumptions focus on the error term, the unsystematic (‘stochastic’) bit of the outcome that is not explained by the independent variables. Each observation has an error value associated with it: the gap between the observed value of the dependent variable and the value the model predicts for it. These error terms, both singly and jointly, affect a lot of the properties of OLS.
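In generic textbook notation (the symbols below are illustrative, not tied to any particular dataset), the model and its leftover bit look like this:

```latex
% Systematic part plus an additive, unsystematic error
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i

% After estimation, each observation's residual is the observed value
% minus the model's prediction
e_i = Y_i - \hat{Y}_i
```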
The six key assumptions are:
- The regression model is linear, is correctly specified, and has an additive error term.
- All explanatory variables are uncorrelated with the error term.
- The error term has a zero population mean.
- Observations of the error term are uncorrelated with each other (no serial correlation).
- The error term has a constant variance (no heteroskedasticity).
- No explanatory variable is a perfect linear function of any other explanatory variables (no perfect collinearity).
A seventh assumption is optional but commonly invoked: the error term is normally distributed. The six assumptions in the list are the classical (Gauss-Markov) assumptions; the Gauss-Markov Theorem says that when all six are met, OLS is BLUE: the Best Linear Unbiased Estimator. We’ll discuss each of these in turn, but I should note first that, in the interests of accessibility, the discussion below is highly non-technical. Those desiring a more technical presentation, including explanations and implications of BLUE, should consult the items in the Further Fun list at the bottom of this page.
1. The regression model is linear, is correctly specified, and has an additive error term.
This is pretty much the only assumption that we as researchers have any control over meeting, and even there, we can only help a bit. The ‘regression model’ here refers to the actual model specification that you estimate. In short, to meet this assumption, we have to have:
- (a) A correct understanding of all the systematic influences that affect the DV (i.e., we know all the IVs and have entered them into the model – they’re included in the model specification),
- (b) These systematic influences (IVs) produce the DV value by adding together, not by serving as exponents for each other or by other unusual means, and
- (c) The error term (the residual or unsystematic part of the DV value left over after the IVs are accounted for) simply adds to the combined systematic parts to produce the DV value.
Of these, (b) and (c) are parts of the underlying theoretical model; a violation of (c) in particular is unusual enough in the social sciences that I can’t think of a real example to give, so for 99.99% of applications we can assume it is met. (And remember that “addition of negative numbers” is another name for subtraction; negative addition still meets these criteria.)
Only (a) is something that we the researchers can affect. The efforts we put into theorizing, understanding the control variables other researchers use, and collecting valid data all contribute to meeting this assumption, and so they contribute directly to the validity of our results. If we don’t do these things right, then our OLS estimates are going to be wrong in some way, and our conclusions will be wrong as a result. Of course, we can’t do all of these things perfectly. But we do have an obligation to do them as well as we possibly can, to apply appropriate corrections to the model when we know we are violating one of these conditions (e.g., adding an omitted variable or an interaction term), and to constrain our conclusions as necessary to accommodate violations that may affect our results.[1]
2. All explanatory variables are uncorrelated with the error term.
The next four assumptions concern the error term. Assumption 2 concerns the error term’s relationship with the explanatory variables, while Assumptions 3-5 concern the behavior of the errors themselves.
The assumption that the explanatory variables are uncorrelated with the error term is related both to our previous assumption about the model specification and to our other assumptions about the error terms. If a model is correctly specified, we have captured all of the systematic variation in the outcome with our IVs. In that case, all of the remaining variation – the error term – should be uncorrelated with (unrelated to) the explanatory variables. That’s what it means to have captured all the systematic variation: we have nothing left over, no unexplained variance, that moves in a systematic pattern with any of the explanatory variables.
If we don’t have all the relevant explanatory variables in our model, though, this assumption breaks down. Some part of the unexplained variation (the error term) ends up being correlated with included explanatory variables that are themselves correlated with the omitted variable.[2] And as we discussed in the book, the model misattributes that correlated variation to the included variables and comes up with all sorts of wrong answers as a result (i.e., omitted variable bias occurs). And since bias is occurring, OLS clearly can’t be BLUE, as BLUE requires unbiasedness.
We can evaluate, to some extent, whether this assumption is satisfied by checking the residual-versus-fitted-value plot (rvfplot in Stata); violations will result in patterns in the error points such as a U-shaped or inverted-U-shaped curve, a downward or upward slope, or other clearly distinguishable pattern.
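For instance, a minimal Stata sketch of this check – with y, x1, and x2 standing in as placeholder variable names – might look like this:

```stata
* Hypothetical model: y regressed on two placeholder predictors
regress y x1 x2

* Residual-versus-fitted plot: a flat, shapeless band of points around
* zero is what we hope to see; curves or slopes suggest trouble
rvfplot, yline(0)

* Added-variable plots (see note 2) can also hint at omitted variables
avplots
```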
3. The error term has a zero population mean.
4. Observations of the error term are uncorrelated with each other (no serial correlation).
5. The error term has a constant variance (no heteroskedasticity).
Now we get to the trickier ones. They all concern the relationship between the errors associated with each observation.[3]
In a properly specified regression, the errors will have a zero population mean. (Don’t believe me? Run a quick regression on a few columns of random numbers [use the RANDBETWEEN function in Excel to generate them quickly; aim for an N of 150+], save the residuals, and then take their mean. Betcha it’ll be 0 or pretty darn close to it.)[4] Certain kinds of specification problems, however, can result in non-zero error means – or, more precisely, in errors whose expected value depends on the explanatory variables. Models estimated on data that suffer from selection bias often show this, since a biased subset of the full population is omitted from the model.
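If you would rather try this exercise in Stata than Excel, a quick sketch (all of the data here are simulated, so the variable names are arbitrary) might be:

```stata
* Simulate 200 observations of pure noise
clear
set obs 200
set seed 12345
generate y  = runiform()
generate x1 = runiform()
generate x2 = runiform()

* Fit the regression and save the residuals
regress y x1 x2
predict resid, residuals

* The mean of the residuals should be zero (to machine precision)
summarize resid
```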
Our key regression assumptions also require that the error terms be uncorrelated with (that is, independent from) one another. In plain English, the error term of one observation should not help us predict the error on another observation – especially not the next observation if the observations are sequential (time-series) on the same unit. For example, let’s think of explaining a country’s inflation rate. Inflation, as any politician will tell you, is a persistent problem. Its value doesn’t change quickly, particularly downward. So, what if we wanted to predict inflation in the second quarter (Q2) of year t? Our best guess would be Q1’s inflation, plus or minus some small amount. And then, to predict Q3, we’d look at Q2, plus or minus some small amount. These observations are, in technical jargon, serially correlated (or autocorrelated): each one predicts the next one in the sequence fairly well.
Serial correlation can occur in any kind of time-series or panel (time-series cross-sectional) data.[5] For example, consider household energy usage. In the northern hemisphere, consumption of electricity is systematically higher in summer (air conditioners) and winter (electric heat) than in fall and spring; consumption of fuel oil and natural gas reliably rises in fall and peaks in winter before falling again. In this case, observations systematically predict other observations in a recurrent seasonal pattern – here, with a one-year (four-quarter) lag. When observations predict one another like this, the assumption of independence of errors is violated. To return to the inflation example: if inflation is unusually high in Q3 (i.e., has a large positive error), then we would also expect it to be unusually high in Q4.
When we have serial correlation, OLS estimates are still unbiased (the coefficients are ‘correct’), but they are inefficient (OLS thus fails the “best” part of BLUE). In the presence of autocorrelation, the reported standard errors will (usually) be smaller than their true values. As a result, we mistakenly think the estimates are more precise than they actually are; we’re more likely to reject the null (claim support for our hypotheses) than we should be.[6] This leads to incorrect conclusions – namely, we make more Type I errors.
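In Stata, the usual check (see note 6) looks something like the sketch below; the quarterly date variable qdate and the inflation model are hypothetical stand-ins, not a real dataset.

```stata
* Declare the data as a time series (qdate is a hypothetical quarterly date)
tsset qdate

* Estimate a hypothetical inflation model
regress inflation unemployment moneygrowth

* Durbin-Watson statistic: values near 2 suggest no first-order serial
* correlation; values well below 2 suggest positive autocorrelation
estat dwatson
```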
Finally, we have the issue of heteroskedasticity. Formally, this occurs when the variance of the errors is not constant across observations. In our ideal world, a scatterplot of the outcome against any independent variable should show a nice, even dispersion of points around the regression line, with a similar range (vertical spread) across the whole plot and no obvious patterns or trends.[7] (This condition is called homoskedasticity, “same variance”.) When the assumption of homoskedasticity is violated, we typically observe funnel-shaped or hourglass-shaped distributions of errors around the line, where the variance (spread) is systematically larger in some parts of the distribution than in others.
Unlike serial correlation, which is largely a time-series problem, heteroskedasticity can come from a variety of sources, in any kind of data, and different forms of it may be hidden by violations of other assumptions. It can arise from measurement error. Individuals with higher incomes, for example, may have only rough estimates of the fraction of their income spent on leisure activities; both over- and under-estimation are plausible. For lower-income individuals, the reported percentage will span a much smaller range, so their responses are unlikely to be too far off the true value (i.e., their estimates are more precise). Differences among subpopulations, such as between Blacks and whites or between EU countries and non-EU countries, can also produce heteroskedastic errors. So can instances where an extreme value on an IV is a necessary but not sufficient condition for extreme values on the DV.[8]
The effects of heteroskedasticity on OLS are largely the same as those of autocorrelation. OLS estimates are unbiased (and consistent), but they’re inefficient. The same problems occur as before: standard errors are incorrect, and we are usually prone to making more Type I errors. Heteroskedasticity causes OLS to fail the “best” part of BLUE because other estimators are more efficient. If you suspect that your data may be heteroskedastic for any reason, start with a residuals-versus-fitted-values plot (run the command “rvfplot” immediately after estimating the regression; I suggest you specify the yline(0) option to ease interpretation). If the plot reveals suspicious patterns, Stata can also test for heteroskedasticity with the “hettest” command; small p-values indicate a violation of the assumption of homoskedasticity. (See Stata’s help or the UCLA page in the Further Fun list for more information on how to interpret the results.) If the test reveals heteroskedasticity, check with your instructor or a consultant at your school’s statistics lab for guidance on how to rectify the situation.
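Putting those two steps together, a minimal sketch (again with placeholder variable names) would be:

```stata
* Fit the model
regress y x1 x2

* Visual check: look for funnel or hourglass shapes in the residuals
rvfplot, yline(0)

* Breusch-Pagan / Cook-Weisberg test; in current Stata this is run as
* "estat hettest" (older releases used the bare "hettest" command).
* A small p-value means we reject the null of constant error variance.
estat hettest
```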
In short, violation of these assumptions can have very serious effects on your conclusions. If you have any suspicion that any of these violations may be occurring in your data, I strongly encourage you to deploy the appropriate test and consult with your instructor or stats lab staff as necessary.
6. No explanatory variable is a perfect linear function of any other explanatory variables (no perfect collinearity).
This seems like a pretty straightforward assumption. We’ve discussed it at several points in the book, such as our discussions of collinearity and the need for variation in our variables. Two constant variables, for example, are perfect linear functions of one another: if x1 = 2 for all cases – say, it’s an intermediate value of some ordinal variable – and x2 = 47 for all cases, then x2 = 23.5 * x1.
A sneakier instance of this, however, can happen when using dummy variables. Let’s imagine that we asked a bunch of survey respondents to indicate whether they are male or female, and we used this data to construct two variables, one indicating whether a respondent is male and one indicating if the respondent is female. If we enter both of these into a regression at once, though, we will violate Assumption 6: Since all respondents are either male or female, but not both or neither, then Male + Female = 1 for all cases. The same thing can happen with geographic region or continent (all countries are somewhere), or whether someone filed their taxes or not (you did or you didn’t), or electoral system. This is why we always need to have an omitted or reference category when using dummy variables. One of the categories needs to be excluded from the model to allow the model to calculate effects for the other variables.
So why do we need this assumption? The answer is mostly computational: perfect collinearity forces the model to try to divide by zero (more precisely, the matrix that must be inverted to compute the estimates cannot be inverted), which doesn’t work. Stats software often calculates the regression constant by inserting a variable equal to 1 for all observations and then proceeding through the matrix math to calculate the constant and coefficients. (It works in matrix math. Trust me.) So if you accidentally enter a perfectly collinear combination of variables – like a set of dummies that sums to 1, duplicating that constant – the software will either drop one of the offending variables and calculate normally, or simply spit the model back out and refuse to calculate anything.
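A quick Stata sketch of the dummy-variable trap, using hypothetical 0/1 indicators named male and female and a made-up outcome:

```stata
* Entering both dummies alongside the constant is perfect collinearity:
* male + female = 1 for every respondent, duplicating the constant term
regress income male female age

* Stata will typically drop one of the two dummies (with a note about
* collinearity) rather than estimate the model as written.

* The conventional fix: choose the reference category yourself and omit it
regress income female age   // men are the omitted (reference) group
```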
Conclusion
As you can see, a lot is going on under the hood of a simple OLS regression. The data need to satisfy a number of important assumptions to produce results that we can trust. In a typical first-term stats class or undergraduate methods course, we wouldn’t necessarily be too concerned with whether you tested all of these. We’re more interested in getting you to try empirical research and/or quantitative methods. But it’s worth being aware that common data problems can have major effects on your findings.
Further Fun
X. Chen, P. Ender, M. Mitchell, and C. Wells. 2003. Regression with Stata. http://www.ats.ucla.edu/stat/stata/webbooks/reg/default.htm. Chapter 2 gives details of how to test for violations of many of the assumptions listed here; comparable web books are available for SPSS, SAS, and R, though titles vary across platforms.
Oscar Torres-Reyna. 2003. “Linear Regression Using Stata, v6.3.” Princeton University Data and Statistical Services. http://dss.princeton.edu/training/Regression101.pdf. Lots of help in interpreting Stata regression results, including syntax and interpretation for various tests.
A.H. Studenmund. 2010. Using Econometrics: A Practical Guide, 6e. Prentice Hall. Chapters 1-4 are definitely the most accessible of the technically precise presentations listed here.
William D. Berry. 1993. Understanding Regression Assumptions (Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-092). Newbury Park, CA: Sage. Moderately technical; scalar notation, but a significant assumption of mathematical background. Appropriate for undergrads or MA students with advanced math backgrounds, or for PhD students.
Peter Kennedy. 2008. A Guide to Econometrics, 6e. Wiley-Blackwell. Classic text containing both intuitive and moderately technical explanations. Presumes that readers are taking or have taken an econometrics course and need clarification or a refresher, but students who aren’t afraid of math notation should be able to handle it.
[1] Stata’s ovtest command can help determine whether the model requires additional variables, but remember that it is an entirely atheoretical test.
[2] In fact, we can use post-regression plots to help determine whether we have omitted any variables. We simply use the added variable plot (avplot); clustered bunches of errors suggest that we’ve missed something. Change the points to include labels to determine what observations are clustered, and use that to try to figure out what you’re missing.
[3] Sometimes you’ll see assumptions 3, 4 and 5 grouped together as a single assumption that errors are “i.i.d.” – identically and independently distributed.
[4] By construction, the residuals of an OLS model that includes an intercept always sum to zero, so their mean is zero as well.
[5] It can also occur as a result of omitted variables that exhibit growth over time, or as a function of systematic measurement errors; I’ll leave those topics for your second course of econometrics.
[6] The Durbin-Watson statistic is the usual test for the presence of serial correlation. This statistic ranges between 0 and 4, with values near 2 indicating no serial correlation.
[7] An alternative visual plots the residuals versus the fitted values, normally with a horizontal line at y = 0. Both give you essentially the same information, but one version plots them around a horizontal line (the rvf plot) and the other around a sloped regression line. I find the rvf plot easier to interpret, but most examples use the regression line approach.
[8] I thank Richard Williams at Notre Dame for this insight; see his wonderful grad stats notes and Stata handouts at https://www3.nd.edu/~rwilliam/ for more details.