In statistics, there are numerous ways to analyze data. The whole field revolves around it. Of course, the way that you analyze data depends on both what you want to know and what the data actually are. One of the most common ways to analyze data is with regression models, which estimate patterns in the data using a technique called **ordinary least squares** (OLS) regression. As with any statistical test or method, OLS must meet certain assumptions to be valid. In this post, we’re going to investigate what the OLS assumptions are and what they mean.

## What is OLS?

Ordinary least squares regression, OLS for short, is a method of estimating the relationship between two or more variables. It is the primary method behind simple and multiple linear regression. It works by choosing the line of best fit that minimizes the sum of squared differences between the actual and predicted values of the dependent variable.
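As a minimal sketch of that idea, here is an OLS fit in NumPy on synthetic data (the seed, sample size, and true coefficients are made up purely for illustration):

```python
import numpy as np

# Toy data: y roughly follows 2 + 3x with a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 1, 50)

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# OLS finds the coefficients b that minimize the sum of
# squared residuals: argmin_b ||y - X b||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

intercept, slope = beta
print(intercept, slope)  # should land close to 2 and 3
```

With only mild noise, the recovered intercept and slope sit close to the true values used to generate the data.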

Like many statistical methods, OLS comes with a set of assumptions that should be met to fully utilize its capabilities. If the OLS assumptions are not met, then you run the risk of using a method that will not provide the correct interpretation of the data.

## Assumption 1: The linear regression model is “linear in parameters.”

“Linear in parameters” is a tricky term. It does not mean the data themselves must fall on a line; it means the *model* is linear in its coefficients. The dependent variable is modeled as a weighted sum of the predictors plus an error term, and those weights are the parameters OLS estimates.

Although the data do not have to fall on a perfect line, the relationship between the predictors and the outcome should be describable as a weighted sum. Even a curved, quadratic-looking pattern can satisfy this assumption if you include a transformed term such as x² as a predictor, because the model stays linear in its coefficients. And, of course, if there is no relationship at all, then OLS will not be the best method to use to find a pattern in the data.
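One wrinkle worth noting: a curve in x can still be handled by plain OLS if the model stays linear in its coefficients. The sketch below (entirely synthetic data) fits a quadratic relationship just by adding an x² column to the design matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 60)
y = 1 + 0.5 * x - 2 * x**2 + rng.normal(0, 0.5, 60)

# y = b0 + b1*x + b2*x^2 is curved in x but still
# "linear in parameters", so ordinary least squares applies.
X = np.column_stack([np.ones_like(x), x, x**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # roughly [1, 0.5, -2]
```

The fitted coefficients recover the values used to simulate the curve, even though a straight line through this data would fit poorly.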

## Assumption 2: There is a random sampling of observations

In order to run a full and appropriate regression, you want to ensure that the sample is drawn randomly from the population. If the sample is *not* random, then you run the risk of introducing an unknown factor into your analysis that OLS will not account for.

It is also imperative that your independent variable is theorized to *cause* your dependent variable. On its own, OLS establishes association, not causation: it investigates how well the independent variables predict the dependent variable. Interpreting the coefficients causally requires additional assumptions beyond the mechanics of OLS, which is why the theory motivating your model matters.

Another note regarding sampling is that you should have many more observations in your sample than you have independent variables. For example, if your regression model has 5 independent variables, then you should have at least 50 observations, though this is not a hard-and-fast ratio; it is merely a rule of thumb. If you were to have 30 independent variables and only 20 observations, OLS cannot even produce a unique solution.
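The failure mode with more predictors than observations can be seen directly: the design matrix cannot have full column rank, so no unique set of coefficients exists. A small sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 20, 30  # 20 observations, 30 predictors

# Design matrix: intercept plus 30 random predictor columns.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

# Rank is capped at the number of observations, so the
# 31 coefficients cannot all be uniquely determined.
rank = np.linalg.matrix_rank(X)
print(rank)  # 20, far short of the 31 columns
```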

## Assumption 3: The conditional mean should be zero

This means that the expected value of the error term, conditional on the values of your independent variables, should be zero. In other words, there is no relationship between the errors and your independent variables. If the conditional mean is far from zero, it usually means that something you did not account for, such as an omitted variable, is influencing both your predictors and your outcome.
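As a sample-level sanity check on synthetic data: when the model includes an intercept, the fitted residuals average to zero and are uncorrelated with the regressors *by construction*, so this sketch confirms the mechanics rather than proving the assumption about the unobserved true errors:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 4 + 1.5 * x + rng.normal(0, 1, 200)

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

# With an intercept in the model, OLS residuals average to zero
# and are uncorrelated with the regressors (up to rounding).
print(resid.mean())                 # ~0
print(np.corrcoef(resid, x)[0, 1])  # ~0
```

The real assumption concerns the true errors, which you never observe, so it ultimately rests on how the data were generated rather than on any in-sample check.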

## Assumption 4: There is no multi-collinearity (or perfect collinearity)

The absence of perfect collinearity is a vital assumption for multiple linear regression. Collinearity means that two (or more) of your independent variables have a strong linear relationship with one another. If two predictors are perfectly collinear, OLS cannot separate their individual effects, and even strong but imperfect multicollinearity inflates the variance of your coefficient estimates, making the model unreliable.

One reason this is so important is that OLS needs independent variation in each predictor to estimate its coefficient. With perfect collinearity, a predictor carries no variation of its own beyond the other predictors, so its coefficient cannot be uniquely determined.
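To see the breakdown concretely, here is a sketch on synthetic data where one predictor is an exact multiple of another; the design matrix loses rank, which is why the normal equations have no unique solution:

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = 2 * x1  # perfectly collinear with x1

X = np.column_stack([np.ones(100), x1, x2])

# With perfect collinearity the design matrix is rank-deficient,
# so X'X is singular and the coefficients are not identifiable.
rank = np.linalg.matrix_rank(X)
print(rank)  # 2, not 3: one column is redundant
```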

## Assumption 5: There is homoskedasticity and no autocorrelation

Heteroskedasticity means that the variance of the errors changes with the level of the data. You can often recognize heteroskedastic data visually: a residual plot fans out into a cone shape rather than a band of constant width. Since OLS inference is built on the variance of the errors, you want that variance to be constant (homoskedastic) rather than changing across observations.

Just a note: if your data *are* heteroskedastic, then there is most likely a variable that you are not accounting for in your model, and your standard errors will be unreliable even if the coefficient estimates themselves remain unbiased.
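One crude, hedged way to spot the cone shape numerically is to compare the spread of the residuals at low versus high values of the predictor. In the sketch below the data are simulated so the error spread deliberately grows with x:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 300)
# Error spread grows with x: classic cone-shaped heteroskedasticity.
y = 3 + 2 * x + rng.normal(0, 0.3 * x, 300)

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

# Crude check: residual spread in the upper half of x
# versus the lower half.
low = resid[x < 5.5].std()
high = resid[x >= 5.5].std()
print(low, high)  # high is noticeably larger than low
```

In practice you would use a formal test (such as Breusch–Pagan) or a residual plot, but this split-sample comparison conveys the idea.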

## Assumption 6: Error terms should be normally distributed

Some statistics texts cover five assumptions and some include six. I want to cover this one because it is vital for inference and for other types of regression models.

Normally distributed errors are a condition that goes beyond Assumption 3. The zero conditional mean only says that your positive errors cancel out your negative errors on average; normality additionally says the errors follow a bell curve around that mean, which is what justifies the usual t-tests and confidence intervals in small samples. This is something that you should check once you have fit your model.
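A quick post-fit check on synthetic data is to compute the sample skewness and excess kurtosis of the residuals, both of which should sit near zero for roughly normal errors (the thresholds here are just rules of thumb, not formal tests):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 500)
y = 1 + 2 * x + rng.normal(0, 1, 500)

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

# Standardize, then compute skewness and excess kurtosis;
# both are near 0 for roughly normal residuals.
z = (resid - resid.mean()) / resid.std()
skew = (z**3).mean()
kurt = (z**4).mean() - 3
print(skew, kurt)
```

A formal alternative would be a normality test or a Q-Q plot of the residuals; large skewness or kurtosis values flag departures from the bell curve.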

## The Takeaways

OLS is the basis for most linear and multiple linear regression models. In order to use OLS correctly, you need to meet the six OLS assumptions regarding the data and the errors of your resulting model. If you want to get a visual sense of how OLS works, please check out this interactive site. I hope this post helped clarify some things, and I hope to see any questions that you have below. Happy statistics!
