Statistics is a lot of fun. It is filled with lots of fun words too, like **heteroscedasticity**, also spelled *heteroskedasticity*. This is a fun word for a rather odd topic. But this particular topic is essential to interpreting so many other things, like linear regression. Let’s take a deeper look into exactly what heteroscedasticity is and how it is used.

## Funny Word, Serious Statistics

Essentially, heteroscedasticity is the extent to which the variance of the residuals depends on the predictor variable. Recall that a *residual* is the difference between an actual outcome and the outcome predicted by your model, and *variance* measures how spread out a set of values is. The data are *heteroscedastic* if the amount that the residuals vary from the model changes as the predictor variable changes.
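Here is a minimal numpy sketch of what that definition looks like in practice. All of the numbers are made up for illustration: the noise fed into `y` is deliberately scaled by `x`, so the residuals from a least-squares fit spread out as `x` grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data whose noise grows with x: a classic heteroscedastic pattern.
# The slope of 2.0 and the noise scale 0.5 * x are arbitrary choices.
n = 500
x = rng.uniform(1, 10, n)
y = 2.0 * x + rng.normal(0, 0.5 * x)  # error spread scales with x

# Fit a simple least-squares line and collect the residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Compare residual spread for small x versus large x.
low = residuals[x < 5].std()
high = residuals[x >= 5].std()
print(f"residual std, small x: {low:.2f}")
print(f"residual std, large x: {high:.2f}")
```

The fitted line itself is still roughly right (least squares stays unbiased here), but the residual spread on the right half of the data is noticeably larger than on the left half, which is exactly what "the variance of the residuals depends on the predictor" means.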

This can be a rather abstract definition, so let’s look at an example.

Let’s say that you’re car shopping. Of course, you are concerned with gas mileage, because who isn’t? So you decide to plot gas mileage against the number of engine cylinders. When you do, you get a graph that looks like this:

There is a general downward pattern, but the data points are fairly scattered. You can fit a line of best fit to the data, but even then it misses a lot of the points.

In fact, the data points start out pretty spread out, pull closer together, and then spread out again. Hmmmm. That is *heteroscedastic* data. It means that our linear model does not fit the data very well, so we should probably adjust it.

## Why Bother with Heteroscedasticity?

Other than being fun to say, heteroscedasticity indicates that the data are influenced by something your model is not accounting for. This usually means that something else is going on and you may need to revise your model.

Essentially, you can check for heteroscedasticity by plotting the residuals against the independent variable. If they fan out or converge, the variability of the residuals (and therefore of the model’s errors) depends on the value of the independent variable. This is not good for our model, and it violates one of the assumptions of linear regression: constant error variance. If the data are heteroscedastic, then we need to rethink our model.
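There is also a formal version of this eyeball check: the Breusch–Pagan test. The sketch below hand-rolls the idea with numpy on simulated data (the coefficients and noise scale are invented for illustration): fit the model, then regress the *squared* residuals on the predictor. If that second regression explains anything at all, the residual variance depends on x.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated regression whose noise widens as x grows (heteroscedastic).
n = 400
x = rng.uniform(0, 10, n)
y = 3.0 + 1.5 * x + rng.normal(0, 0.3 * (1 + x))

# Step 1: fit the main model and collect residuals.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# Step 2 (Breusch-Pagan idea): regress squared residuals on x and
# measure how much of their variation that auxiliary fit explains.
b1, b0 = np.polyfit(x, resid**2, 1)
fitted = b0 + b1 * x
ss_res = np.sum((resid**2 - fitted) ** 2)
ss_tot = np.sum((resid**2 - np.mean(resid**2)) ** 2)
r_squared = 1 - ss_res / ss_tot

# n * R^2 is the Breusch-Pagan statistic; with one predictor it is
# roughly chi-squared with 1 degree of freedom under homoscedasticity,
# so values far above ~3.84 reject constant variance at the 5% level.
bp_stat = n * r_squared
print(f"Breusch-Pagan statistic: {bp_stat:.1f}")
```

In real work you would reach for a library implementation (statsmodels ships one) rather than rolling your own, but the hand-rolled version makes the logic transparent: heteroscedasticity is just "the squared residuals are predictable from x."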

### Other Tidbits

If data can be heteroscedastic, then they can be **homoscedastic** as well. Data are homoscedastic when the variability of the residuals *does not* change as the independent variable does. If your data are homoscedastic, that is a good thing. It means that your model accounts for the variables pretty well, so you should keep it.

One common misconception about hetero- and homo-scedasticity is that it has to do with the variables themselves.

It does not have to do with the variables, only the residuals!

Keep in mind that the residuals represent the error of your model. If the amount of error in your model changes as the variables change, then you do not have a very good model, and it is time to go back to the theoretical drawing board.

Hetero- and homoscedasticity are fairly important topics in financial and industrial applications. Ideally, your data would be homoscedastic, but when they are not, there are two types of heteroscedasticity: **conditional** and **unconditional**.

With unconditional heteroscedasticity, the variance of the residuals changes, but not in a way tied to the independent variable. With conditional heteroscedasticity, the variance of the residuals is affected by the independent variable in some unforeseen way. Conditional heteroscedasticity usually shows up with time series data, where the error variance in one period depends on the error variance in earlier periods.
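Conditional heteroscedasticity is easiest to see in a simulation. The snippet below is a minimal numpy sketch of an ARCH(1)-style process, the textbook time-series form of it (the parameters `omega` and `alpha` are illustrative, not estimated from any data): each period's error variance depends on the previous period's squared error, so the series itself looks uncorrelated while its squared values cluster.

```python
import numpy as np

rng = np.random.default_rng(2)

# ARCH(1)-style simulation: today's error variance depends on yesterday's
# squared error -- conditional heteroscedasticity in its simplest form.
n = 2000
omega, alpha = 0.2, 0.5  # illustrative parameters, not from any dataset
eps = np.zeros(n)
for t in range(1, n):
    sigma2 = omega + alpha * eps[t - 1] ** 2  # conditional variance
    eps[t] = rng.normal(0, np.sqrt(sigma2))

# The series itself is (approximately) uncorrelated, but its *squared*
# values are not: volatility clusters, the signature of conditional
# heteroscedasticity.
def lag1_corr(z):
    return np.corrcoef(z[:-1], z[1:])[0, 1]

print(f"lag-1 autocorrelation of eps:   {lag1_corr(eps):.3f}")
print(f"lag-1 autocorrelation of eps^2: {lag1_corr(eps**2):.3f}")
```

This is why plain linear regression diagnostics are not enough for financial time series: the raw errors can pass a correlation check while their variance is still highly predictable.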

TL;DR: heteroscedasticity is the tendency of the spread of the errors (residuals) to increase or decrease as the independent variable changes. It tells you that your model is not stellar, because something is affecting the data that your model does not account for. For a good model, your data should *not* be heteroscedastic. Happy statistics!
