# What is Heteroscedasticity?

Statistics is a lot of fun. It is filled with lots of fun words too, like heteroscedasticity, also spelled heteroskedasticity. This is a fun word for a rather odd topic. But this particular topic is essential to interpreting so many other things, like linear regression. Let’s take a deeper look into exactly what heteroscedasticity is and how it is used.

## Funny Word, Serious Statistics

Essentially, heteroscedasticity is the extent to which the variance of the residuals depends on the predictor variable. Recall that a residual is the difference between the actual outcome and the outcome predicted by your model, and that variance measures how spread out a set of values is. The data are heteroskedastic if the amount that the residuals vary around the model changes as the predictor variable changes.
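To make that definition concrete, here is a small simulation sketch in Python (all numbers are made up for illustration). The noise around a straight line is deliberately set to grow with the predictor, which is exactly the pattern heteroscedasticity describes, and we can see it by comparing the residual spread at small versus large predictor values:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is reproducible

# Simulate a predictor and a response whose noise grows with x:
# the standard deviation of the noise is proportional to x.
x = rng.uniform(1, 10, size=500)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5 * x)

# Fit an ordinary least-squares line and compute the residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Compare residual spread for small vs. large x values.
low = residuals[x < 5].std()
high = residuals[x >= 5].std()
print(f"residual spread, small x: {low:.2f}; large x: {high:.2f}")
```

Because the noise scale rises with x, the residuals fan out as x grows, even though the fitted line itself is perfectly reasonable.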

This can be a rather abstract definition, so let’s look at an example.

Let’s say that you’re car shopping. Of course, you are concerned with gas mileage because who isn’t? Since you are interested, you decide to plot gas mileage against the number of engine cylinders. When you do, you see a generally downward pattern: more cylinders, lower mileage. But at the same time, the data points seem to be a little scattered. It is possible to fit a line of best fit to the data, but it misses a lot of the points. In fact, it looks like the data points are pretty spread out at first, get closer, and then spread out again. Hmmmm. That represents heteroscedastic data. This means that our linear model does not fit the data very well, so we should probably adjust it.
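A quick sketch of that spread-close-spread pattern, using hypothetical mileage numbers (these are illustrative values, not real car data): the spread of mileage within each cylinder group stands in for the spread of residuals around the fitted line at that point on the x-axis.

```python
import numpy as np

# Hypothetical mpg figures grouped by engine cylinders --
# made-up numbers chosen to mimic the pattern in the text.
mpg = {
    4: [33, 26, 21, 30, 24, 34],   # wide spread
    6: [20, 21, 19, 18, 21],       # tighter in the middle
    8: [10, 17, 15, 19, 13, 12],   # wide again
}

# Print the standard deviation of mileage within each group.
for cyl, values in mpg.items():
    print(f"{cyl} cylinders: std = {np.std(values):.2f} mpg")
```

The middle group is noticeably tighter than the two ends, which is the kind of uneven residual spread that signals heteroscedasticity.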

## Why Bother with Heteroscedasticity?

Other than being fun to say, heteroscedasticity indicates that the data are influenced by something that you are not accounting for. This usually means that something else is going on and we may need to revise our model.

Essentially, one can check for heteroscedasticity by plotting the residuals against the predictor variable. If they fan out, or converge, then the variability of the residuals (and therefore of the model) depends on the value of the independent variable. This is not good for our model: constant residual variance is one of the assumptions of linear regression, and heteroscedastic data violates it. If the data are heteroscedastic, then we need to re-think our model.
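Beyond eyeballing a residual plot, there are formal checks. One common one is the Breusch-Pagan test: regress the squared residuals on the predictor and compute the Lagrange multiplier statistic n times R-squared, which follows a chi-square distribution (here with 1 degree of freedom) when the residual variance is constant. Below is a minimal NumPy sketch on simulated heteroscedastic data; libraries such as statsmodels provide a ready-made version of this test.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated heteroscedastic data: noise standard deviation grows with x.
x = rng.uniform(1, 10, size=400)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5 * x)

# Step 1: fit the main regression and collect the residuals.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Step 2: regress the squared residuals on x and compute R^2.
u2 = resid ** 2
gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
fitted = X @ gamma
r2 = 1 - ((u2 - fitted) ** 2).sum() / ((u2 - u2.mean()) ** 2).sum()

# Step 3: the LM statistic. Values above the 5% chi-square(1)
# cutoff of 3.84 point to heteroscedasticity.
lm = len(x) * r2
print(f"LM statistic: {lm:.1f} (5% chi-square(1) cutoff is 3.84)")
```

With noise this strongly tied to x, the statistic lands far above the cutoff, so the test rejects constant variance, matching what the residual plot would show.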