I love Oreos – I mean LOVE them. About once a year, I will eat an entire pack in one sitting and feel both proud and ashamed. But I always wonder: do I eat them because I’m hungry, or because they are conveniently located at hand level at the grocery store? Many stores actually study variables like food placement and sugar content to determine the best place to sell certain foods. Ever notice how the vegetables are almost always in the front? In statistics, studying these variables can be done through **regression analysis**.

Let’s take a look at what regression analysis really entails with a delicious example.

## What is Regression?

Simply put, a regression is a mathematical relationship between a dependent variable (the *outcome*) and one or more independent variables (called *predictors*). The relationship is estimated from the effect that these predictors have on the outcome. The easiest way to describe the relationship is with a mathematical model.

For our first example, let’s say that you want to determine whether the time of day has a significant effect on the purchase of Oreos. We would write out the model like this:

*Number of Oreos Purchased* = *b_{0}* + *b_{1}*(Time of Day) + *ε*

This gives us an equation that allows you to predict an outcome based on the variables involved.
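To make that concrete, here is a minimal sketch in Python of what “plugging into the model” looks like. The coefficient values and the prediction below are invented purely for illustration, not real sales figures:

```python
# Hypothetical fitted model: purchases = b0 + b1 * hour_of_day
# The coefficient values below are made up for illustration.
b0 = 12.0   # intercept: predicted packs sold at 00:00 hours
b1 = 1.5    # slope: extra packs sold per hour later in the day

def predict_purchases(hour_of_day):
    """Predict Oreo packs sold at a given hour (0-23)."""
    return b0 + b1 * hour_of_day

print(predict_purchases(0))   # at midnight: just the intercept, 12.0
print(predict_purchases(17))  # at 5 PM: 12.0 + 1.5 * 17 = 37.5
```

The error term *ε* does not appear in the prediction itself; it accounts for the scatter of real observations around these predicted values.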

### Variables

In this case, the number of Oreos purchased is the outcome variable that we are interested in. This is what is called a *continuous variable*, since it can take any value in a range of real numbers. The predictor variable, time of day, is once again a continuous variable.

You can do regression with any kind of variable like ordinal or ratio. You simply have to adjust your interpretation and method.

### The Other Stuff

The other terms model the relationship itself, so they require a little more effort to interpret.

*b_{0}* is called the intercept. It is interpreted as the predicted outcome when the predictor is zero. In our case, it would be the number of Oreos sold when the time of day is 00:00 hours.

*b_{1}* represents the actual relationship between Oreos sold and the time of day: the change in Oreos sold for each one-unit change in time of day. It will always be some kind of ratio.

*ε* is a necessity in regression that requires some more careful interpretation. It represents the error that is naturally present in the regression. For example, some people will buy Oreos for a variety of reasons other than the time of day. If the sample is truly random, *ε* takes those natural errors into account.

Of course, the more predictors you have in your regression model, the smaller that error should be. But let’s get into the two major types of regression.

## Simple Regression

In a simple linear regression, you are considering the relationship between a single predictor and a single outcome. These are considered the independent variable (X) and the dependent variable (Y), respectively. When looking for this relationship, you should look at a plot of the data first.

On a scatterplot, the dots represent the data in the form of Cartesian points. Once you have the data set, you can perform the calculations that lead to the individual mathematical relationships.

To calculate the slope and the intercept, we use the equations

*b* = Σ(*x_{i}* − *x̄*)(*y_{i}* − *ȳ*) / Σ(*x_{i}* − *x̄*)^{2}

*a* = *ȳ* − *b* *x̄*

The *b* represents the slope. The numerator measures how each pair of data points varies together (the covariance of the two data sets), and the denominator measures how the predictor varies on its own (the variance of X).

The *a* represents the intercept. It depends on the difference between the means of X and Y and on the relationship between the two variables.

These values drop straight into the simple linear formula for a line, *Y* = *a* + *b*X. This is called the *line of best fit*.
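Here is a short Python sketch of those two calculations, using a small invented data set (the hours and sales counts are made up for illustration):

```python
# Least-squares slope and intercept for simple linear regression,
# computed directly from the formulas above. Data are invented.
xs = [8, 10, 12, 14, 16, 18, 20]    # time of day (hours)
ys = [14, 18, 25, 27, 33, 38, 41]   # Oreo packs sold (made up)

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# slope b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
# intercept a = y_bar - b * x_bar
a = y_bar - b * x_bar

print(f"line of best fit: y = {a:.2f} + {b:.2f}x")
```

Running this on the toy data gives a positive slope, matching the upward trend you would see in the scatterplot.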

## Multiple Regression

Of course, it is more than the time of day that affects the purchase of delicious Oreos. When I go shopping, I usually grab the stuff that is higher from the ground because I am tall. So distance from the floor may make a difference.

In the case of multiple predictors, you use something called *multiple linear regression*.

This takes into account that the predictors may also affect one another. The general model for multiple linear regression is

*Y* = *b_{0}* + *b_{1}X_{1}* + *b_{2}X_{2}* + … + *b_{n}X_{n}* + *ε*

Multiple linear regression takes into account that multiple variables not only affect the outcome but also affect one another. It is even possible to have two or more variables interact with each other, much like ANCOVA.
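A multiple regression like this is usually fit by ordinary least squares over a design matrix. Here is a minimal sketch with NumPy, using invented data for the two predictors discussed above (time of day and shelf height); the numbers are illustrative only:

```python
import numpy as np

# Two hypothetical predictors: time of day and shelf height (invented data).
hour = np.array([8, 10, 12, 14, 16, 18, 20], dtype=float)
height = np.array([1.2, 0.5, 1.5, 0.9, 1.6, 0.7, 1.4])  # meters from floor
sold = np.array([14, 18, 25, 27, 33, 38, 41], dtype=float)

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(hour), hour, height])

# Ordinary least squares: solves for [b0, b1, b2] minimizing squared error.
coefs, *_ = np.linalg.lstsq(X, sold, rcond=None)
b0, b1, b2 = coefs
print(f"sold = {b0:.2f} + {b1:.2f}*hour + {b2:.2f}*height")
```

Each coefficient is interpreted as the change in the outcome per unit change in that predictor, holding the other predictors constant.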

## Assumptions

As with any type of statistical analysis, there are assumptions about the data that should be met in order to interpret the pattern in the data effectively. There are four main assumptions for all linear regressions.

*Linearity* is the assumption that the data follow a straight-line pattern. This is a fairly easy assumption to test. On a plot of the data for a simple linear regression, the points should generally follow a line of some slope. They should not curve up or down; that is a case in which you would use quadratic regression, which we won’t discuss here.

The second assumption is *independence*. This means that the variables are independent of one another. This is particularly important for multiple linear regression. Since you have multiple predictors, you should look to see if any two predictors have a high correlation with each other. If they do, their relationship will mask any other relationships.

*Normality* is the third assumption. You should make sure that the residuals (the difference between the data point and the line of best fit) are normally distributed. This means that the average difference can be interpreted according to the normal distribution. It also means that various other tests, like the *t*-test can be used for analysis of difference.

The final assumption is *equal variance*. The data should follow the line, and the residuals should not get bigger or smaller as the predictor increases. If the data are heteroscedastic, linear regression will give you the wrong idea about patterns in the data. It often also means that there is another important variable that you are not considering.
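A quick way to screen for the linearity and equal-variance assumptions is to inspect the residuals directly. This sketch fits a line to invented data and checks that the residuals center on zero and have a similar spread across the range of the predictor (the half-by-half comparison is a rough heuristic, not a formal test):

```python
import numpy as np

# Residual checks for linearity / equal variance, on invented data.
x = np.array([8, 10, 12, 14, 16, 18, 20], dtype=float)
y = np.array([14, 18, 25, 27, 33, 38, 41], dtype=float)

b, a = np.polyfit(x, y, 1)        # slope, intercept of the fitted line
residuals = y - (a + b * x)

# Least-squares residuals should average to (essentially) zero.
print("mean residual:", residuals.mean())

# Rough equal-variance screen: compare the spread of residuals in the
# first and second halves of the data, ordered by the predictor.
lo, hi = residuals[: len(x) // 2], residuals[len(x) // 2:]
print("spread (first half vs second half):", lo.std(), hi.std())
```

A residual plot (residuals against the predictor) makes the same checks visible at a glance and is the standard diagnostic picture.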

## Diagnostics

Once you have your model with actual numbers, it comes time to diagnose whether the model represents the pattern in the data. There are two main diagnostic tools for linear regression, *r* and *R*^{2}.

*r* represents the strength of the correlation between X and Y. It can vary between -1.0 and +1.0; the closer *r* is to either end, the stronger the correlation. A value of 0.99 means that the points all lie very close to the line of best fit and show a strong positive relationship. A value of 0.3 means that the data points vary away from the line of best fit, meaning that the model is not a good one. A word of caution with *r*: it does not explain how much the variables affect each other.

However, *R ^{2}* does.

*R*^{2} represents how much of the variance in one variable is explained by the variance in the other: essentially, how closely one variable affects the other. It is calculated just like you would think, by squaring the correlation. In the case of simple regression it is written *r*^{2}, but in multiple linear regression it is written *R*^{2} because it accounts for multiple correlations.

Both of them are interpreted based on their magnitude. A value of 0.0-0.3 is considered a weak correlation and a poor model. 0.4-0.6 is considered a moderate fit and an OK model. Any value higher than 0.7 is considered a strong correlation, which means that the model is a good one.
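For the simple-regression case, both diagnostics are one line each in Python. This sketch reuses the same style of invented data as the earlier examples:

```python
import numpy as np

# Pearson r and R^2 for a simple regression (invented data).
x = np.array([8, 10, 12, 14, 16, 18, 20], dtype=float)
y = np.array([14, 18, 25, 27, 33, 38, 41], dtype=float)

r = np.corrcoef(x, y)[0, 1]   # correlation between X and Y
r_squared = r ** 2            # share of Y's variance explained by X

print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
```

On this toy data both values come out well above 0.7, which by the rule of thumb above would count as a strong correlation and a good model.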

There are other tests of goodness of fit, such as chi-square and the *F* statistic, but they are beyond the scope of this post.

## The Takeaways

Linear regressions are a means of figuring out how variables in the data predict and explain the outcome. There are multiple types of regression based on the number of predictors. Each method has assumptions, such as linearity, that must be addressed when interpreting the model and the data. There are also a variety of diagnostic tools for explaining the model and judging whether it is a good one. I’d be stoked to see your questions below. Happy statistics!
