In life and statistics, a lot of questions are way more complicated than they appear on the surface. Take buying a car. There are a few things to consider beyond how wicked awesome the car looks flying down the highway. Multiple regression analysis is how you can statistically consider numerous things simultaneously.
In a previous post, we learned how one variable can predict an outcome. However, sometimes more than one thing can predict an outcome. For example, the gas mileage on a car is affected by several things like the weight of the car and horsepower of the engine.
To really know how two or more variables predict an outcome, you need a statistical method called multiple regression.
Multiple regression involves creating a model that shows how much each predictor explains the outcome. So, in the case of our car, how much of the gas mileage is because of the weight of the car and how much is because of the horsepower of the engine?
The Multiple Regression Model
The standard multiple regression model looks like

$$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + \varepsilon_i$$
Don’t be scared, it’s just an equation. A very, very long equation.
In this equation, Yi represents the outcome variable; b0 represents the constant, which is the predicted value of the outcome when all the predictors are hypothetically zero; the other b’s represent the coefficients that tell you how much each variable affects the outcome; and εi represents the error that goes with each observation.
Now, the equation is not just putting two equations together. So let’s take a closer look at how the weight of a car affects the mileage.
When you run the linear regression for just the ability of the car’s weight to predict the gas mileage, you build the following model
When you run the linear regression for just the ability of the car’s horsepower to predict the gas mileage, you build the following model
It is tempting to think that a model with both predictors would be
But three variables mean three dimensions to consider. That means a graph would look like this
I know right?! It looks super cool! But the line of best fit thing is a lot trickier for three dimensions than it is for two.
Essentially, one finds the plane or multidimensional shape of best fit for all the predictors. Mathematically, you have to find the fit that keeps each point as close as possible to that plane along two or more predictor axes at the same time.
What this means for us is the multiple regression model is not just some linear regression equations put together. It is a separate calculation of linear regressions in relation to one another. In the case of our car example
It is similar but still different. It is different enough to affect car buying decisions, but just how good is the model itself?
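To see why the combined model is not just two simple regressions glued together, here is a minimal sketch in Python with NumPy. The data are simulated, not the car data from this post: the variable names, the sample size, and the "true" coefficients used to generate the numbers are all assumptions made for illustration. The helper `fit_ols` is a hypothetical name, not a library function.

```python
import numpy as np

# Simulated (hypothetical) cars: heavier cars tend to have more horsepower,
# so the two predictors are correlated with each other.
rng = np.random.default_rng(0)
n = 200
weight = rng.normal(3.2, 0.9, n)                      # assumed units: 1000 lbs
horsepower = 60 + 30 * weight + rng.normal(0, 25, n)  # correlated with weight
mpg = 37.0 - 4.0 * weight - 0.03 * horsepower + rng.normal(0, 2.0, n)


def fit_ols(predictors, outcome):
    """Least-squares fit with an intercept; returns [b0, b1, ...]."""
    X = np.column_stack([np.ones(len(outcome))] + list(predictors))
    coeffs, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coeffs


b_weight_only = fit_ols([weight], mpg)        # mpg predicted by weight alone
b_hp_only = fit_ols([horsepower], mpg)        # mpg predicted by horsepower alone
b_both = fit_ols([weight, horsepower], mpg)   # both predictors in one model

print("weight alone:     slope =", round(b_weight_only[1], 3))
print("horsepower alone: slope =", round(b_hp_only[1], 4))
print("both together:    weight =", round(b_both[1], 3),
      " horsepower =", round(b_both[2], 4))
```

In this simulation, each single-predictor slope soaks up part of the other predictor's effect, so both come out more negative than the coefficients from the joint model. The joint fit estimates each coefficient while holding the other predictor fixed.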
How Good is My Model?
As with all statistical models, multiple regression models vary in the strength of their ability to predict the outcome based on the data collected in the sample. Multiple R is a value that is analogous to the correlation coefficient for linear regression since it tells us the relationship that our predicted values have to actual values.
In fact, multiple R and R2 can be interpreted the same way. Values of R2 < 0.3 are generally considered poor, while values between 0.4 and 0.6 are moderate, and anything above 0.7 is wicked awesome, ahem, strong. Of course, these are all “rule of thumb” figures.
One cautionary note with R and R2 is that the number of predictors has an effect on both of them. In general, the more predictors that you try to incorporate into the model, the higher your values for R and R2 will tend to be, even when the extra predictors have no real relationship with the outcome.
Therefore, try to only include predictors that previous research indicates may have an effect on the outcome. No sense in muddying the waters unnecessarily.
In our car example, the value of R2 is 0.83, which indicates that the car’s weight and horsepower together strongly predict the car’s gas mileage. The negative direction of each relationship comes from the signs of the coefficients, not from R2 itself.
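As a rough sketch of how these two numbers relate, the Python snippet below fits a model on simulated data and computes multiple R as the correlation between the predicted and actual values, then R2 from the sums of squares. The data and coefficients are made up for illustration, so the values it prints will not match the 0.83 from the car example.

```python
import numpy as np

# Simulated (hypothetical) data; these are not the cars from the post.
rng = np.random.default_rng(1)
n = 150
weight = rng.normal(3.2, 0.9, n)
horsepower = rng.normal(150, 40, n)
mpg = 37.0 - 4.0 * weight - 0.03 * horsepower + rng.normal(0, 2.0, n)

# Fit the model with an intercept column, then compute predicted values.
X = np.column_stack([np.ones(n), weight, horsepower])
coeffs, *_ = np.linalg.lstsq(X, mpg, rcond=None)
predicted = X @ coeffs

# R2: share of the outcome's variation that the model accounts for.
ss_residual = np.sum((mpg - predicted) ** 2)
ss_total = np.sum((mpg - mpg.mean()) ** 2)
r_squared = 1 - ss_residual / ss_total

# Multiple R: correlation between predicted and actual values.
multiple_r = np.corrcoef(mpg, predicted)[0, 1]

print("multiple R:", round(multiple_r, 3))
print("R2:        ", round(r_squared, 3))
```

For a least-squares model that includes a constant, squaring multiple R gives the same number as R2, which is why the two can be read together.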
You Know What Happens When You Assume? You Get a Model!
There are specific tests to diagnose the accuracy and significance of your model that are complex and a tad beyond the scope of this post. But there are some general assumptions that you should consider when interpreting your model.
Before you even start the method, you should consider your variable types. Do you have continuous variables like horsepower or categorical variables like model year? A continuous variable gives a nice, predictable scale, but model year is a little more constrained because it doesn’t really tell us how much the engine has been used.
Are any of the predictors “perfect”? What I mean is, do all the measurements for a predictor have the same value? If so, that predictor has no variation, so it cannot tell you anything about why the outcome changes.
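One quick way to screen for such a predictor is sketched below. The five rows of numbers are invented purely for illustration, and `wheels` is a hypothetical column added to show a predictor that never changes.

```python
import numpy as np

# Invented example values; "wheels" is the same for every car.
predictors = {
    "weight": np.array([2.6, 3.2, 3.4, 2.9, 3.8]),
    "horsepower": np.array([110.0, 175.0, 105.0, 150.0, 245.0]),
    "wheels": np.array([4.0, 4.0, 4.0, 4.0, 4.0]),
}

# A predictor with only one distinct value has zero variance.
constant_predictors = [
    name for name, values in predictors.items()
    if np.unique(values).size == 1
]

print(constant_predictors)  # ['wheels']
```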
The Effect of External Variables
The goal of science, both physical and social, is to understand the unknown. That means separating the known from the unknown. Multiple regression analysis should clearly define predictor variables that are not affected too heavily by outside factors. If you think something, like race or gender, would affect your analysis, it should be included in your model; otherwise it becomes a source of unexplained error.
Not only is this word fun to say, it is essential to justify treating our data and analysis as “normal”. In a nutshell, it means that the differences between the actual data and our line/plane of best fit (the residuals) follow a normal distribution. There is more to it, but that is for another post.
No Perfect Multicollinearity
The idea of multiple regression is that two or more predictors are each related to the outcome in some way. But if two of the predictors are perfectly related to each other (called perfect multicollinearity), then they predict more about each other than either of them does about the outcome, and the model cannot separate their individual effects. This means you don’t have a useful, generalizable model. It is fine for the predictors to correlate a bit, but not too well.
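Here is a small sketch of what perfect multicollinearity looks like numerically. It uses simulated data and a deliberately silly setup (the same weight recorded in both pounds and kilograms) as an assumed example. The check is the rank of the predictor matrix: if the rank is lower than the number of columns, at least one column is redundant.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
weight_lbs = rng.normal(3200, 900, n)   # simulated weights in pounds
weight_kg = weight_lbs * 0.45359237     # same information, different unit
horsepower = rng.normal(150, 40, n)     # simulated, unrelated to weight here

# Intercept column plus two predictors in each case.
X_redundant = np.column_stack([np.ones(n), weight_lbs, weight_kg])
X_usable = np.column_stack([np.ones(n), weight_lbs, horsepower])

# Correlation of exactly 1 (up to rounding) between the two weight columns.
corr_weights = np.corrcoef(weight_lbs, weight_kg)[0, 1]

# Rank below the column count signals perfectly related predictors.
rank_redundant = np.linalg.matrix_rank(X_redundant)
rank_usable = np.linalg.matrix_rank(X_usable)

print("correlation between the two weight columns:", corr_weights)
print("rank with redundant predictors:", rank_redundant, "of 3 columns")
print("rank with distinct predictors: ", rank_usable, "of 3 columns")
```

With the redundant columns there is no single best pair of coefficients, because any split of the effect between pounds and kilograms fits equally well. Real data rarely show a perfect relationship like this, so a correlation check between predictors is usually the more practical first look.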
I was trained as a chemist, so I used to believe that it was impossible to account for all the random errors in people when doing social science research. But that was before I really got into statistics (and boy, do I really dig statistics). People and social events have errors just like particles and planetary motion, but we have to make sure that they are random. When doing multiple regression analysis, certain tests check to make sure that the errors are truly random and not related. If they were related, we would need to adjust our analysis and our model.
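The post does not name a specific test, so as one illustration the sketch below uses the Durbin-Watson statistic, a common check for whether each error is related to the one before it. It assumes the observations have a meaningful order (for example, time). The error sequences here are simulated, and `durbin_watson` is written out by hand for this example.

```python
import numpy as np


def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squares."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)


rng = np.random.default_rng(3)

# Errors that really are independent of one another.
independent_errors = rng.normal(0, 1, 500)

# Errors where each value builds on the previous one (a random walk).
related_errors = np.cumsum(rng.normal(0, 1, 500))

dw_independent = durbin_watson(independent_errors)
dw_related = durbin_watson(related_errors)

print("independent errors:", round(dw_independent, 2))
print("related errors:    ", round(dw_related, 2))
```

Values near 2 suggest neighboring errors are unrelated, while values close to 0 suggest each error tends to follow the one before it. In a real analysis you would pass in the residuals from your fitted model instead of simulated sequences.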
Ultimately, multiple regression analysis is another statistical tool in your toolbox. It comes in many forms that allow you to model the various relationships between numerous variables. Use it wisely, and you will do great things with it. Like, buy an awesome car with great gas mileage!