I like my data to be nice and straight. Know why? Because it is easy to predict outcomes that way. But data doesn’t always come in a clean line. Sometimes, it’s how likely something happens that we want to know instead of an actual outcome. Enter **logistic regression**.

## The Gist of Logistic Regression

Recall that variables come in the *categorical* and *continuous* varieties. **Continuous variables** exist on some sort of scale, like height or gas mileage. **Categorical variables** exist as discrete units, like binary choices or multiple-choice answers.

Each of these types of variables has particular statistics associated with it due to its nature.

Linear regression models deal with variables that are continuous, meaning the outcome sits on some sort of scale. For example, how many faces can I paint in an hour, or how fast does ink flow from my Pilot VP based on the temperature and air pressure of the room? I could analyze correlational data to predict an outcome.

However, categorical data does not exist on a traditional scale or line, so it *cannot* use traditional linear regression.

The first thing we are going to look at is *binary logistic regression*. Let’s take a look at some data. In the graph below, you have asked a group of people if they think the temperature (in Fahrenheit) is too hot, which is shown with a response of 1.

Notice how the data doesn’t look like a normal scatter plot? The data points aren’t all over the place. They all focus on either 1 or 0 because those are the only possible answers. So, we cannot calculate the typical line like we would for linear regression because there is no “sort of” response.

Instead, we calculate the *probability* of a person responding either 1 or 0. That is what binary logistic regression is for. We use it to calculate the chance of a person responding to one of two choices.

The general model for binary logistic regression is

P(Y) = 1 / (1 + *e*^(−(b₀ + b₁X₁ + … + bₙXₙ)))

The P(Y) means that we are calculating the probability of an outcome of 1 occurring. The use of *e* in the denominator brings in the exponential function, which bends the line of best fit into an S-shaped curve that stays between 0 and 1.

Notice that the linear regression equation still makes an appearance in the denominator as an exponent of *e*. This means that the concepts are related, but not identical. But that is for another day.
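As a quick sketch of how that equation works in practice, here is the formula turning a temperature into a probability. The coefficients are made up for illustration (chosen so the probability crosses 0.5 near 78 °F); real values would come from fitting the model to survey data.

```python
import math

def p_too_hot(temp_f, b0=-35.0, b1=0.45):
    """P(Y = 1) from the binary logistic model.

    b0 and b1 are hypothetical coefficients, not estimates
    from real data.
    """
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * temp_f)))

print(p_too_hot(60))  # cool room: probability near 0
print(p_too_hot(90))  # hot room: probability near 1
```

Notice the output is never exactly 0 or 1, just a probability squeezed between them, which is exactly what the S-shaped curve promises.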

So after running the analysis, our model looks like this…
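The fitted model itself isn't reproduced here, but here is a minimal sketch of how such an analysis could estimate its coefficients, using made-up survey responses and plain gradient ascent on the log-likelihood. A real analysis would use a statistics package; the data below is invented, not the post's actual survey.

```python
import math

# Hypothetical responses: temperature in Fahrenheit and whether
# the person answered "too hot" (1) or not (0).
temps = [60, 65, 70, 72, 75, 80, 85, 88, 90, 95]
too_hot = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Center the temperatures so gradient ascent stays stable.
mean_t = sum(temps) / len(temps)
xs = [t - mean_t for t in temps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = 0.0, 0.0  # intercept and slope, starting from zero
rate = 0.001       # step size
for _ in range(5000):
    preds = [sigmoid(b0 + b1 * x) for x in xs]
    # Gradient of the log-likelihood with respect to b0 and b1.
    b0 += rate * sum(y - p for y, p in zip(too_hot, preds))
    b1 += rate * sum((y - p) * x for y, p, x in zip(too_hot, preds, xs))

# Probability of "too hot" at 90 F under the fitted sketch model.
print(sigmoid(b0 + b1 * (90 - mean_t)))
```

The slope comes out positive, which matches intuition: hotter rooms make a "too hot" response more likely.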

## But Wait! Is My Model Any Good?

Unfortunately, that darn logistic regression model can't be interpreted directly like a linear regression model using *R*². Since we are interested in the probability of an outcome instead of the actual outcome, we consider the *likelihood* of an outcome. This is called the **log-likelihood**.

The log-likelihood refers to comparing the probability of the *predicted* outcome to the *actual* outcome for each person, then adding up the logs of those probabilities so that we can get a sense of how well our model explains *all* the responses we observed.
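To make that concrete, here is a small sketch with invented numbers: for each person, take the log of the probability the model assigned to the answer they actually gave, then sum across everyone.

```python
import math

# Hypothetical predicted probabilities of answering 1 ("too hot"),
# paired with the actual 0/1 responses. Invented for illustration.
predicted = [0.1, 0.3, 0.6, 0.8, 0.9]
actual = [0, 0, 1, 1, 1]

# Log of the probability the model gave the observed answer:
# p if the person said 1, and (1 - p) if the person said 0.
log_likelihood = sum(
    math.log(p) if y == 1 else math.log(1 - p)
    for p, y in zip(predicted, actual)
)
print(log_likelihood)  # always <= 0; closer to 0 means a better fit
```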

Whoa. That is a lot. Let’s take a moment to break that down.

Remember that a statistical model should explain the behaviors happening in a population. Since it can’t explain all of it, we want it to explain most of it. The same is true for the probability of an event occurring.

Ideally, the probability of an actual event occurring and the predicted probability of the same event occurring should match up. This means that our model would explain all the probability and leave no room for unpredicted responses. But as we see from our graph, that is not true.

So, the log-likelihood statistic (the one most often reported in research) represents how well the model fits the data. A high log-likelihood (close to zero) means the model does a good job of explaining the probability of the responses. A low log-likelihood (a large negative number) means that the model doesn't do a good job and should be revisited in future research.
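As a sketch of that comparison (again with invented numbers), a model whose predicted probabilities track the actual answers ends up with a log-likelihood much closer to zero than a model that just hedges at 50/50:

```python
import math

actual = [0, 0, 1, 1, 1]  # hypothetical observed responses

def log_likelihood(predicted, actual):
    """Sum of logged probabilities assigned to the observed answers."""
    return sum(
        math.log(p) if y == 1 else math.log(1 - p)
        for p, y in zip(predicted, actual)
    )

confident = [0.05, 0.1, 0.9, 0.95, 0.9]  # tracks the real answers
hedging = [0.5] * 5                       # coin-flip model

print(log_likelihood(confident, actual))  # close to 0: good fit
print(log_likelihood(hedging, actual))    # far below 0: poor fit
```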

There are some other ways to predict the validity of a model, but they are beyond the scope of this post.

## Should I Even Use Logistic Regression?

Now that we have reviewed what logistic regression is and what a good model means, we should check if it is even worth doing. This means checking some initial assumptions.

First, is your data *linear*? If you can, you should always create a plot of your data to get a sense of the overall shape of your data. If it follows the shape of a line, then linear regression is the way to go. That is, of course, provided your outcome is continuous instead of categorical.

Second, make sure that your data has *random errors*. Just as with linear regression, any variable not accounted for in the regression equation should be random so that it doesn’t mess with your model.

Third, you should check for *multicollinearity*. Your predictors shouldn't be too related to one another. Otherwise, they explain more about each other than they do about the outcome (or the probability of the outcome).
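A rough way to eyeball that is the correlation between each pair of predictors. This sketch uses two made-up predictors (formal checks often use variance inflation factors instead, which is beyond this post):

```python
import math

# Two hypothetical predictors: room temperature and humidity.
temp = [60, 65, 70, 75, 80, 85, 90]
humidity = [30, 35, 42, 50, 55, 63, 70]

def pearson_r(xs, ys):
    """Pearson correlation between two predictor lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A correlation near 1 (or -1) flags that the two predictors
# overlap too much to both stay in the model.
print(pearson_r(temp, humidity))
```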

Ultimately, logistic regression is still a method of using predictors to predict outcomes. The biggest difference is that the predictors predict the *probability* of a specific outcome instead of a specific outcome. They can’t be interpreted the same way as linear models, but they are not some super scary statistical nightmare either. They also have some things to check for before you go rushing in. Don’t worry, you can do it! Happy statistics!
