While perusing books in the library, I was drinking a soda and wondered, “Do I read more books when I drink soda?” Now, I drink a lot of soda and I read a lot of books, but I don’t know if the two are related. So, I asked a bunch of people how much soda they drank and how many books they read. I am looking for a *correlation* between the two, which means that I need to find the **Pearson correlation coefficient** between sodas drunk and books read. In statistics, the Pearson correlation coefficient measures the strength and direction of this kind of (linear) relationship.

## What Is Correlation?

A correlation is a mathematical relationship between two variables. There are several types of correlation including positive, negative, and no correlation. Let’s take a look at some of the types below.

In this scatter plot of the independent variable (X) and the dependent variable (Y), the points follow a generally upward trend. This is known as a *positive* correlation. If we were to graph a line of best fit, then we would notice that the line has a positive slope. Sometimes, this is called a *direct relationship*.

The points in this scatter plot have a negative correlation. We know this because the points follow a generally downward trend. If we were to determine the line of best fit, we would get a negative slope. This is generally interpreted to mean that as one variable increases, the other decreases. Sometimes, this relationship is referred to as an *inverse relationship* (or an *indirect relationship*).

This scatter plot represents a very low or no correlation. Notice that there is no apparent rhyme or reason to the data; it is just scattered. If we were to determine the line of best fit for this data, then the line would have a slope very close to zero.

There are other types of relationship as well, such as quadratic (curved) relationships, along with partial correlations, which come into play when there are more than two variables.

## The Pearson Correlation Coefficient

The Pearson correlation coefficient (usually just referred to as the correlation coefficient) is a numerical measure of the linear correlation between a dependent and an independent variable. It results from analyzing how far the X and Y values (the independent and dependent variables, respectively) fall from their respective means.

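The overall equation to calculate the Pearson correlation coefficient, written here in its standard deviation-from-the-mean form, is

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

where $x_i$ and $y_i$ are the individual data points, $\bar{x}$ and $\bar{y}$ are the sample means, and $n$ is the number of data pairs.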

Don’t be overwhelmed by the equation. I know that there are a lot of letters there, but each of them has a purpose and meaning behind it.

The numerator represents how X and Y deviate from their means *together*, while the denominator represents how X and Y deviate from their means *individually*. Comparing the two gives us an idea of how the deviation in X relates to the deviation in Y.
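If it helps to see the pieces in action, here is a minimal Python sketch of the calculation, assuming the standard deviation-from-the-mean form of the formula. The function name `pearson_r` and the numbers are made up for illustration; they are not my survey data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]  # deviations of X from its mean
    dy = [y - mean_y for y in ys]  # deviations of Y from its mean
    # Numerator: how X and Y deviate from their means together.
    together = sum(a * b for a, b in zip(dx, dy))
    # Denominator: how X and Y deviate from their means individually.
    individually = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if individually == 0:
        raise ValueError("r is undefined when one variable never changes")
    return together / individually

# Hypothetical numbers for illustration only.
sodas = [1, 2, 3, 4, 5]
books = [2, 4, 5, 4, 5]
r = pearson_r(sodas, books)
print(round(r, 3))  # → 0.775
```

For these numbers the numerator works out to 6 and the denominator to the square root of 60, which is where the 0.775 comes from.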

### Values of r

The Pearson correlation coefficient is represented by the letter *r*. It ranges from -1.0 to +1.0. In fact, if your value for *r* is greater than 1 or less than -1, then you probably need to check your math.

If *r* is greater than zero (*r* > 0), then you have a positive correlation. If *r* is less than zero (*r* < 0), then you have (you guessed it) a negative correlation. As the value of *r* approaches zero, the correlation becomes weaker. This is because the points scatter more widely around the line of best fit, so the line tells you less about where any one point will land. A value of +1 or -1 indicates a perfect fit; in other words, the data all fall on the line perfectly. In the social sciences, this rarely happens.
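To see these values of *r* for yourself, here is a small Python sketch using made-up points (again, not my survey data). The helper `pearson_r` is my own illustrative function, assuming the standard deviation-from-the-mean formula. Notice that two perfect lines with very different steepness both give *r* = 1.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from deviations about the means (illustrative helper)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

xs = [1, 2, 3, 4, 5]
perfect_up = pearson_r(xs, [3, 5, 7, 9, 11])      # points on y = 2x + 1
steeper_up = pearson_r(xs, [10, 20, 30, 40, 50])  # points on y = 10x
perfect_down = pearson_r(xs, [10, 8, 6, 4, 2])    # points on y = -2x + 12
noisy_up = pearson_r(xs, [2, 4, 5, 4, 5])         # upward trend with scatter

print(round(perfect_up, 3), round(steeper_up, 3))  # both are 1.0
print(round(perfect_down, 3))                      # -1.0
print(round(noisy_up, 3))                          # between 0 and 1
```

The sign of *r* follows the direction of the trend, and its size drops below 1 as soon as the points stop sitting exactly on a line.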

A word of warning: *r* is *not* the slope of the line of best fit. However, it does help us determine the slope according to the formula $a = r\,(s_y / s_x)$, where $s_x$ and $s_y$ are the standard deviations of X and Y.
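Here is a quick Python check of that slope formula, using made-up numbers and assuming the line of best fit is the usual least-squares line. The names `slope_from_r` and `slope_direct` are just labels I picked for the sketch.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson r from deviations about the means (illustrative helper)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

# Hypothetical numbers for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
s_x = statistics.stdev(xs)  # sample standard deviation of X
s_y = statistics.stdev(ys)  # sample standard deviation of Y
slope_from_r = r * (s_y / s_x)

# Least-squares slope computed directly, for comparison.
mx, my = statistics.mean(xs), statistics.mean(ys)
slope_direct = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

print(round(r, 3), round(slope_from_r, 3), round(slope_direct, 3))  # → 0.775 0.6 0.6
```

So *r* (about 0.775) and the slope (0.6) are different numbers, but the slope falls right out of *r* once you know the two standard deviations.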

### Example

Let’s take a look at the following data

When you get the data, the first thing that you should do (if possible) is to graph it. Our data looks like

When we take a first look at the data and graph, we notice a generally upward trend. This implies that the data have a positive correlation. If we plug the data into the equation for the Pearson correlation coefficient we get a value of *r* = 0.92.

This represents a very high correlation in the data. We can tell when the correlation is high because the data points hover close to the line of best fit (seen in red).

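Generally, a value of *r* greater than 0.7 (ignoring the sign) is considered a strong correlation. Anything between 0.5 and 0.7 is a moderate correlation, and anything less than 0.4 is considered a weak or no correlation. Values between 0.4 and 0.5 sit in a gray area between weak and moderate. These cutoffs are rules of thumb, and they vary from field to field.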
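If you like, those rules of thumb can be written as a tiny Python helper. This is only a sketch: the function name `describe_strength` is made up, I am assuming the cutoffs apply to the size of *r* regardless of sign, and the "borderline" label for the 0.4 to 0.5 range (which the cutoffs above leave unassigned) is my own placeholder.

```python
def describe_strength(r):
    """Label a correlation coefficient using the rule-of-thumb cutoffs above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and +1; check your math")
    size = abs(r)  # assumption: -0.8 counts as just as strong as +0.8
    if size > 0.7:
        return "strong"
    if size >= 0.5:
        return "moderate"
    if size < 0.4:
        return "weak or none"
    return "borderline"  # hypothetical label for the unassigned 0.4 to 0.5 range

print(describe_strength(0.92))  # → strong
print(describe_strength(-0.6))  # → moderate
print(describe_strength(0.1))   # → weak or none
```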

### Coefficient of Determination

*r* is often used to calculate the **coefficient of determination**. This is represented by *r*². It is calculated by (surprise, surprise) squaring *r*.

*r*² is unique because it reveals how much one variable explains another. More specifically, *r*² describes how much of the variance in one variable is explained by the variance in the other.

However, you must know that, just like *r*, the coefficient of determination is *not* a slope. And it is not used to calculate the slope.
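To make "variance explained" a little more concrete, here is a Python sketch with made-up numbers. It assumes a simple least-squares line of best fit, squares *r*, and then compares that against the share of the variation in Y that the line accounts for. The variable names are my own.

```python
import math

# Hypothetical numbers for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)  # total variation in Y

r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

# Least-squares line of best fit: y = intercept + slope * x
slope = sxy / sxx
intercept = my - slope * mx

# Variation in Y left over after the line has done its best.
leftover = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
explained_share = 1 - leftover / syy

print(round(r_squared, 3), round(explained_share, 3))  # → 0.6 0.6
```

Here the line accounts for 60% of the variation in Y, which matches *r*² exactly. The slope in this example also happens to come out to 0.6, but that is a coincidence of the numbers I picked.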

## The Takeaway

The Pearson correlation coefficient is a numerical expression of the relationship between two variables. It can vary from -1.0 to +1.0, and the closer it is to -1.0 or +1.0 the stronger the correlation. *r* is not the slope of the line of best fit, but it is used to calculate it. I can’t wait to see your questions below! Happy statistics!
