# Pearson Correlation Coefficient

While perusing books in the library, I was drinking a soda and wondered, “Do I read more books when I drink soda?” Now, I drink a lot of soda and I read a lot of books, but I don’t know if the two are related. So, I asked a bunch of people how much soda they drank and how many books they read. I am looking for a correlation between the two, which means that I need to find the Pearson correlation coefficient between sodas drunk and books read. In statistics, the Pearson correlation coefficient is the number that measures this kind of relationship.

## What Is Correlation?

A correlation is a mathematical relationship between two variables. There are several types of correlation including positive, negative, and no correlation. Let’s take a look at some of the types below.

In this scatter plot of the independent variable (X) and the dependent variable (Y), the points follow a generally upward trend. This is known as a positive correlation. If we were to graph a line of best fit, then we would notice that the line has a positive slope. Sometimes, this is called a direct relationship.

The points in this scatter plot have a negative correlation. We know this because the points follow a generally downward trend. If we were to determine the line of best fit, we would get a negative slope. This is generally interpreted to mean that as one variable increases, the other tends to decrease. Sometimes, this relationship is referred to as an inverse relationship.

This scatter plot represents a very low or no correlation. Notice that there is no apparent rhyme or reason to the data; it is just scattered. If we were to determine the line of best fit for this data, then the line would have a slope very close to zero.

There are other types of correlation as well, such as quadratic (curved) correlations, and partial correlations, which apply when there are more than two variables.

## The Pearson Correlation Coefficient

The Pearson correlation coefficient (usually just referred to as the correlation coefficient) is a number that measures the strength and direction of the linear relationship between an independent variable X and a dependent variable Y. It results from analyzing how far each X value and each Y value lies from its own mean.

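The overall equation to calculate the Pearson correlation coefficient, written in its standard form, is

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

where $x_i$ and $y_i$ are the individual data points, $\bar{x}$ and $\bar{y}$ are the means of X and Y, and $n$ is the number of data points.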

Don’t be overwhelmed by the equation. I know that there are a lot of letters there, but each of them has a purpose and meaning behind it.

The numerator measures how X and Y deviate from their means together, while the denominator measures how much X and Y each deviate from their means individually. Dividing one by the other gives us an idea of how the deviation in X relates to the deviation in Y.
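The numerator and denominator described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the standard definition of r; the function name `pearson_r` and the soda and book numbers are made up for the example and are not the survey data from this article.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: how X and Y deviate from their means together.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: how much X and Y each deviate from their own mean.
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    if dev_x == 0 or dev_y == 0:
        raise ValueError("r is undefined when one variable never changes")
    return co_deviation / math.sqrt(dev_x * dev_y)

# Hypothetical answers from five people (illustration only).
sodas = [1, 2, 3, 4, 5]
books = [2, 1, 4, 3, 5]

r = pearson_r(sodas, books)
print(round(r, 2))  # 0.8
```

Note that the function refuses to return a value when one variable is constant, because the denominator would be zero.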

### Values of r

The Pearson correlation coefficient is represented by the letter r. It ranges from -1.0 to +1.0. In fact, if your value for r is greater than 1 or less than -1, then you probably need to check your math.

If r is greater than zero (r > 0), then you have a positive correlation. If r is less than zero (r < 0), then you have (you guessed it) a negative correlation. As the value of r approaches zero, the correlation becomes weaker, because the points scatter more widely around the line of best fit. A value of +1 or -1 indicates a perfect fit; in other words, the data all fall exactly on the line. In the social sciences, this rarely happens.
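Here is a quick way to see these cases for yourself. It is a sketch on made-up numbers, with a hypothetical helper named `pearson_r` that assumes the standard definition of r: two perfect lines (one rising, one falling) and one upward trend with some scatter.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient (standard definition)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    together = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return together / math.sqrt(dev_x * dev_y)

xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]    # every point sits on an upward line
falling = [10 - 3 * x for x in xs]  # every point sits on a downward line
noisy = [3, 1, 4, 2, 6]             # upward trend, but with scatter

r_rising = pearson_r(xs, rising)    # perfect positive fit, r is +1
r_falling = pearson_r(xs, falling)  # perfect negative fit, r is -1
r_noisy = pearson_r(xs, noisy)      # positive, but well short of +1

print(round(r_rising, 2), round(r_falling, 2), round(r_noisy, 2))
```

The two perfect lines have different steepness, yet both reach the extreme values of r. What pulls `r_noisy` toward zero is the scatter, not the steepness of the trend.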

A word of warning: r is not the slope of the line of best fit. However, it does help us determine the slope according to the formula $a = r \cdot (s_y / s_x)$, where $a$ is the slope and $s_x$ and $s_y$ are the standard deviations of X and Y.
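To see the difference between r and the slope in practice, here is a small sketch on made-up numbers (not the survey data from this article). The helper names `pearson_r` and `least_squares_slope` are my own, and I am assuming that the two standard deviations are sample standard deviations, which is what Python's `statistics.stdev` computes.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient (standard definition)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    together = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return together / math.sqrt(dev_x * dev_y)

def least_squares_slope(xs, ys):
    """Slope of the line of best fit, computed directly from the data."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    together = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return together / sum((x - mean_x) ** 2 for x in xs)

# Hypothetical data (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [4, 2, 8, 6, 10]

r = pearson_r(xs, ys)
# Slope recovered from r and the two sample standard deviations.
slope_from_r = r * statistics.stdev(ys) / statistics.stdev(xs)
# Slope computed directly by least squares, for comparison.
slope_direct = least_squares_slope(xs, ys)

print(round(r, 2), round(slope_from_r, 2), round(slope_direct, 2))  # 0.8 1.6 1.6
```

Here r is 0.8 while the slope is 1.6, so the two are clearly different numbers, but scaling r by the ratio of the standard deviations recovers the slope.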

### Example

Let’s take a look at the following data

When you get the data, the first thing that you should do (if possible) is to graph it. Our data looks like

When we take a first look at the data and graph, we notice a generally upward trend. This suggests that the data have a positive correlation. If we plug the data into the equation for the Pearson correlation coefficient, we get a value of r = 0.92.

This represents a very high correlation in the data. We can tell that the correlation is high because the data points cluster closely around the line of best fit (seen in red).

Generally, a value of r greater than 0.7 (ignoring the sign) is considered a strong correlation. Anything between 0.5 and 0.7 is a moderate correlation, and anything less than 0.5 is considered a weak or no correlation. These cutoffs are rules of thumb and vary from field to field.

### Coefficient of Determination

r is often used to calculate the coefficient of determination. This is represented by r². It is calculated by (surprise, surprise) squaring r.

r² is useful because it reveals how much one variable explains another. More specifically, r² is the proportion of the variance in one variable that is explained by its linear relationship with the other. For example, r = 0.92 gives r² ≈ 0.85, so about 85% of the variance in Y is explained by X.
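If you want to check the “variance explained” reading for yourself, the sketch below compares two numbers on made-up data: r squared, and the share of the variation in Y that the line of best fit accounts for. The helper names `pearson_r` and `explained_fraction` are hypothetical, and the line of best fit is assumed to be the ordinary least-squares line.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient (standard definition)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    together = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return together / math.sqrt(dev_x * dev_y)

def explained_fraction(xs, ys):
    """Share of the variation in ys accounted for by the least-squares line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # Total variation of Y around its mean.
    total = sum((y - mean_y) ** 2 for y in ys)
    # Variation left over after the line has made its predictions.
    leftover = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - leftover / total

# Hypothetical data (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

r_squared = pearson_r(xs, ys) ** 2
print(round(r_squared, 2), round(explained_fraction(xs, ys), 2))  # 0.64 0.64
```

Both numbers agree: with r = 0.8, the line of best fit accounts for 64% of the variation in Y.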

However, keep in mind that, just like r, the coefficient of determination is not a slope. Unlike r, it is not used to calculate the slope either.

## The Takeaway

The Pearson correlation coefficient is a numerical expression of the relationship between two variables. It can vary from -1.0 to +1.0, and the closer it is to -1.0 or +1.0, the stronger the correlation. r is not the slope of the line of best fit, but it is used to calculate it. I can’t wait to see your questions below! Happy statistics!