In statistics, you deal with a lot of data, and the hard part is finding patterns that fit it. Several statistical tools can help, but before you use any of them, you should look for basic patterns. You can identify basic patterns using a scatter plot and correlation.
A scatter plot is a map of a bivariate distribution. This means that it is a map of two variables (typically labeled as X and Y) that are paired with each other. You do this because you have some sort of logical reason for connecting the two variables to look for a relationship between them.
For example, let’s say that you are measuring a person’s weight and the amount of water that they drink. In this case, the variables are paired by person. Or say that you want to identify the relationship between the amount of urea in urine and the water pressure of someone’s bladder. See, statistics isn’t limited to math and the social sciences.
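To make "paired by person" concrete, here is a minimal Python sketch. The numbers and the names `weights_kg` and `water_litres` are made up for illustration; they are not real measurements.

```python
# Hypothetical measurements: one entry per person, in the same order.
weights_kg = [58.0, 64.5, 71.2, 80.3, 92.1]
water_litres = [1.6, 1.9, 2.1, 2.4, 2.8]

# The pairing only makes sense if both lists describe the same people.
if len(weights_kg) != len(water_litres):
    raise ValueError("each person needs both a weight and a water value")

# Each tuple is one point on the scatter plot: (X, Y) for one person.
points = list(zip(weights_kg, water_litres))

for x, y in points:
    print(f"weight={x:5.1f} kg  water={y:3.1f} L")
```

Each tuple in `points` becomes one dot on the scatter plot, so shuffling one list without the other would destroy the pairing.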
When we decide that one variable is X and the other is Y, we map the data along a Cartesian plane. If you were to do this for a collection of measurements, you would get something that looks a little like this:
Already, you can spot that the data seems to follow a general pattern: as the osmotic pressure of the bladder increases, the amount of urea increases as well. You have already identified a pattern known as correlation.
There are three ways that data can correlate: positive, negative, and zero.
Positive correlation is when the scatter plot takes a generally upward trend. Sometimes positive correlation is referred to as a direct correlation. Your urea plot is an example of positive correlation. It also means that the line of best fit has a positive slope.
Negative correlations, you guessed it, have a generally downward trend in the scatter plot. This kind of pattern can also be referred to as an inverse (or indirect) relationship. This means that the line of best fit has a negative slope. The amount of vitamin C in cabbage shows a negative relationship to head size.
Zero correlation is also referred to as no correlation. This means that the data has no discernible pattern. This usually implies that the two variables are not linearly related. In this case, the slope of the line of best fit would be very close to zero. The log of a brain protein shows a zero correlation with intra vein size, as you can see below
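If you want to check the direction of a trend numerically, the sketch below computes the least-squares slope of the line of best fit for three tiny data sets. Everything here is an assumption for illustration: the data is invented, `best_fit_slope` and `describe` are hypothetical helper names, and the `tolerance` used to call a slope "close to zero" is an arbitrary choice that depends on the units of your data.

```python
def best_fit_slope(xs, ys):
    """Least-squares slope of the line of best fit through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and
    # sum of squared X deviations (denominator).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are identical; slope is undefined")
    return sxy / sxx


def describe(slope, tolerance=0.1):
    """Label a slope; `tolerance` is an arbitrary, unit-dependent cutoff."""
    if slope > tolerance:
        return "positive"
    if slope < -tolerance:
        return "negative"
    return "zero"


# Invented data, chosen only to show the three cases.
xs = [1, 2, 3, 4, 5, 6]
upward = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1]
downward = [6.0, 5.1, 3.8, 3.2, 1.9, 1.2]
scattered = [3, 1, 5, 5, 1, 3]

for name, ys in [("upward", upward), ("downward", downward), ("scattered", scattered)]:
    slope = best_fit_slope(xs, ys)
    print(f"{name}: slope={slope:.3f} -> {describe(slope)}")
```

The sign of the slope gives the direction of the trend, but its size changes with the units of X and Y, so it says nothing about how tightly the points hug the line.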
Coefficient of Determination
Now, I have been using the word ‘slope’ to refer to the line of best fit, but that does not really tell you the strength of the correlation. To determine the strength of the correlation, use the correlation coefficient.
The correlation coefficient, r, measures the strength and direction of the linear relationship between X and Y. It is the covariance of X and Y divided by the product of their standard deviations. The coefficient of determination, r², gives you an impression of how much of the variation in Y is explained by the variation in X.
r varies between -1.0 and +1.0, while r², being a square, varies between 0 and 1.0. The closer the absolute value of r is to 1, the stronger the correlation is.
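As a rough sketch of how these two numbers can be computed by hand, the Python below calculates the Pearson correlation coefficient and squares it. The function name `pearson_r` and the sample data are assumptions for illustration only; in practice you would likely use a statistics library instead.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired values xs and ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from each mean.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Invented data for illustration only.
xs = [1, 2, 3, 4, 5]
perfect_up = [2, 4, 6, 8, 10]    # lies exactly on a rising line
perfect_down = [10, 8, 6, 4, 2]  # lies exactly on a falling line
noisy = [2.2, 3.9, 6.5, 7.4, 10.3]

for name, ys in [("perfect_up", perfect_up), ("perfect_down", perfect_down), ("noisy", noisy)]:
    r = pearson_r(xs, ys)
    r_squared = r ** 2  # coefficient of determination
    print(f"{name}: r={r:.3f}, r^2={r_squared:.3f}")
```

Note that r keeps the sign of the trend, while r² throws the sign away, so r² alone cannot tell you whether a relationship is positive or negative.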
Scatter plots are a method of mapping one variable against another. This map allows you to see the relationship that exists between the two variables. The relationship can be positive, negative, or zero. Its strength is numerically represented by the correlation coefficient and the coefficient of determination.
I hope that this post has helped a bit and I look forward to seeing your questions below. Happy statistics!