One type of graph you may well see on GRE Data Interpretation questions is the scatterplot, along with its associated idea, the best fit line. Let’s talk about how these beasts operate!
To begin, let’s review scatterplots. When each data point (each person, each car, each company, etc.) gives you a value for two different variables, then you can graph each data point on a scatterplot. Here’s an example. Suppose we survey ten students who came from the same high school to the same college. We ask each student for their total SAT score (M + CR + W) and their GPA in the first semester of their freshman year in college. Each student appears below as a single dot, the location of which shows that student’s SAT score and first semester GPA.
As one would expect, there’s a general “upward” trend: students with higher SAT scores tended to perform better in their freshman year of college. At the same time, there’s some chance variation: right in the middle, three students all scored in the 1700s on their SATs but, for whatever reason, had different results in the first semester of their freshman year.
A Best Fit Line
We see there’s a general “upward” pattern to this scatterplot. Suppose we wanted to make a prediction based on that pattern. For example, a current high school senior in this high school, planning to attend this same college, would know her SAT score and might be curious about her predicted GPA in her upcoming freshman year of college.
We formalize this pattern by drawing what is sometimes called a “best fit” line. Excel calls this a “trendline.” The official name in Statistics is the Least-Squares Regression Line, but you don’t need to know that. Nor do you need to understand the mathematical details of why this line, as opposed to any other possible line, is in fact the “best fit.”
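For readers who are curious anyway (again, none of this is needed for the GRE), here is a minimal Python sketch of how a least-squares line could be computed. The function name best_fit_line and the five data points are made up for illustration; they are not the ten students in the graph.

```python
def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical SAT scores and GPAs, invented for this sketch only.
sat = [1400, 1500, 1600, 1700, 1800]
gpa = [2.6, 2.9, 3.0, 3.4, 3.6]

slope, intercept = best_fit_line(sat, gpa)
print(f"GPA = {slope:.4f} * SAT + {intercept:.2f}")
```

With these invented numbers the slope comes out positive, which is the “upward” pattern expressed as a number.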
Here’s the same graph with a best fit line.
The best fit line abstracts a common pattern from the individual data points. It represents the expected relationship: if we know a new student’s SAT score, then, on average, what would we predict for that student’s first semester college GPA? One student appears almost exactly on the best fit line (sometimes a data point or two will be on the trendline, and sometimes none will be); in this case, we can say that student’s GPA is more or less what we would expect from her SAT score. Five dots are clearly above the best fit line: these five students had higher GPAs than we would have predicted from their SAT scores. Four dots are below the line: those four students had first semester GPAs lower than we would expect, given their SAT scores. Notice that questions of the form “how many individuals had a higher/lower (y-value) than what we would expect from their (x-value)?” are simply asking you to count dots above or below the best fit line.
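The “count the dots” idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the ten (SAT, GPA) pairs are invented, the line’s slope and intercept are picked by hand rather than fitted, and the tolerance for calling a dot “on the line” is arbitrary. On the GRE you would simply count by eye.

```python
# Hypothetical (SAT, GPA) pairs, invented for illustration only.
students = [
    (1450, 2.8), (1520, 2.4), (1600, 3.1), (1650, 2.94), (1720, 3.4),
    (1750, 3.5), (1780, 2.7), (1850, 3.0), (1900, 3.7), (2000, 3.3),
]

# Assumed best fit line (chosen by hand, not fitted): GPA = SLOPE * SAT + INTERCEPT
SLOPE = 0.002
INTERCEPT = -0.36
TOLERANCE = 0.05  # how close a dot must be to count as "on" the line


def predicted_gpa(sat):
    """Height of the assumed best fit line at a given SAT score."""
    return SLOPE * sat + INTERCEPT


def count_relative_to_line(points):
    """Count the dots above, below, and (approximately) on the line."""
    above = below = on = 0
    for sat, gpa in points:
        residual = gpa - predicted_gpa(sat)  # actual minus expected
        if residual > TOLERANCE:
            above += 1
        elif residual < -TOLERANCE:
            below += 1
        else:
            on += 1
    return above, below, on


above, below, on = count_relative_to_line(students)
print(above, below, on)  # 5 4 1
```

“Above the line” just means the actual y-value is larger than the line’s y-value at the same x-value, which is what the subtraction checks.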
We also need to distinguish between the data used to generate the line and the new data points predicted by the line. In this case, we used 10 people to generate the best fit line. We have no predictions to make about those 10 people: both their SAT scores and their first semester GPAs are already known, now things of the past. If we are asked for the now-completed first semester GPA of the person who had a 1780 SAT score, we look for that dot: that’s the low dot in the middle of the graph, with a GPA of 2.7 (too much first semester partying for that person?). A very different question is: suppose a new person, a high school senior, has a 1780 SAT score and would like to predict her first semester college GPA. For a prediction, we look not at any individual point but at the line: the line has a y-coordinate of about 3.2 at an SAT score of 1780, so, on average, we would predict a GPA of about 3.2 for this current high school senior.
The dots are the past; the line is the future.
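The two kinds of question can be put side by side in a short Python sketch. The 1780 student’s GPA of 2.7 and the predicted value of about 3.2 come from the discussion above; the other two students and the line’s slope and intercept are assumptions chosen so that the line passes through about 3.2 at 1780. The dictionary lookup also assumes no two past students share an SAT score.

```python
# Past students: SAT score -> actual first semester GPA.
# Only the 1780 -> 2.7 entry is from the article; the rest are hypothetical.
known_gpa = {1720: 3.4, 1750: 3.5, 1780: 2.7}

# Assumed best fit line (hypothetical coefficients): GPA = SLOPE * SAT + INTERCEPT
SLOPE = 0.002
INTERCEPT = -0.36


def actual_gpa(sat):
    """A question about the past: read the value off the dot."""
    return known_gpa[sat]


def predicted_gpa(sat):
    """A question about the future: read the value off the line."""
    return SLOPE * sat + INTERCEPT


print(actual_gpa(1780))               # 2.7
print(round(predicted_gpa(1780), 1))  # 3.2
```

The same SAT score gives two different answers because the two functions answer two different questions.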
Here’s a practice question to test your understanding of the best fit line: http://gre.magoosh.com/questions/2290