Comic by Randall Munroe
Box plots, or box-and-whisker plots, are fantastic little graphs that give you a lot of statistical information in a cute little square. Let’s take a look at the little guy.
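One wicked awesome thing about box plots is that they pack several summary statistics into a neat little package. Recall that the measures of central tendency include the mean, median, and mode of the data. A box plot always shows the median, and many versions (like the one below) mark the mean as well; the mode does not appear. It also shows a few other pieces of data: the quartiles, the range, and any outliers.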
The Basics of the Boxplot
First, let’s look at a boxplot using some data on dogwood trees that I found and supplemented.
On the graph, the vertical line inside the yellow box represents the median value of the data set. In this case, it is 70 inches. The dot beside the line, but still inside the yellow box, represents the mean value of the data. The mean may not be an actual value in the data. Remember from long, long ago in a post far, far away that the mean is actually a statistical model that represents the data.
Just so you know, in a roughly symmetric data set you may not see that little dot, because the mean sits close to the median and the two marks overlap. That is why I supplemented the data.
Now, the yellow part. It represents the middle 50% of the data points, the ones between the 1st and 3rd quartiles. We’ll talk about how really useful this box is in just a minute. I promise.
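If you want to see where the pieces of the box come from, here is a minimal sketch using Python's built-in `statistics` module. The `heights` list is made up for illustration (it is not the dogwood data behind the graph), and the `"inclusive"` quartile method is just one of several conventions, so other software may report slightly different quartiles.

```python
import statistics

# Hypothetical tree heights in inches (NOT the post's actual data set).
heights = [30, 48, 55, 62, 66, 70, 74, 80, 88, 110, 150]

median = statistics.median(heights)  # the line inside the box
mean = statistics.mean(heights)      # the dot inside the box

# quantiles(n=4) returns the three cut points Q1, Q2 (the median), and Q3.
# method="inclusive" is an assumption; other conventions differ slightly.
q1, q2, q3 = statistics.quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1  # the width of the box, which holds the middle 50% of the data

print(f"median={median}, mean={mean:.1f}, Q1={q1}, Q3={q3}, box width={iqr}")
```

In this made-up data, one very tall tree pulls the mean above the median, which is the same reason the dot and the line sit apart on the graph.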
The line at the furthest left represents the lowest value in the data that is not an outlier. In the case of our trees, the smallest is about 30 inches tall. Aww, poor thing. The line at the furthest right represents the highest value that is not an outlier. So our tallest normal tree is a whopping 110 inches.
These lines give you an idea of the range of the data. They are called the whiskers of the plot, which is where the name box-and-whisker plot comes from. Most people now use the shorter, much less adorable name: boxplot.
Image by FunnyCatSite.Com
Now we come to that little open dot at the very furthest right. It represents an outlier. This means that this particular data point is unusual and does not fit the rest of the data set for some reason. Most plotting software flags a point as an outlier when it falls more than 1.5 times the width of the box beyond either edge of the box. If we were conducting some sort of study, we could also describe how far this tree is from the other trees by assigning it a z-score.
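Here is a rough sketch of how the whiskers and outliers can be worked out with the common 1.5 × IQR rule. The function name `whiskers_and_outliers`, the multiplier `k`, and the `heights` data are all hypothetical choices for illustration; the software that drew the graph above may use a different rule.

```python
import statistics

def whiskers_and_outliers(values, k=1.5):
    """Return (low_whisker, high_whisker, outliers) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Anything beyond these "fences" is drawn as a separate outlier dot.
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    # Each whisker stops at the most extreme value still inside the fences.
    return min(inside), max(inside), outliers

# Hypothetical tree heights in inches (NOT the post's actual data set).
heights = [30, 48, 55, 62, 66, 70, 74, 80, 88, 110, 150]
print(whiskers_and_outliers(heights))  # (30, 110, [150])
```

With this made-up data the whiskers stop at 30 and 110 inches, and the 150-inch tree is left on its own as the open dot.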
What the Boxplot Means
Now that we have discussed how to read the boxplot, let’s talk about how to interpret it like really good stats students! Let’s take a look at something more interesting than trees… date night! We are going to look at how much of the total bill men and women pay on common date nights.
First, notice that there are two sets of boxplots: one for males and one for females. Boxplots make comparing groups side by side much more efficient. It is easy to see that males and females typically pay different amounts of the total bill for date night, except on Saturday.
Second, given the much longer “whiskers” for men, we can interpret that they vary more widely in the amount of money that they spend on the date while women tend to center more toward the average except on Saturday night.
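To put numbers on "varies more widely," you can compare the range (whisker tip to whisker tip, when there are no outliers) and the width of the box for each group. This is only a sketch: the `men` and `women` lists are invented amounts, not the date-night data in the graph, and `spread` is a hypothetical helper.

```python
import statistics

def spread(values):
    """Return the range and the box width (IQR) of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"range": max(values) - min(values), "iqr": q3 - q1}

# Invented amounts paid toward the bill (NOT the data behind the graph).
men = [5, 10, 15, 18, 20, 22, 25, 32, 40]
women = [12, 14, 15, 16, 17, 18, 19, 20, 22]

print("men:  ", spread(men))    # wider range and wider box
print("women:", spread(women))  # values cluster closer to the middle
```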
Third is the skew of the data. Skew refers to the asymmetry of your data. If you look at the women for Saturday night, the box and whiskers are pretty even on either side of the median/mean. However, 75% of the data for the men on Friday night is less than $25 of the total bill, but the upper 25% spend up to $40 of the total bill. This data is skewed toward the higher amounts.
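One simple way to check skew using only the numbers a boxplot shows is the quartile (Bowley) skewness coefficient, which compares how far the median sits from each edge of the box. Treat this as a sketch under assumptions: both data lists are invented, and `bowley_skew` is a hypothetical helper name. Note that it only looks at the box, so it will not pick up skew that shows up only in the whiskers.

```python
import statistics

def bowley_skew(values):
    """Quartile skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1).

    Roughly 0 for a symmetric box, positive when the upper half of the
    box is longer (right skew), negative when the lower half is longer.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 + q1 - 2 * q2) / (q3 - q1)

# Invented amounts (NOT the data behind the graph).
lopsided = [5, 8, 10, 12, 14, 18, 25, 33, 40]
symmetric = [10, 12, 14, 16, 18, 20, 22, 24, 26]

print(round(bowley_skew(lopsided), 2))  # positive: stretched toward high values
print(bowley_skew(symmetric))           # 0.0: even on both sides of the median
```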
Finally, we look for outliers. They represent the statistically different data points. While most nights have an outlier, we notice that women have a few more on Thursdays, so men, be prepared.
Boxplots are useful little graphics that contain a lot of information in a very little space. They are best used at the beginning of data analysis to identify early patterns in the data. As we have seen here, though, they are also useful for reporting results in clear and concise ways. Happy boxplotting!