Statisticians point out that it’s often useful to “chunk” data to understand it. What does it mean to “chunk” data? It means dividing a long list into smaller chunks so that, with a few well-chosen numbers, we can get a sense of the layout of the list.
The fundamental “chunking” number is the median. The median is the middle of the list; that is, it divides the list into two chunks of equal size, an upper list and a lower list. This one number marks the boundary between them: nothing in the lower list is larger than the median, and nothing in the upper list is smaller.
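To make the split concrete, here is a minimal sketch in Python. The numbers in `values` are made up for illustration, and the names `lower` and `upper` are just labels for the two chunks.

```python
from statistics import median

# Made-up numbers, purely for illustration.
values = [3, 8, 1, 9, 4, 7]

m = median(values)  # middle of the sorted list: 1, 3, 4, 7, 8, 9

# Split the list into the chunk below the median and the chunk above it.
lower = [v for v in values if v < m]
upper = [v for v in values if v > m]

print("median:", m)
print("lower chunk:", sorted(lower))
print("upper chunk:", sorted(upper))
```

With an even number of entries, as here, the median is the average of the two middle entries, so it sits between the two chunks rather than belonging to either one.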
Quartiles
Quartiles extend this idea. First, find the median, which divides the entire list into a “top 50%” list and a “bottom 50%” list. Now, find the median of each of these lists. The median of the “bottom 50%” is called Q1, the first quartile. The median of the “top 50%” is called Q3, the third quartile. They are called “quartiles” because the two quartiles and the median divide the list into four equal chunks:
- the lowest 25% of the list is below the first quartile
- the next 25% of the list is between the first quartile and the median
- the next 25% of the list is between the median and third quartile
- the highest 25% is above the third quartile.
Notice that we don’t use the term “second quartile” because the median plays the role of the second quartile.
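The “median of each half” recipe can be sketched in a few lines of Python. This is one possible reading of the method, not the only one: the helper name `quartiles` is hypothetical, the sample data are made up, and the sketch assumes that when the list has an odd number of entries, the middle entry is left out of both halves. Statistical software often uses slightly different conventions, so results may differ a little from tool to tool.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) by taking the median of each half.

    Assumes at least two values. When the count is odd, the middle
    entry is excluded from both halves (one common convention).
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]       # the "bottom 50%"
    upper = data[n - half:]   # the "top 50%"
    return median(lower), median(data), median(upper)

# Made-up data, purely for illustration.
sample = [2, 4, 4, 5, 7, 8, 10, 12]
q1, med, q3 = quartiles(sample)
print("Q1:", q1, "median:", med, "Q3:", q3)
```

For this sample, two of the eight entries fall in each of the four chunks, which is the “four equal chunks” picture described above.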
The Interquartile Range
Often, statisticians are bothered by outliers, that is, extreme high or low values. An outlier is an entry on the list that is not representative of most of the list. In the list of household incomes in the US, the incomes of Bill Gates and Warren Buffett are not representative of the rest of us: they are outliers. Outliers, by definition, will always be at the very top or the very bottom of a list.
Notice that, between them, the “top 50%” and the “bottom 50%” will necessarily contain any outliers. Would it be possible to talk about a “half” of the population that definitely contains no outliers? Well, instead of the “top 50%” or the “bottom 50%”, we could take the “middle 50%”. What’s that? Well, suppose we look at all the folks between the first quartile and the third quartile. We know that a quarter of the population is between the first quartile and the median, and a quarter is between the median and the third quartile, so between the first quartile and the third quartile is 50% of the population, and it’s the 50% that’s in the middle of the population. This is called the interquartile range: the set of data entries from the first quartile to the third quartile. (Strictly speaking, the term usually refers to the width of that span, the third quartile minus the first quartile.) It’s a big deal because it’s not the upper half or lower half but rather the middle half of the data. For this reason, statisticians feel it gives a very good representation of where the typical data lie.
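A short Python sketch shows why the middle half shrugs off outliers. Everything here is hypothetical: the helper name `middle_half`, the income figures (in thousands), and the choice to find each quartile as the median of one half of the sorted list.

```python
from statistics import median

def middle_half(values):
    """Return (Q1, Q3, entries from Q1 to Q3), assuming at least two values."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])               # median of the bottom 50%
    q3 = median(data[len(data) - half:])   # median of the top 50%
    middle = [v for v in data if q1 <= v <= q3]
    return q1, q3, middle

# Hypothetical incomes in thousands; the last entry is an extreme outlier.
incomes = [28, 35, 41, 47, 52, 58, 66, 90_000]
q1, q3, middle = middle_half(incomes)

print("Q1:", q1, "Q3:", q3)
print("middle 50%:", middle)
print("width (Q3 - Q1):", q3 - q1)
```

In this made-up list the middle half holds four of the eight entries, and the extreme value is not among them. Swapping the outlier for an ordinary top value such as 70 leaves both quartiles unchanged, which is the sense in which the middle 50% ignores the extremes.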
An Example with Real Data
Consider the geographic size of countries. On planet Earth, what is the size of a typical country? Well, if we list the countries and their areas, we find the maximum is Russia (16,995,800 sq km) and the minimum is the Holy See (0.44 sq km). Obviously, neither one of those is typical of the area of a country.
The median value on the list is 50,660 sq km (Costa Rica). So that’s interesting: half the countries on Earth have more area than Costa Rica, and half have less. Incidentally, the US State of West Virginia is slightly bigger than this, so little old West Virginia has more area than half the countries on Earth. Who would have thought that? 🙂
The third quartile is 325,360 sq km (Vietnam) and the first quartile is 572 sq km (the Isle of Man). So, even within the interquartile range, there’s huge variation from 572 up to 325,360. Still, we can say half the countries on Earth have more area than the Isle of Man but less area than Vietnam. That’s where the middle 50% lies. That would be, in many ways, the most representative range for the size of a “typical” country.