# Basics of Principal Component Analysis

Imagine that you are looking at a pile of leaves in the forest. How would you go about figuring out which leaves came from which tree? You look for the leaves that are similar and then look for a tree that goes with them. Principal component analysis is a statistical technique for doing the same thing with data. You try to find which items go together because they are the result of something we can’t observe directly: the tree, if you will.

## Factors

Before we get too deep in the forest, we need to get some terms in order. The first is factors. Factors are underlying concepts or perceptions that you cannot directly observe, so you observe their effects on different tests or surveys.

For example, I cannot directly observe how teachers engage their students in a class. Instead, I design a survey for teachers, made up of items that represent what research says engagement should look like in a class. In my line of work, it was a survey of about 20 questions.

The factor that I want to measure is student engagement. The survey items represent the effect of engagement. In a scientific sense, I think of the factor as an independent variable and the items as dependent variables.

### Factor Analysis

Factor analysis is a technique that looks for correlations between items. If I had a survey of 6 items, I could lay out the correlation between every pair of items in a 6 × 6 correlation matrix.

Suppose that in this matrix items x1 – x3 have high correlations with each other, while x4 – x6 have high correlations with each other but not with x1 – x3. Logically, items x1 – x3 are related and x4 – x6 are related. This means that there are two separate concepts that the items are measuring. Those concepts are the factors.
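To make the two-block pattern concrete, here is a minimal sketch that simulates six items driven by two hidden factors and prints their correlation matrix. The sample size, noise level, and variable names are all assumptions chosen for illustration; this is not the survey data discussed in this post.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
n_respondents = 500              # hypothetical sample size

# Two unobserved factors (hypothetical): one drives x1-x3, the other x4-x6.
factor_a = rng.normal(size=n_respondents)
factor_b = rng.normal(size=n_respondents)

# Each item is its factor plus some item-specific noise.
noise = rng.normal(scale=0.6, size=(n_respondents, 6))
items = np.column_stack([factor_a] * 3 + [factor_b] * 3) + noise

# 6 x 6 correlation matrix; rowvar=False treats each column as one item.
corr = np.corrcoef(items, rowvar=False)
print(np.round(corr, 2))
```

In the printed matrix, the top-left and bottom-right 3 × 3 blocks should show high correlations, while the entries linking x1 – x3 to x4 – x6 should sit near zero.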

However, simple factor analysis does not take some things into account. For example, it only analyzes the data itself; it does not take into account the covariance of the items.

## Principal Component Analysis

Principal component analysis is the more mature and robust (a.k.a., thorough) version of factor analysis. Instead of simply looking for correlations of the items, it looks at correlations between the variances of the items.

Variance between the items helps explain how the items are related. For example, if x1 and x4 have a low covariance, then changes in x1 do not explain the changes in x4 very well. In terms of our analysis, they would not load on the same factor very well.
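As a small illustration of high versus low covariance, the sketch below compares x1 with a companion item and with an unrelated one. The eight responses are made up for this example (they are not real survey data), and the variable names are only illustrative.

```python
import numpy as np

# Hypothetical responses from eight people on a 1-5 scale (invented for illustration).
x1 = np.array([1, 2, 3, 4, 5, 4, 2, 5])
x2 = np.array([2, 2, 3, 5, 5, 4, 1, 4])  # tends to rise and fall with x1
x4 = np.array([3, 5, 1, 4, 2, 4, 2, 3])  # moves more or less independently of x1

# np.cov treats each row as one variable and returns a matrix with
# variances on the diagonal and covariances off the diagonal.
cov = np.cov(np.vstack([x1, x2, x4]))
cov_x1_x2 = cov[0, 1]
cov_x1_x4 = cov[0, 2]

print(f"cov(x1, x2) = {cov_x1_x2:.2f}")
print(f"cov(x1, x4) = {cov_x1_x4:.2f}")
```

The first covariance comes out clearly positive and the second sits close to zero, which is the situation where x1 and x4 would not be expected to load on the same factor.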

Principal component analysis has an advantage over traditional factor analysis because it takes into account that the variables may explain each other and how well they do that. If the variances are related, then it makes sense that the items are related.

### Eigenvalues

A term that comes up in a principal component analysis is eigenvalues. Although it is a funny-sounding word, it has a very practical significance in principal component analysis.

An eigenvalue is a number that describes how strongly a linear transformation, represented by a matrix, stretches things along one particular direction (the matching eigenvector). An exact mathematical definition is beyond the scope of this post, but more technical explanations are easy to find. Thankfully, statistics programs generate eigenvalues for us, so I want to focus on the practical side of using eigenvalues rather than a lengthy technical definition.

For our purposes, an eigenvalue results from a manipulation of the matrix of variances from the items. If you look back to our correlation matrix, we could build a similar one from the covariances between items. The eigenvalue would be the degree to which a factor explains the variance between items.

If your factor has an eigenvalue of 3.92 and the analysis was run on the correlation matrix of our six items, then the factor explains 3.92 / 6 ≈ 65% of the total variance in the items. (Each standardized item contributes one unit of variance, so the total is 6.) This means that the factor has a noticeable effect on people’s responses to the items.

If the factor has an eigenvalue of 0.39, then that factor only explains 0.39 / 6 ≈ 6.5% of the variance. That’s not very good at all, so the factor is not really affecting people’s responses. That factor is probably not really there or is measured very, very poorly.

The rule-of-thumb cutoff is an eigenvalue of 1.0. A factor at the cutoff explains only as much variance as a single item does on its own, which is 1 / 6 ≈ 17% of the total for six items. But don’t take this as a hard and fast rule.
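When the analysis is run on a correlation matrix, the eigenvalues add up to the number of items, so dividing each eigenvalue by the item count gives its share of the total variance. Here is a minimal sketch of that arithmetic using NumPy. The correlation matrix is hypothetical (0.7 within each block of three items, 0.1 between blocks), so its eigenvalues will not match the 3.92 and 1.01 quoted in this post.

```python
import numpy as np

n_items = 6

# Hypothetical correlation matrix: 0.7 within each block of three items,
# 0.1 between the two blocks, and 1.0 on the diagonal.
corr = np.full((n_items, n_items), 0.1)
corr[:3, :3] = 0.7
corr[3:, 3:] = 0.7
np.fill_diagonal(corr, 1.0)

# eigvalsh is intended for symmetric matrices and returns eigenvalues
# in ascending order, so reverse them to put the largest first.
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Each standardized item contributes one unit of variance, so the total is n_items.
explained = eigenvalues / n_items

for i, (ev, share) in enumerate(zip(eigenvalues, explained), start=1):
    print(f"component {i}: eigenvalue {ev:.2f}, {share:.1%} of total variance")
```

Statistics packages report the same quantities, usually labeled as eigenvalues and percent (or proportion) of variance explained.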

### Multiple Factors

During principal component analysis, there is oftentimes more than one factor present. This would mean that the items measure more than one construct.

When we look at our original correlation matrix, we see that some items correlate well with each other. This usually indicates that the variances will also correlate. We recognize this in the analysis by looking at the number of eigenvalues that are greater than one.

In the case of our items, one factor has an eigenvalue of 3.92 while a second factor has an eigenvalue of 1.01. There could be as many factors as items, but if we see the third factor as having an eigenvalue of 0.88, then it is not likely that the six items measure three factors.
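The counting step can be sketched in a few lines of Python. The list below holds only the three eigenvalues quoted above (the remaining three are not given in this post), and the names `reported_eigenvalues` and `cutoff` are just illustrative.

```python
# The three largest eigenvalues quoted in the text; the other three are not given.
reported_eigenvalues = [3.92, 1.01, 0.88]
cutoff = 1.0  # rule-of-thumb threshold, not a hard and fast rule

# Keep the factors whose eigenvalue is above the cutoff.
retained = [ev for ev in reported_eigenvalues if ev > cutoff]

# How far each eigenvalue sits from the cutoff, to spot borderline cases.
margins = [round(ev - cutoff, 2) for ev in reported_eigenvalues]

print(f"factors above the cutoff: {len(retained)}")
print(f"distance from the cutoff: {margins}")
```

The second factor clears the cutoff only narrowly, which is a good reminder to treat the rule as a guide and to check whether a borderline factor makes sense for the items that load on it.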

### Limitation

One severe limitation of principal component analysis is that it doesn’t take error into account. Each measure would have its own variance, but also its own error. Principal component analysis assumes that the error of the measurements is zero. Logically and practically, we know this to be false, so each analysis has this grain of salt to consider.

## Takeaway

Principal component analysis is a way of looking for the underlying structure of the data. Certain variables cannot be measured directly, so we measure the effect and work backward to the variable. Principal component analysis determines these factors using a matrix of variances instead of just the raw data. Eigenvalues help determine the number of factors present in the data. Many times, there is more than one factor present, and eigenvalues help separate them.

There is a lot more to all kinds of factor analysis, so I hope to see your questions soon. Happy statistics!