The post Making Sense of Time Series Data appeared first on Magoosh Statistics Blog.

For a quick overview of the topic, you might want to check out Time Series Analysis and Forecasting Definition and Examples first.

Data changes over time. For example, if it’s sunny and 75 degrees outside today, that doesn’t mean that I can expect such nice weather every day. In fact, temperatures might drop to 60 and we could have rain by the end of the month. Six months from now, it could be snowing and a chilly 20 degrees outside!

Time series data may vary for a number of different reasons.

- We know that temperatures vary with the seasons, generally warm in the Summer and cold in the Winter. This kind of change in the data is called **seasonal variation**.
- Sometimes change takes place over longer time periods. For instance, the US economy seems to go through periods of expansion and recession once every decade or so. This is **cyclical variation**.
- Extremely long-term movement of the data, after all short-term fluctuations are averaged out, is the **trend** of the time series. Even though the data may show regular ups and downs throughout its lifetime, there could still be an overall upward or downward trend.
- Finally, the one component of variation that we cannot easily control is the **noise**. Noise in the data arises from a combination of random fluctuations and any other changes in the trend that cannot be accounted for as cyclical or seasonal oscillations. The level of noise affects the certainty of future predictions: the more noise, the less sure we can be of our forecasts.

Each source of variation has its own data series associated with it. Let’s use *Y _{t}* for the original time series data.

- Trend factor: *T*_{t}
- Cyclic factor: *C*_{t}
- Seasonal factor: *S*_{t}
- Noise factor: *N*_{t}

(Not all factors may be required for a particular time series.)

Then *Y _{t}* is the product of the individual factors:

*Y _{t}* = *T*_{t} × *C*_{t} × *S*_{t} × *N*_{t}
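As a quick illustration of this multiplicative model, here is a sketch in Python that builds a synthetic series from a trend, a seasonal factor, and noise. Every number here is invented for illustration (there is no cyclic factor in this example):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(48)                              # four years of hypothetical monthly data

T = 100 + 2.0 * t                              # steady upward trend
S = 1 + 0.3 * np.sin(2 * np.pi * t / 12)       # seasonal factor with period 12
N = rng.normal(1.0, 0.02, size=t.size)         # multiplicative noise centered at 1

Y = T * S * N                                  # Y_t = T_t * S_t * N_t
```

Time series analysis works this construction in reverse: starting from *Y _{t}*, it tries to recover the separate factors.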

Statisticians have developed sophisticated ways to isolate each of the factors. Although you can always use technology to analyze time series data and create a **forecast** (predictions into the future), it’s helpful to know a bit about how the process works.

(For a step-by-step introduction into Excel forecasting, check out: Understanding Time Series Forecasting in Excel.)

There are a number of major components to the analysis:

- Estimation of the trend (*deseasonalization* and *regression*)
- Estimation of the seasonal and cyclical variation (*seasonal index*, etc.)
- Forecasting

Let’s spend a little time getting to know some of the basic techniques now.

How do you account for seasonal or cyclical variation and isolate the underlying trend? First we have to determine the **period(s)** of the oscillation(s).

For example, if the time series data represents monthly mean temperatures, then the seasonal period should be 12. We expect February temperatures to be closer to those from last February (12 months previous) than to those from January (one month previous).

There are four main steps:

- Compute a series of *moving averages* using as many terms as are in the period of the oscillation. If the period is odd, then this is a simple average. But if the period is even, then you need a **centered moving average**.
- Divide the original data *Y*_{t} by the results from step 1.
- Compute the *average seasonal factors*.
- Finally, divide *Y*_{t} by the *(adjusted) seasonal factors* to obtain **deseasonalized data**.

Let’s see how it works! Here is some sample data to work with. Below, you’ll find out how to deseasonalize the data.

This data is clearly affected by season. Let’s isolate oscillations of period 4.

The first step is to create a column for moving averages. Note that your moving averages should be placed in the center of the period that you are working with. That means you would not have enough data to begin a moving average computation until halfway through the first period. Also, you can’t compute moving averages that go beyond the last half of the period in the data set.

Now since our period is even (4), we will compute centered moving averages (CMA). Skip the first two rows, and begin on the third row (Summer 2014). Take the CMA using an average of averages:

[ (1201 + 1053 + 830 + 979)/4 + (1053 + 830 + 979 + 1221)/4 ]/2 = 1018.25

(In Excel, you can use the AVERAGE function to save a lot of time!)
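The same average-of-averages computation can be sketched in Python, using only the five enrollment figures quoted above:

```python
# The five enrollment figures from the example (quarterly data, period 4)
data = [1201, 1053, 830, 979, 1221]

# Centered moving average for the third row: average of two adjacent 4-term averages
cma = (sum(data[0:4]) / 4 + sum(data[1:5]) / 4) / 2
print(cma)  # 1018.25
```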

Then, in the next column, divide the original data by the CMA. Here’s what it should look like so far:

Now you can compute the **seasonal index**, which is an average of the seasonal factors for each season (e.g. month, quarter, day, etc.).

So in our example, we have Fall seasonal factors of (roughly) 0.815, 0.821, and 0.832. Take the average of these to get the Fall seasonal index for enrollment: 0.823. Do the same for Winter, Spring, and Summer. The seasonal indexes are highlighted in color below. Copy and paste these indexes throughout the column.

Finally, divide the original data by the seasonal index to get **deseasonalized** data. This data represents the overall movement of the time series with seasonal effect smoothed out. Typically you would perform a regression on this data to predict the trendline and make forecasts.
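The four steps can be collected into one short Python sketch. The first five enrollment figures are the ones quoted above; the remaining three values, and the function name `deseasonalize`, are made up for illustration:

```python
import numpy as np

def deseasonalize(y, period):
    """Apply the four steps: CMA, ratios, seasonal indexes, deseasonalized data."""
    y = np.asarray(y, dtype=float)
    n, half = len(y), period // 2

    # Step 1: centered moving average (mean of two adjacent period-length means)
    cma = np.full(n, np.nan)
    for i in range(half, n - half):
        cma[i] = (y[i - half:i + half].mean() + y[i - half + 1:i + half + 1].mean()) / 2

    # Step 2: divide the original data by the moving averages
    ratios = y / cma

    # Step 3: average the ratios season by season, then adjust so they average to 1
    index = np.array([np.nanmean(ratios[s::period]) for s in range(period)])
    index = index / index.mean()

    # Step 4: divide the original data by its (adjusted) seasonal index
    return y / np.resize(index, n)

# First five values are from the example above; the last three are invented
enrollment = [1201, 1053, 830, 979, 1221, 1071, 845, 1000]
smooth = deseasonalize(enrollment, period=4)
```

The returned series would then be the input to a regression for estimating the trend.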

The purpose of this article is to explain time series data itself, but a natural next step would be to discuss **forecasting**. Without getting into the details, you can think of forecasting as continuing the trendline into the future and then factoring back in the seasonal/cyclic components. Along with an estimate of how noise will affect the certainty of our predictions, the forecasting methods provide powerful tools for many applications.

Maybe it’s time to play the stock market and see how good our predictions might be!


The post Time Series Analysis and Forecasting Definition and Examples appeared first on Magoosh Statistics Blog.

A **time series** is a set of data recorded at regular times. For example, you might record the outdoor temperature at noon every day for a year.

The movement of the data over time may be due to many independent factors.

- **Long term trend**: the overall movement or general direction of the data, ignoring any short-term effects such as cyclical or seasonal variations. For example, the enrollment trend at a particular university may be a steady climb on average over the past 100 years. This trend may be present despite a few years of loss or stagnant enrollment followed by years of rapid growth.
- **Cyclical movements**: relatively long-term patterns of oscillation in the data. These cycles may take many years to play out. There are various cycles in business economics, some taking 6 years, others taking half a century or more.
- **Seasonal variation**: predictable patterns of ups and downs that occur within a single year and repeat year after year. Temperatures typically show seasonal variation, dropping in the Fall and Winter and rising again in the Spring and Summer.
- **Noise**: every set of data has noise. These are random fluctuations or variations due to uncontrolled factors.

Each factor has an associated data series:

- Trend factor: *T*_{t}
- Cyclic factor: *C*_{t}
- Seasonal factor: *S*_{t}
- Noise factor: *N*_{t}

Finally, the original data series, *Y _{t}*, consists of the product of the individual factors:

*Y _{t}* = *T*_{t} × *C*_{t} × *S*_{t} × *N*_{t}

Often only one of the oscillating factors, *C _{t}* or *S _{t}*, is present in a given time series.

There are precise mathematical methods for teasing apart the individual factors from a given time series, but that’s a topic for another day!

The idea behind **forecasting** is to predict future values of data based on what happened before. It’s not a perfect science, because there are typically many factors outside of our control which could affect the future values substantially. The further into the future you want to forecast, the less certain you can be of your prediction.

Just look at weather reporting! Figuring out if it will rain tomorrow is not too difficult, but it’s virtually impossible to predict if it will rain exactly a month from now.

Basically, the theory behind a forecast is as follows.

- Smooth out all of the cyclical, seasonal, and noise components so that only the overall trend remains.
- Find an appropriate regression model for the trend. Simple linear regression often does the trick nicely. (Check out Introduction to Regression Analysis for more on that topic.)
- Estimate the cyclical and seasonal variations of the original data.
- Factor the cyclical and seasonal variations back into the regression model.
- Obtain estimates of error (confidence intervals). The larger the noise factor, the less certain the forecasted data will be.
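The steps above can be sketched in Python. This assumes the data has already been deseasonalized and the seasonal indexes estimated; every number below is invented for illustration:

```python
import numpy as np

# Hypothetical deseasonalized quarterly data (upward trend) and seasonal indexes
deseasonalized = np.array([1010.0, 1018.0, 1027.0, 1033.0, 1042.0, 1049.0, 1059.0, 1065.0])
seasonal_index = np.array([1.18, 1.03, 0.82, 0.97])   # one index per quarter

# Step 2: fit a simple linear trend to the deseasonalized series
t = np.arange(len(deseasonalized))
slope, intercept = np.polyfit(t, deseasonalized, 1)

# Step 4: extend the trendline four quarters ahead and factor the seasonal indexes back in
future_t = np.arange(len(deseasonalized), len(deseasonalized) + 4)
forecast = (intercept + slope * future_t) * seasonal_index[future_t % 4]
```

A complete treatment would also attach confidence intervals to each forecasted value (step 5), which grow wider the further ahead you predict.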

Most statistical software can perform a time series forecast. Even Excel has this feature — see Understanding Time Series Forecasting in Excel, for example.

Forecasting time series data allows you to make predictions of future events. While the theory and methods can be a bit complicated, the basic idea is to extend the underlying trend together with the predictable ups and downs already present in the data.


The post Analysis of Covariance (ANCOVA): An Overview appeared first on Magoosh Statistics Blog.

One of my favorite things to say to my students is “Adults look for solutions, not excuses.” Well, in statistics, we often use the error of a particular method as an excuse for why our model is different from the data. But there is a way to explain the error (a solution, if you will). Sometimes other variables explain some of the error. We can analyze the influence of these terms using a method called **analysis of covariance** or **ANCOVA**.

There are many ways to determine the effect that an independent variable has on a dependent variable. For example, linear regression can help determine the effect that multiple predictors have on an outcome. ANOVA even helps determine whether multiple treatments are truly different. Although each of those has benefits, they also have a couple of limitations.

Linear regression is best when you have continuous variables like age and income. ANOVA is great if you have categorical variables like level of education and its effect on income. But what if you wanted to know the effect of age *and* level of education on income? That’s where ANCOVA really shines.

Analysis of covariance is a statistical method that determines how much an independent variable (or treatment levels) explains an outcome and how much **covariates** explain the error. A covariate is a variable that you think has an effect on the outcome but is not responsible for the outcome in the way the independent variable is.

For example, let’s say that I want to know the effect that level of education (high school, bachelor’s, master’s, or doctorate) has on income. With four levels of the independent variable, it seems ripe for ANOVA. But age may also make a difference, since people tend to earn more money as they get older. Age would be an example of a *covariate*. I think that it has an effect, but it is not responsible for income.

The role of the covariate is to help explain away some of the error in our analysis. In essence, the original analysis determines whether the level of education is responsible for differences in income. As with any analysis, there will be some error in my measurements within the different educational groups.

Including age as a covariate, however, explains some of the error. It doesn’t explain income, but it explains why my measures may be a little off. This way, I am explaining some of the error in the overall method. Since it may affect both education and income, I include it as a covariate to explain some more of the variance on the outcome without messing with the predictor.

Of course, this assumes that age is independent of the treatment variable, level of education. What that means is that one’s age should not explain which educational group a person belongs to. In other words, the variance in age should have a low correlation with the independent variable. There may be some correlation, but it should not be a lot (i.e., less than 0.3).

Ok, you’re reading a research article and you stumble across this little gem…

*One-way ANCOVA determined that there is a significant effect of levels of education on income, F(3, 26) = 4.96, p < 0.05, after controlling for age.*

There is a lot of valuable information in this one little sentence.

The first part, *effect of level of education on income*, states the independent variable (education) and the dependent variable (income). This statement comes from the hypothesis that if different levels of education have an effect, then income will be different.

The second part contains the actual test statistic, *F(3, 26) = 4.96, p < 0.05*. The *F* means that we are using the F-ratio as our test statistic. The 3 refers to the degrees of freedom for the independent variable, which is the number of levels minus 1. The 26 refers to the error degrees of freedom, which depends on the total sample size, the number of groups, and the number of covariates. The 4.96 is the actual value of the ratio, and *p < 0.05* tells us the level of significance.

The third part, *after controlling for age*, identifies the covariate. It states that once we control for age, the effect of education level is still significant. This means that age explains some of the error even though it is a continuous variable.
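For readers who want to see the mechanics, here is a sketch of how such an F-statistic can be computed by comparing a reduced model (covariate only) against a full model (covariate plus education dummy variables). The data, coefficients, and group sizes are all invented for illustration and will not reproduce the numbers quoted above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 4 education levels, 8 people each, age as the covariate
group = np.repeat(np.arange(4), 8)                 # 0=HS, 1=BA, 2=MA, 3=PhD
age = rng.uniform(25, 60, size=group.size)
income = 30 + 6 * group + 0.8 * age + rng.normal(0, 4, size=group.size)

def sse(X, y):
    """Residual sum of squares from a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

# Reduced model: intercept + covariate; full model adds education dummies
X_reduced = np.column_stack([np.ones(group.size), age])
dummies = (group[:, None] == np.arange(1, 4)).astype(float)
X_full = np.column_stack([X_reduced, dummies])

df_num = 3                                         # education levels minus 1
df_den = group.size - X_full.shape[1]              # error degrees of freedom
F = ((sse(X_reduced, income) - sse(X_full, income)) / df_num) / (sse(X_full, income) / df_den)
```

In practice you would let statistical software handle this, but the comparison of models is what “after controlling for age” means under the hood.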

ANCOVA is a method that allows you to take into account that some of the error in your analysis is measurable. Not only that, but that source of error has an effect on the dependent variable. This is beneficial because it gives you more control over an experimental or nonexperimental study. It has advantages over other linear models because it can incorporate both categorical and continuous variables to determine an overall effect. Feel free to post any questions you may have. Happy statistics!
