The post Intro to R Statistics and Data Analysis appeared first on Magoosh Statistics Blog.

Probably one of the best things about R is that it is *free* to download and use. Not only that, thousands of programmers and statisticians are constantly tweaking it and adding to it for our benefit. This means that R is always gaining new features. R can handle just about any statistical method that you need, and if it doesn't, someone is working on a way for it to do that.

Now, the R program itself is free and available from CRAN. Personally, I download the most recent version every few months because the contributors to R are constantly updating it. Once you download the version (Mac, PC, Linux) that you need, you just open it up and you are ready to go.

Once you open it, you see a command console similar to a terminal in any operating system. That is all there is to R: no buttons or drop-down menus to mess with. This bare-bones style looks sleek, but it can be very confusing. That is why I use something called RStudio.

RStudio is a user interface that keeps everything in easy-to-find places. It is a separate program, but it runs R in the background so you don't have to juggle two open programs at once. There are a few key parts that you should be familiar with. For our example, I have loaded some of my notes from a class I took in item response theory (my second favorite set of statistical methods).

The console is where all the real work is done. This is where you type all the commands and programs that you want to run. Everything that runs in R ultimately goes through the console.

The section above the console is the script editor. You can type your code up here and run it. The benefit of the script editor is that you can easily manipulate the code that you are writing and change things without having to type commands in the console over and over again. Another benefit of the script editor is that you can add notes to yourself. For example, I have typed some of my initial notes about R directly into the script, which I save for later use.

The next part is the global environment section in the upper right-hand corner. This shows you all of the data sets and variables that you are using; without RStudio, you would have to keep track of these things yourself. It also contains a history of every command that you have run. This is a very useful feature when you are picking a statistics session back up.

In the bottom right hand corner, there is the file, plots, and packages window. The files tab allows you to look through the files in the working directory or on the rest of your computer. The packages tab is where you load up different packages that contain the specific statistical tests that you may need during the all-night stats-a-thon you’ll be doing. The plots tab is where you can view the variety of plots that you may be generating as part of your analyses.

R is an object-oriented program. This means that it works with objects that you have loaded into it or created within it. A variable is an example of an object, but so is a matrix or vector. In fact, you can load a whole data set as an object.
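For instance, here is a quick sketch of creating a few basic objects (the names and values are made up purely for illustration):

```r
# A single numeric variable
score <- 42

# A vector of values
scores <- c(88, 92, 75, 61)

# A matrix with 2 rows and 3 columns
m <- matrix(1:6, nrow = 2)

# A data frame -- a whole data set held as one object
students <- data.frame(name = c("Ana", "Ben"), score = c(88, 92))
```

Each of these lives in the global environment and can be inspected, combined, or passed to statistical functions like any other object.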

The benefit of being an object-oriented language is that you can perform operations between objects. The most useful tool that I find myself using is loading multiple data sets to analyze them for patterns. In my experience, no other program handles multiple data sets as well as R does.

R does some analyses on its own, such as linear regressions and tests like the t-test or ANOVA. But it doesn't innately run tests like mediation or moderation. To do these higher-level analyses, you have to install and load packages.
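A minimal sketch of the kind of built-in analyses R handles out of the box (the data here is simulated just for illustration):

```r
set.seed(1)
x <- rnorm(30)
y <- 2 * x + rnorm(30)

# A t-test and a linear regression, no packages required
t.test(x, y)
model <- lm(y ~ x)
summary(model)
```

`t.test()`, `lm()`, and `aov()` all ship with base R, so analyses like these work the moment you open the console.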

A package is a collection of functions that are specific to your needs. For example, when running analyses for item response theory, I use packages called 'CTT' and 'psychometric'. These packages contain the programs that are required to conduct these analyses. There are packages for almost every statistical method. The best places to find out which packages you need are sites like R Bloggers and Stack Overflow.
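Installing and loading a package takes two commands. A sketch, using the 'psychometric' package named above as the example (the install line is commented out since it downloads from the internet):

```r
# Download and install a package from CRAN (run once per machine):
# install.packages("psychometric")

# Load an installed package at the start of each session that uses it.
# 'parallel' ships with every R install, so this line always works:
library(parallel)
```

After `library()`, the package's functions are available in the console just like base R functions.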

The most intimidating part of R is the lack of preset menus and commands like those in SPSS or Stata. You have to be explicit in telling R what you want to do. This means that you will have to use commands, similar to programming in something like HTML or Python.

Sometimes the coding is simple:

V1 <- 5 * (1/5)

In R you can make extensive use of programming commands such as ifelse or the dreaded but useful loops. For example, changing a variable is a simple matter of a command like

V_new <- ifelse(V1 == 1, 5, NA)

This is interpreted as "Create a new variable, V_new, such that if V1 equals 1, set it to 5; if V1 does not equal 1, mark it as NA."
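Putting it together as a runnable snippet (the values in V1 are made up for illustration):

```r
V1 <- c(1, 2, 1, 3)

# Where V1 equals 1, use 5; everywhere else, mark NA
V_new <- ifelse(V1 == 1, 5, NA)

V_new  # 5 NA 5 NA
```

Note that `ifelse()` is vectorized: it applies the condition to every element of V1 at once, with no loop needed.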

Loops are created in similar ways.

for (year in c(2010, 2011, 2012, 2013, 2014, 2015)) {
  print(paste("The year is", year))
}

This is a simple loop that prints out each year in turn. However, just as in other programming languages, you must be careful of creating loops that never stop or don't do exactly what you want. We are not going to get deep into the programming language itself, but check this out if you want to start.

R is an amazingly expansive program that is capable of many statistical techniques, from simple correlations to structural equation modeling and more. It is a free program (so is RStudio) that you can almost infinitely customize to meet your needs. Being object-oriented, R lets you manipulate multiple data sets and objects with ease. Packages give you access to almost any statistical test that you may ever need. While the initial learning curve for the programming language is pretty steep, it is amazingly versatile once you get it. I hope this post helps give you an idea of what R is and can do. I look forward to seeing your questions below. Happy statistics!

P.S. I need to let you know that I am not some sort of red-belt in R. I know a lot about it and how to use it, but this post is only meant to serve as an introduction to what R is and what it can do. If you are really interested in what it can do and how to use it, I recommend R Bloggers and books like the R Cookbook.


The post Sampling and Sampling Distributions appeared first on Magoosh Statistics Blog.

The overall goal of statistics is to determine patterns represented in a *sample* that reflect patterns that may exist in the *population*. The sample is a group of participants that reflects the makeup of the population. To accomplish this, several types of sampling methods are used.

The gold standard of sampling techniques is the **random sample**. The goal of random sampling is to randomly select individual participants from the population. According to logic and simulated statistics, random samples limit the degree of *bias* and help to explain the error that is inherent in all statistics.

Of course, random does not mean that you arbitrarily select individuals. Instead, it takes planning. First, define the population that you want to study. Second, identify every member of the population. Third, select members in such a way that every member has an equally likely chance of being chosen.
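In R, that third step is one line: `sample()` gives every member an equal chance of selection. A sketch with a made-up roster of 500 members:

```r
set.seed(42)  # for reproducibility

# Steps 1-2: define and enumerate the population
population <- paste("member", 1:500)

# Step 3: draw 50 members, each equally likely, without replacement
my_sample <- sample(population, size = 50)

length(my_sample)         # 50
anyDuplicated(my_sample)  # 0 -- no member chosen twice
```

Because the draw is without replacement by default, no member appears in the sample more than once.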

Another type of sampling is a **stratified random sample**. This kind of sampling accounts for differences in the population that may affect your analysis.

For example, let’s say that you want a random sample of a high school that is 25% seniors, 30% juniors, 23% sophomores, and 22% freshmen. The best way to get a random sample that reflects these differences is to make sure that your sample has the same percentages of each class. So a 100-person sample would have 25 seniors, 30 juniors, 23 sophomores, and 22 freshmen randomly selected from their respective classes. This kind of sample gives a much clearer picture of the overall population.
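That 100-person stratified sample can be sketched in R like this (the school roster is simulated; the per-class quotas come from the percentages above):

```r
set.seed(7)

# Simulated roster of 1000 students with the class percentages above
school <- data.frame(
  id    = 1:1000,
  class = rep(c("senior", "junior", "sophomore", "freshman"),
              times = c(250, 300, 230, 220))
)

# How many to draw from each stratum for a 100-person sample
quota <- c(senior = 25, junior = 30, sophomore = 23, freshman = 22)

# Randomly sample within each class, then combine the strata
strata <- lapply(names(quota), function(cl) {
  roster <- school[school$class == cl, ]
  roster[sample(nrow(roster), quota[cl]), ]
})
stratified_sample <- do.call(rbind, strata)

table(stratified_sample$class)
```

The key idea is that the random draw happens separately inside each class, so every stratum is represented in exactly the right proportion.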

A **sampling distribution** represents the distribution of the statistics for a particular sample.

For example, a sampling distribution of the mean indicates the frequency with which specific values of the mean occur. This means that the frequency of values is mapped out. You can also create distributions of other statistics, like the variance. Below is an example of a sampling distribution for the mean.
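You can also simulate a sampling distribution of the mean yourself in R: draw many samples, compute each sample's mean, and look at how the values cluster. A sketch, with a made-up population:

```r
set.seed(123)

# Draw 1000 samples of size 30 from a population with mean 50,
# recording each sample's mean
sample_means <- replicate(1000, mean(rnorm(30, mean = 50, sd = 10)))

# The means cluster tightly around the population mean of 50
mean(sample_means)
hist(sample_means, main = "Sampling distribution of the mean")
```

The histogram of those 1000 means is an empirical sampling distribution, exactly the kind of object you compare against a theoretical distribution.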

The shape of the curve allows you to compare the *empirical* distribution of value to a *theoretical* distribution of values. A theoretical distribution is a distribution that is based on equations instead of empirical data. Two common theoretical distributions are Student’s t and the F-distribution.

The benefit of creating distributions is that the empirical ones can be compared to theoretical ones to identify differences or goodness of fit for the model. That is the ultimate goal of statistics: to create an empirical model that explains patterns in the data that differ significantly from the theoretical model.

Sampling involves selecting participants from a population in order to identify possible patterns that exist in the data. There are several types of sampling, but the gold standard is random sampling. Sampling distributions represent the patterns that exist in the data. These patterns are then compared to theoretical ones to determine if the patterns differ significantly from the theoretical models.

I hope that this post helps clarify sampling and sampling distributions. I look forward to seeing any questions that you have below. Happy statistics!


The post Scatter Plots and Correlation appeared first on Magoosh Statistics Blog.

A scatter plot is a map of a *bivariate distribution*. This means that it is a map of two variables (typically labeled as X and Y) that are paired with each other. You do this because you have some sort of logical reason for connecting the two variables to look for a relationship between them.

For example, let’s say that you are measuring a person’s weight and the amount of water that they drink. In this case, the variables are paired by person. Or say that you want to identify the relationship between the amount of urea in urine and the water pressure of someone’s bladder – see, statistics isn’t limited to math and the social sciences lol.

When we decide that one variable is X and the other is Y, we map the data along a Cartesian plane. If you were to do this for a collection of measurements, you would get something that looks a little like this.
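In R, a scatter plot is a single call to `plot()`. The paired bladder-pressure and urea measurements here are simulated just to show the shape of the command:

```r
set.seed(11)

# Simulated paired measurements (purely illustrative values)
pressure_cm <- runif(40, min = 5, max = 30)
urea_mg     <- 2 * pressure_cm + rnorm(40, sd = 4)

# X first, Y second, with axis labels
plot(pressure_cm, urea_mg,
     xlab = "Osmotic pressure of bladder",
     ylab = "Amount of urea",
     main = "Urea vs. bladder pressure")
```

Each point on the resulting plot is one person's pair of measurements, which is exactly what makes the distribution bivariate.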

Already, you can spot that the data seems to follow a general pattern: as the osmotic pressure of the bladder increases, the amount of urea increases as well. You have already identified a pattern known as correlation.

There are three ways that data can correlate: *positive, negative,* and *zero*.

Positive correlation is when the scatter plot takes a generally upward trend. Sometimes positive correlation is referred to as a direct correlation. Your urea plot is an example of positive correlation. It also means that the line of best fit has a positive slope.

Negative correlations, you guessed it, have a generally downward trend in the scatter plot. This kind of pattern can also be referred to as an indirect relationship. This means that the slope of the line of best fit has a negative slope. The amount of vitamin C in cabbage shows a negative relationship to head size.

Zero correlation is also referred to as no correlation. This means that the data has no discernible pattern, which usually implies that the two variables are unrelated. In this case, the value of the slope of the line of best fit would be very close to zero. The log of a brain protein shows a zero correlation with intra vein size, as you can see below.

Now, I have been using the word ‘slope’ to describe the line of best fit, but slope does not really tell you the strength of the correlation. To determine the strength of the correlation, the correlation coefficient is best.

The correlation coefficient, *r*, represents the strength and direction of the linear relationship between X and Y. The coefficient of determination, *r*^{2}, gives you an impression of how much of the variation in Y is explained by the variation in X.

*r* varies between -1.0 and +1.0, while *r*^{2} varies between 0 and 1.0. The closer *r* is to an absolute value of 1, the stronger the correlation is.
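In R, `cor()` computes *r* directly, and squaring it gives the coefficient of determination (the data here is simulated for illustration):

```r
set.seed(3)
x <- rnorm(50)
y <- 0.8 * x + rnorm(50, sd = 0.5)

r  <- cor(x, y)  # correlation coefficient, between -1 and +1
r2 <- r^2        # coefficient of determination, between 0 and 1

r
r2
```

The same `r2` value appears as "Multiple R-squared" in the output of `summary(lm(y ~ x))`, which is a handy consistency check.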

Scatter plots are a method of mapping one variable compared to another. This map allows you to see the relationship that exists between the two variables. The relationship can vary as positive, negative, or zero. The relationship is numerically represented by the correlation coefficient and the coefficient of determination.

I hope that this post has helped a bit and I look forward to seeing your questions below. Happy statistics!


The post Top 3 Most Useful Statistics Programs appeared first on Magoosh Statistics Blog.

In the social sciences (like psychology and education), **SPSS** is considered the standard statistics program. In fact, it is usually the first piece of statistical software that people learn in undergraduate or graduate statistics programs.

The benefit of SPSS is that it is fairly easy to use for non-programmers. SPSS contains a large set of drop-down menus that make it fairly user-friendly. It covers most statistical methods pretty easily, including descriptive statistics, linear regressions, analysis of variance, and time-series methods. It also has the ability to add *modules*, which are like miniature programs to run more advanced techniques like survival calculations. Furthermore, viewing a data set is pretty easy with the *variable view*.

One of the drawbacks of SPSS is that it has difficulty handling higher order statistics like structural equation modeling. While it has a friendly point-and-click interface, it has limitations with programming additional techniques like forecasting. In addition, SPSS has a subscription license that has to be renewed on a 6-month, 1-year, or 2-year interval. This can be a bit rough on student-sized budgets (although there are student pricing options).

**Stata** is a statistical program that is often used in the economic and financial sectors. Even though it is used in a smaller area, it has some distinct advantages.

Stata has a point-and-click interface, but it is friendly to programmers as well. You can easily combine statistical methods in the *command window* without having to navigate a lot of menus. It is also a little more intuitive than SPSS, with simple commands like typing *anova* and then typing some variables. In addition, Stata has a *perpetual license*, which means that you only have to purchase it once, though updates to handle advanced techniques will cost you.

One drawback of Stata is that it can be tricky to learn all the commands that you may want to use. For example, it has a nice way to do structural equation modeling, but the coding can be lengthy, tricky, and confusing. Also, while it is fairly easy to manipulate variables within Stata, it doesn’t translate variable files to other programs, like SPSS, very well.

**R** is a beast. What I mean is that it is quickly becoming the standard statistical software that companies and schools are asking potential employees and students to learn. But like SPSS and Stata, R has advantages and disadvantages.

R can handle almost any statistical technique that you want, including descriptive statistics, factor analysis, non-parametric tests, and more. R can also be used to create almost any kind of graphic that you need or want. It is also easy to install specific *packages* (miniature programs for added functionality). Best of all (especially for students), it is completely free to download, modify, and use since it is open-source software. Programmers and statisticians are always adding new packages for methods like item response theory or propensity score analysis.

The major drawback to R is simultaneously its greatest advantage: coding. If you are a non-programmer, R can be daunting to learn because it is essentially programming the program to do what you want. You have to specify *exactly* what you want or need it to do. This takes time and effort to really learn the method you want to use. Another drawback is that it can be cumbersome to use for advanced techniques like factor analysis or structural equation modeling.

SPSS, Stata, and R all have distinct advantages and disadvantages. Some are easy to learn while some are better for what you want to do but harder to learn. There are even some very specific ones like SAS and Mplus. Ultimately, to be the most versatile, you should pick the software that matches your needs and really dig into it. Happy statistics!

