What is an R Data Frame?

You are probably already familiar with R’s basic data structures: vectors, matrices, and lists. R is a statistical programming language, so it is very often used to work with data sets. These data sets consist of instances, also called observations, each of which has some variables associated with it.

As an example, take a data set of five people. Each person is an ‘instance’, and the ‘variables’ are the properties of these people, such as their name, their age, and whether they have children. How would you store such information in R? A matrix is not a good fit: the names are character strings and the ages are numeric, and a matrix can hold only one type of element.
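
To see why a matrix falls short here, consider this small sketch. The three names and ages are sample values chosen only for illustration.

```r
# Sample values, chosen only for illustration
name <- c("John", "Elon", "Frank")
age  <- c(28, 30, 21)

# cbind() builds a matrix; a matrix can hold only one type,
# so the numeric ages are silently converted to character strings
m <- cbind(name, age)

typeof(m)     # "character"
m[, "age"]    # the ages are now strings, no longer numbers
```

Because the ages have become strings, a call such as mean(m[, "age"]) no longer gives a useful result.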

A list could work, because a list can hold practically anything. You could even create a list of lists, where each sublist is a person with a name, an age, and so on. However, such nested lists are rarely used for this purpose because their structure makes them awkward to work with.

What happens, for instance, when you need all the ages? You would have to write a fair amount of R code to collect them. A better data structure exists for this job: the data frame.
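
A minimal sketch of the list-of-lists approach shows the inconvenience. The variable name people and the three sample persons are assumptions made for this illustration.

```r
# Each person is a sublist; the whole data set is a list of such sublists
people <- list(
  list(name = "John",  age = 28, child = FALSE),
  list(name = "Elon",  age = 30, child = TRUE),
  list(name = "Frank", age = 21, child = TRUE)
)

# To collect all the ages we have to visit every sublist ourselves
ages <- sapply(people, function(p) p$age)
ages
```

It works, but every question about a single variable needs its own loop-like helper of this kind.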

Data Frame

The concept of a data frame comes from the statistical software used in empirical research. The data frame is the fundamental data structure for storing typical data sets. It is similar to a matrix in that it has rows and columns. In a data frame, the columns represent the variables (the properties of the people), while the rows correspond to the observations (the people in the example above).

The major difference between a matrix and a data frame is that the latter can contain elements of different types: one column can hold characters, another numeric values, and a third logical values. This is exactly what we need to represent the information about the people in our data set.

We can have three columns of different types side by side: a character column for the names, a numeric column for the ages, and a logical column that denotes whether the person has children.

There is still one restriction on the data types: within a particular column, all elements must be of the same type. This is rarely a problem, because a column holds a single variable. For example, the age column will always hold a numeric value, whatever the observation.
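
As a quick sketch of this idea, the snippet below builds a tiny data frame with the data.frame() function (covered later in this lesson) and asks for the type of each column. The object names people and col_types are made up for this example, and stringsAsFactors = FALSE is set explicitly so that the result does not depend on the R version.

```r
# A tiny data frame with one character, one numeric and one logical column
people <- data.frame(
  name  = c("John", "Elon"),
  age   = c(28, 30),
  child = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Each column has exactly one type, but the types differ between columns
col_types <- sapply(people, class)
col_types
```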

R Data Frame

A “data frame” is a quasi-built-in data type in R. It is not a primitive type: the language definition mentions “Data frame objects”, but only in passing. It describes them as follows:
“A data frame is a list of vectors, factors, and/or matrices all having the same length (number of rows in the case of matrices). In addition, a data frame generally has a names attribute labeling the variables and a row.names attribute for labeling the cases.”

Vectors and lists are basic types in R, whereas factors, matrices, and data frames are built on top of them with the help of attributes. In practice, though, the data frame is central to working in R.
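
You can check this from the console. The sketch below uses made-up objects named people and m to show that a data frame is stored as a list and a matrix as a vector, each with a few extra attributes.

```r
people <- data.frame(name = c("John", "Elon"), age = c(28, 30))

typeof(people)              # "list": the underlying storage type
class(people)               # "data.frame"
names(attributes(people))   # includes "names", "row.names" and "class"

m <- matrix(1:6, nrow = 2)
typeof(m)                   # "integer": a plain vector underneath
attributes(m)               # only a dim attribute (2 rows, 3 columns)
```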

Creating a Data Frame

Now for the practical part: creating data frames. In most cases you do not create a data frame by hand. Instead, you import it from another source, such as a CSV file or a relational database that you query with SQL. Data can also be imported from software packages such as Excel or SPSS.
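
As a minimal sketch of the CSV route, the snippet below first writes a small temporary file and then reads it back with read.csv(). The file contents and the names csv_path and people are assumptions made for this illustration; in real work the file would already exist.

```r
# Write a small CSV file to a temporary location (illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("name,age,child",
    "John,28,FALSE",
    "Elon,30,TRUE"),
  csv_path
)

# read.csv() returns a data frame; the first line is used for the column names
people <- read.csv(csv_path, stringsAsFactors = FALSE)
str(people)

# Clean up the temporary file
unlink(csv_path)
```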

However, R also provides a way to create data frames manually: the data.frame() function.

As an example, we create a data frame with five observations and three variables. To do so, we pass the data.frame() function three vectors, each of length five. The vectors that we pass become the columns of the data frame.

We create the vectors first and name them name, age, and child.

> name <- c("John", "Elon", "Frank", "Julia", "Pete")
> age <- c(28, 30, 21, 39, 35)
> child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

Next comes the very simple step of calling the data.frame() function.

> df <- data.frame(name, age, child)

The printout of the data frame shows clearly that we are dealing with a data set.

   name age child
1  John  28 FALSE
2  Elon  30  TRUE
3 Frank  21  TRUE
4 Julia  39 FALSE
5  Pete  35  TRUE

Notice that the data.frame() function inherits the column names from the names of the variables you passed to it. To specify different names, you can use the same techniques as for vectors and lists: either apply the names() function afterwards, or use equals signs inside the data.frame() call to name the columns right away.

> df <- data.frame(Name = name, Age = age, Child = child)
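
The names() route can be sketched as follows, reusing the vectors from this lesson. It should give the same column names as the call with equals signs.

```r
name  <- c("John", "Elon", "Frank", "Julia", "Pete")
age   <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

df <- data.frame(name, age, child)
names(df)                                # "name" "age" "child"

# Rename all columns in one go after the data frame has been created
names(df) <- c("Name", "Age", "Child")
names(df)                                # "Name" "Age" "Child"
```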

As with matrices, you can also name the rows of a data frame, but that is rarely a good idea in practice, so I won’t go into the details.

Let us now look at the structure of the data frame with the str() function.

> str(df)
'data.frame':   5 obs. of  3 variables:
 $ Name : Factor w/ 5 levels "Elon","Frank",..: 3 1 2 4 5
 $ Age  : num  28 30 21 39 35
 $ Child: logi  FALSE TRUE TRUE FALSE TRUE

Looking at this structure, you can observe two things.

First, the printout looks similar to that of a list. That is because, under the hood, a data frame is a list. In this case it is a list with three elements, one for each column of the data frame. Each element is a vector of length five, equal to the number of observations.
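
Because a data frame is a list, the usual list tools work on it. The sketch below rebuilds the example data frame and uses them to pull out all the ages, the task that was so awkward with nested lists.

```r
df <- data.frame(
  Name  = c("John", "Elon", "Frank", "Julia", "Pete"),
  Age   = c(28, 30, 21, 39, 35),
  Child = c(FALSE, TRUE, TRUE, FALSE, TRUE)
)

is.list(df)      # TRUE
length(df)       # 3: one list element per column

# Both forms return the Age column as an ordinary numeric vector
df$Age
df[["Age"]]

# Which makes summaries a one-liner
mean(df$Age)
```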

In an ordinary list, the vectors need not have the same length, whereas in a data frame this is a strict requirement. Trying to create a data frame from vectors of different lengths simply gives you an error:

> data.frame(name[-1], age, child)
Error in data.frame(name[-1], age, child) :
  arguments imply differing number of rows: 4, 5

The second thing to observe is that the Name column, which you would expect to be a character vector, is actually a factor. This is because, in R versions before 4.0.0, data.frame() stores strings as factors by default (since R 4.0.0 the default is to keep them as characters). To switch this behaviour off, set the stringsAsFactors argument of the data.frame() function to FALSE.

> data.frame(name, age, child, stringsAsFactors = FALSE)

Now the name column actually contains characters.
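
To compare the two behaviours side by side, this sketch sets stringsAsFactors explicitly in both calls, so the result does not depend on the default of your R version. The names df_chr and df_fct are made up for the illustration.

```r
name  <- c("John", "Elon", "Frank", "Julia", "Pete")
age   <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# Strings kept as plain character vectors
df_chr <- data.frame(name, age, child, stringsAsFactors = FALSE)
class(df_chr$name)          # "character"

# Strings converted to a factor
df_fct <- data.frame(name, age, child, stringsAsFactors = TRUE)
class(df_fct$name)          # "factor"
levels(df_fct$name)         # the distinct names, sorted alphabetically

# An existing factor column can be converted back to characters
as.character(df_fct$name)
```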

With this, you have a basic idea of what data frames are and how to create them in R. The data frame is one of the most useful and powerful data structures in the language.

Whenever you work with data frames, remember that they are lists under the hood. This is what gives a data frame the ability to store vectors of different types. On top of that, R has additional functionality built in to easily extend and subset data frames.
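
As a brief taste of that functionality, here is a sketch of a few common operations on the example data frame. The derived column AgeNextYear and the extra person Anna are invented purely for illustration.

```r
df <- data.frame(
  Name  = c("John", "Elon", "Frank", "Julia", "Pete"),
  Age   = c(28, 30, 21, 39, 35),
  Child = c(FALSE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Subset rows: keep only the people older than 30
older <- df[df$Age > 30, ]

# Subset columns: keep only the names and ages
names_and_ages <- df[, c("Name", "Age")]

# Extend with a new column (a made-up derived variable)
df$AgeNextYear <- df$Age + 1

# Extend with a new row; the columns must match the existing ones
new_person <- data.frame(Name = "Anna", Age = 25, Child = FALSE,
                         AgeNextYear = 26, stringsAsFactors = FALSE)
df <- rbind(df, new_person)
```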