R Data Analysis Basics

Data science is a relatively new field of study that uses data-driven methods, processes and systems to extract insights and draw inferences from data. Its subdomains include machine learning, classification, clustering, data mining and visualization. It combines concepts from mathematics, statistics, computer science and data analytics, and covers a wide range of topics from these fields along with their applications. The term “data science” has become popular only recently, and a lot of development is ongoing in the field.

Introduction to R

R is free software that provides an environment for data analysis and statistical computing. It consists of a language, a debugger, a run-time environment with graphics, access to certain system functions, and the ability to run programs from script files. It is compatible with Windows, macOS and UNIX platforms. It can be downloaded for free from the Comprehensive R Archive Network (CRAN).

R can be used to implement a wide variety of techniques such as linear and non-linear modelling, time-series analysis, classification, clustering and other data science techniques. It is commonly used for data science because it is easy to use and flexible: most computational tasks can be handled with easily accessible packages.

An R working environment typically consists of four parts: Script, Console, Environment and Graphical output. The Script is where you write the code. The Console is where the output is displayed once the code is run. Code can also be entered directly into the Console, but code entered this way is not saved in a script and is harder to trace later. The Environment displays the elements that have been added, such as data sets, variables and functions. The Graphical output displays the output graphs and can also be used to display R documentation.

Objects and Data Types in R

An object is a data structure that has some attributes and methods that act on those attributes. In R, objects can be vectors, data frames, variables, matrices etc. There are five basic classes of objects in R, namely Character, Numeric, Integer, Complex and Logical.
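
As a quick check, the class() function reports the class of any object. The short sketch below creates one value of each of the five classes; the values themselves are arbitrary and chosen only for illustration.

```r
# One illustrative value for each of the five basic classes
values <- list("football", 4.11, 5L, 2+3i, TRUE)

# class() reports the class of each value
classes <- sapply(values, class)
print(classes)
# "character" "numeric" "integer" "complex" "logical"
```

Note the L suffix in 5L: a whole number typed without it is stored as numeric, not integer.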

The data types in R include List, Vector, Matrix, Data Frame etc. Let us understand each of them, as this will be very helpful while programming in R.

A vector is a collection of objects and is created using the c() command. Usually, a vector contains objects of the same class. If objects of different classes are put into a vector, they are all converted into one class. Let us look at some examples. A vector of integer class with elements 5, 2, 6, 8, 4 would be declared in the following way (the L suffix marks each value as an integer; without it, R stores the numbers as numeric class) –

ints <- c(5L, 2L, 6L, 8L, 4L)

But if we declare a vector like this –

vec <- c("football", "cricket", 10, "tennis", 4.11)

Then, all the objects will be converted to character class.
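
The conversion follows a fixed order: logical values give way to integers, integers to numeric values, and everything gives way to character. The sketch below illustrates this with a few made-up vectors; the variable names are ours and not part of any data set.

```r
# Mixing classes in c() converts every element to the most general class
logical_and_integer <- c(TRUE, 2L)        # TRUE becomes 1L
integer_and_numeric <- c(2L, 3.5)         # 2L becomes 2
numeric_and_text    <- c(3.5, "tennis")   # 3.5 becomes "3.5"

class(logical_and_integer)   # "integer"
class(integer_and_numeric)   # "numeric"
class(numeric_and_text)      # "character"
```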

Data frames are used to store tabular data. A data frame is basically a collection of vectors of equal length, and the different vectors can be of different classes. Let us take a look at an example to see how data frames work. Suppose we have name and age data of individuals. This data can be stored in a data frame in the following way –

agedata <- data.frame(name = c("Ron", "Chang", "May", "Raj"), age = c(25, 28, 24, 19))

Here, agedata is the name of our data frame. The name vector consists of the names of the individuals and the age vector stores their ages. The name vector is of character class (in R versions before 4.0, data.frame() converts character vectors to factors by default) whereas age is of numeric class.
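
Once a data frame exists, its columns can be accessed with the $ operator and its rows can be filtered with a condition. The following is a minimal sketch on a small name-and-age table; the data frame name people and the age threshold are chosen only for illustration.

```r
# A small illustrative data frame (stringsAsFactors = FALSE keeps names as
# character in every R version)
people <- data.frame(
  name = c("Ron", "Chang", "May", "Raj"),
  age  = c(25, 28, 24, 19),
  stringsAsFactors = FALSE
)

nrow(people)                        # number of rows: 4
ncol(people)                        # number of columns: 2
mean_age <- mean(people$age)        # $ extracts one column as a vector
older <- people[people$age > 24, ]  # keep only the rows where age is above 24

print(mean_age)    # 24
print(older$name)  # "Ron" "Chang"
```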

These two are the most commonly used data types. Other data types include lists and matrices. A list is similar to a vector except that it can hold objects of different classes. A matrix consists of rows and columns of objects of the same class.
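
To make the difference concrete, here is a short sketch of a list and a matrix. The names person and m and the values in them are made up for illustration.

```r
# A list can hold objects of different classes side by side
person <- list(name = "Ron", age = 25, member = TRUE)
class(person$name)    # "character"
class(person$member)  # "logical"

# A matrix holds objects of one class arranged in rows and columns;
# by default the values fill the matrix column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)    # 2 3
m[1, 2]   # 3 (row 1, column 2)
m[2, 3]   # 6 (row 2, column 3)
```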

Installing Packages in R

R packages are collections of functions and data sets developed by the community. Typically, a package includes code, data sets, and documentation for the package and the functions inside it. Packages extend the functionality of R, so more data handling tasks can be done with their help.

To install a package, simply type and run (in the Console or a Script) –

install.packages("package name")

This installs a package from the official CRAN repository. The package name can be that of any package in the CRAN repository, which held almost 12,000 packages at the time of writing. Some very useful packages include ggplot2, readr, dplyr, plyr, gbm, randomForest, swirl etc.

Different packages have different uses. For example, ggplot2 can be used to create simple or complex graphs, randomForest and gbm can be used for regression and classification, swirl lets you use the R console as an interactive learning environment, and dplyr is used for data manipulation.
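
An installed package still has to be loaded with library() before its functions can be used. One possible pattern is to install a package only when it is missing and then load it. The helper load_package below is a hypothetical name of our own, not a built-in R function, and the sketch calls it on the stats package, which ships with R, so that nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from CRAN; needs an internet connection
  }
  library(pkg, character.only = TRUE)
}

# "stats" is always available, so the install step is skipped here
load_package("stats")
```

Replacing "stats" with a name such as "ggplot2" would install that package on first use and load it on every later call.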

Control Structures in R

Control structures are portions of program code that contain statements and, depending on the circumstances, decide whether and how many times those statements are executed. The most common control structures used in R are if-else, for and while. Let us look at each to understand their uses and differences.

The if-else structure is used to test a condition. For a given condition, there are two sets of statements. If the condition is satisfied, the first set of statements gets executed. Otherwise, the other set gets executed. The syntax is –

if (condition) {
     ## set 1 of statements: run when the condition is TRUE
} else {
     ## set 2 of statements: run when the condition is FALSE
}
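
As a small illustration, the sketch below labels a person by age. The variable names and the threshold of 21 are arbitrary choices for the example.

```r
age <- 19

# The condition age >= 21 is FALSE here, so the else branch runs
if (age >= 21) {
  label <- "21 or older"
} else {
  label <- "under 21"
}

print(label)  # "under 21"
```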

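The for loop is used when we want to run a set of statements a given number of times, once for each element of a sequence. The syntax is –

for (variable in sequence) {
      ## set of statements, run once for each element of the sequence
}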
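
For instance, a for loop can add up the elements of a vector one at a time. This is only a sketch with made-up values; in practice the built-in sum() function does the same job in one call.

```r
ages <- c(25, 28, 24, 19)
total <- 0

# The loop body runs four times, once for each element of ages
for (a in ages) {
  total <- total + a
}

print(total)  # 96
```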

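The while loop is used when we want to keep running a set of statements as long as a given condition is satisfied. The condition is checked before every repetition, and the loop ends once it is no longer satisfied. The syntax is –

while (condition) {
    ## set of statements, repeated while the condition is TRUE
}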
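
As an illustration, the sketch below keeps doubling a number until it reaches 100 or more and counts how many doublings that takes. The starting value and the limit are arbitrary.

```r
n <- 1
doublings <- 0

# The condition is checked before each repetition; the loop stops once n >= 100
while (n < 100) {
  n <- n * 2
  doublings <- doublings + 1
}

print(n)          # 128
print(doublings)  # 7
```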

Data Analytics using R

Having learned the basics of programming in R, let us now see how to use R programming on data. We will be reading and understanding the data first. Then we will see what we can learn from our data and try to make predictions as well.

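Let us start off by reading the data file into R. To read a data file in Comma Separated Value (CSV) format, the following command is used (the file path is given as a quoted string; on Windows, write it with forward slashes or doubled backslashes) –

data1 <- read.csv("E:/Courses/ML/Data.csv")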

Here, the ‘Data.csv’ file was read and stored as ‘data1’. The read.csv function reads the data file from the specified directory. Data can also be read from other sources, such as a URL on the web.
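
If no CSV file is at hand, the round trip can be tried with a temporary file. The sketch below writes a small made-up table to disk with write.csv() and reads it back with read.csv(); the names sample_data, csv_path and data2 are ours.

```r
# A small made-up table to write out
sample_data <- data.frame(
  name = c("Ron", "Chang"),
  age  = c(25, 28),
  stringsAsFactors = FALSE
)

# tempfile() returns a path in the session's temporary directory
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_data, csv_path, row.names = FALSE)

# Read the file back; the first line of the file supplies the column names
data2 <- read.csv(csv_path, stringsAsFactors = FALSE)
print(data2)
```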

Now let us see how we can get an overview of our data. There are two commonly used commands – str() and summary(). Each gives an overview of the data but in different ways. Let us look at each to understand better.

The str() function gives an overview of the structure of the data. It gives the number of observations and the number of variables, along with details of each variable, including its data type and its first few observations.

The summary() function gives, for each numeric variable, the minimum value, first quartile, median, mean, third quartile and maximum value of the data.
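
Both functions can be tried on any data frame. The sketch below applies them to a small made-up table of names and ages; the data frame name people is only for illustration.

```r
people <- data.frame(
  name = c("Ron", "Chang", "May", "Raj"),
  age  = c(25, 28, 24, 19),
  stringsAsFactors = FALSE
)

# Structure: 4 observations of 2 variables, with the type of each variable
str(people)

# Six-number summary of the numeric age column
age_summary <- summary(people$age)
print(age_summary)
# Min. 19, 1st Qu. 22.75, Median 24.5, Mean 24, 3rd Qu. 25.75, Max. 28
```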

Let us take an example to understand how machine learning in R takes place. Suppose we have to predict the type of a wine based on the features price, smell and taste. Each of the three features is rated on a scale of 100, and there are 3 types of wines. After reading and getting an overview of the data, we are going to use the Naive Bayes algorithm to predict the wine type.

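First, we assign the features to x and the class to y using the code as follows (the data is assumed to be stored in a data frame named wine, with the type in its fourth column) –

x <- wine[, -4]
y <- wine$type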

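Now we create a model with 10-fold cross-validation. The train() and trainControl() functions come from the caret package, and the 'nb' method additionally requires the klaR package –

library(caret)
model <- train(x, y, 'nb', trControl = trainControl(method = 'cv', number = 10))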

We can get a summary of the model simply by typing “model”. We then use the predict function to predict the result as follows –

predict(model$finalModel, x)

We can get the confusion matrix (the classification error matrix) to know how many errors were made. The code for it will be –

table(predict(model$finalModel,x)$class,y)
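
To see how such a matrix is read, the table() function can be tried on two short vectors. The predicted and actual labels below are invented for illustration and are not results from the wine model.

```r
# Invented labels: what a model predicted versus the true wine types
actual    <- c("A", "A", "B", "B", "C", "C")
predicted <- c("A", "B", "B", "B", "C", "A")

# Rows are predicted types, columns are actual types
confusion <- table(predicted, actual)
print(confusion)

# Correct predictions lie on the diagonal of the matrix
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)  # 4 correct out of 6
```

Every count off the diagonal is an error: here one wine of type A was predicted as B, and one wine of type C was predicted as A.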

In this section, we looked at a simple example that predicts the wine type based on the features price, smell and taste, using the Naive Bayes classification algorithm. The Naive Bayes model is based on the principles of probability. It first calculates the prior probability of each wine type from the training values. It then calculates the likelihood of the observed features for each type and multiplies it by the prior probability to get a value proportional to the posterior probability. The wine type with the highest posterior probability is the prediction.
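
The calculation can be sketched by hand in base R. The example below is only an illustration under stated assumptions: the six wines in wine_toy are made up, there are two types instead of three, and each feature is assumed to follow a normal distribution within a type, which is one common choice and not necessarily what the fitted model above uses.

```r
# Made-up training data: ratings out of 100 for six wines of two types
wine_toy <- data.frame(
  price = c(80, 85, 90, 30, 35, 40),
  smell = c(60, 65, 70, 40, 45, 50),
  taste = c(70, 75, 80, 40, 45, 50),
  type  = c("A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# A new wine whose type we want to predict
new_wine <- c(price = 82, smell = 66, taste = 72)
features <- names(new_wine)

# For each type: prior probability times the likelihood of the new wine
score <- sapply(unique(wine_toy$type), function(wine_type) {
  rows  <- wine_toy[wine_toy$type == wine_type, ]
  prior <- nrow(rows) / nrow(wine_toy)
  # "Naive" assumption: features are independent, so likelihoods multiply
  likelihood <- prod(sapply(features, function(f) {
    dnorm(new_wine[[f]], mean = mean(rows[[f]]), sd = sd(rows[[f]]))
  }))
  prior * likelihood
})

# Dividing by the total turns the scores into posterior probabilities
posterior <- score / sum(score)
predicted <- names(which.max(posterior))

print(round(posterior, 3))
print(predicted)  # "A"
```

The new wine's ratings sit close to the averages of type A and far from those of type B, so almost all of the posterior probability goes to type A.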

Conclusion

In this article, we learned how to use R programming for data science. We learned various basics of programming in R and also looked at an example where we used machine learning to make a prediction. This article covers some basics and also provides insights into the vast topic that is data science.
