Machine Learning with R

In layman's terms, machine learning is predicting the future based on the past (Hal Daume III). It is the science of getting computers to learn without being explicitly programmed, and it combines concepts from computer science and statistics. A model is learned from past data and is then used to make predictions about new, unseen data.

More precisely, the field studies and builds algorithms that learn from data. Rather than following hand-written rules, these algorithms construct a mathematical model from input data and use it to make data-driven predictions or decisions.

Machine Learning Algorithms

A large number of machine learning algorithms have been developed, and the right one depends on the nature of the problem and the requirements of its solution. Each operates by building a model from a training set of example observations and then using that model to make data-driven predictions or decisions, rather than following strictly static program instructions.
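As a minimal sketch of that idea, the snippet below fits a straight-line model to R's built-in cars dataset (speed and stopping distance) and uses it to predict an input that does not appear in the data. The dataset and the choice of lm() are only illustrative assumptions, not something this article prescribes.

```r
# Built-in dataset: 50 observations of speed (mph) and stopping distance (ft)
training_set <- cars

# Learn a model from the examples: stopping distance as a linear function of speed
fit <- lm(dist ~ speed, data = training_set)

# Use the learned model on a speed that is not in the training data
new_input <- data.frame(speed = 21)
predicted_dist <- predict(fit, newdata = new_input)
print(predicted_dist)
```

Nothing in the code says how distance depends on speed; the relationship is estimated from the examples.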

Some common and important algorithms to explore are decision trees, logistic regression, clustering algorithms, principal component analysis and ordinary least squares regression. Each of these has its pros and cons, and some work better for a given problem than others, so choosing an algorithm that suits the problem is very important for getting accurate results. Apart from these, there are many more algorithms that are useful in their own way.

Introduction to R

R is a free software environment for statistical computing and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files. It runs on Windows, macOS and UNIX platforms and can be downloaded for free from the official website.

Let’s have a look at some of the basics of R. The simplest way to work in R is to type commands into its console, which executes them immediately. For example, to read data stored in CSV (Comma Separated Values) format, we use the following command:

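data1 <- read.csv("E:/Courses/ML/Data.csv")

Here, the file ‘Data.csv’ is read and stored in a data frame named ‘data1’. The read.csv function reads a data file from the specified path, which must be given as a quoted string; on Windows, write the path with forward slashes or doubled backslashes. The data can also be read from other sources, such as a URL.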
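Because the path above only exists on the author's machine, here is a small self-contained sketch of the same step. It writes a made-up table to a temporary file and reads it back; the column names and the example URL are placeholders, not real data.

```r
# Write a small, made-up table to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(90, 75, 82)), csv_path, row.names = FALSE)

# Read it back; the first row is treated as the header by default
data1 <- read.csv(csv_path)
print(data1)

# read.csv() also accepts a URL string in place of a file path, e.g.
# data2 <- read.csv("https://example.com/data.csv")  # placeholder URL, not a real dataset
```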

We can get an overview of our data using the str function:

str(data1)

The str function reports the number of observations and the number of variables, then lists each variable with its data type and its first few values.

Another function that gives an overview of the data is summary(). For each numeric variable it gives the minimum, first quartile, median, mean, third quartile and maximum. Either function can be used, depending on what you want to know about the input data.
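To try both functions without a data file, the sketch below uses iris, a dataset that ships with R. It is only a stand-in for ‘data1’.

```r
# iris ships with R: 150 flowers, 4 numeric measurements, 1 factor column
str(iris)

# Per-column summaries: quartiles and mean for numeric columns, counts for factors
summary(iris)

# summary() also works on a single column
s <- summary(iris$Sepal.Length)
print(s)
```

For a factor column such as Species, summary() reports how many rows fall in each level instead of quartiles.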

Applying Machine Learning Algorithms

Now that you have learned to read data and inspect its composition, it’s time to see what we can learn and predict from it. As we know, there are many algorithms that can be used in machine learning, and which one to choose depends on the nature of the problem.

A machine learning algorithm can be categorized as supervised or unsupervised. In supervised learning, the computer is given sample inputs that are labelled with their desired outputs, and it learns a model that maps inputs to outputs. To check and improve accuracy, the model is applied to a part of the data that was held back, its predictions are compared with the known labels, and the model is adjusted to minimize the resulting errors.
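The hold-out idea can be sketched in a few lines of base R. The built-in mtcars dataset, the 75/25 split and the linear model are all assumptions chosen for illustration.

```r
set.seed(1)  # make the random split reproducible

# Labelled data: each row of mtcars has inputs (wt, hp) and a known output (mpg)
n <- nrow(mtcars)
train_rows <- sample(n, size = round(0.75 * n))
train_set <- mtcars[train_rows, ]
test_set  <- mtcars[-train_rows, ]

# Learn from the training rows only
fit <- lm(mpg ~ wt + hp, data = train_set)

# Measure the error on rows the model did not see during fitting
pred <- predict(fit, newdata = test_set)
rmse <- sqrt(mean((test_set$mpg - pred)^2))
print(rmse)
```

A lower error on the held-back rows suggests the model will do better on new data, which is why that part of the data is kept away from the fitting step.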

In unsupervised learning, the data is not labelled, so the algorithm is left on its own to find structure in it. These methods are usually more complex and require a good knowledge of machine learning to construct accurate models. We will focus on supervised algorithms in this article for ease of understanding.
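For contrast, here is a brief sketch of an unsupervised method, k-means clustering, on the built-in iris measurements. The dataset and the choice of three clusters are assumptions for illustration only.

```r
set.seed(1)  # k-means starts from random centres

# Drop the Species label so the algorithm sees only the measurements
features <- iris[, 1:4]

# Ask for 3 groups; the algorithm is never told which flower is which species
clusters <- kmeans(features, centers = 3, nstart = 10)

# Compare the discovered groups with the held-back labels, for inspection only
print(table(cluster = clusters$cluster, species = iris$Species))
```

The cluster numbers are arbitrary, so the table shows only how well the groups line up with the species, not a prediction of species names.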

An Example: Classifying wine type

Let us take an example to understand how machine learning in R takes place. Suppose we have to predict the type of a wine from three features: price, smell and taste, each rated out of 100. There are 3 types of wine. After reading the data and getting an overview of it, we are going to use the Naive Bayes algorithm to predict the type.

First, we assign the features to x and the class to y, assuming the class column ‘type’ is the fourth column of the data frame:

x <- wine[, -4]
y <- wine$type
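The wine data itself is not included with this article. To run the two lines above, a made-up stand-in with the same shape can be generated as follows. Every value here is random, so the features carry no real information about the type and a model trained on them will not be accurate.

```r
set.seed(42)  # reproducible made-up data

# Hypothetical stand-in for the wine data: three features scored 0-100
# and a three-level class label in the fourth column
wine <- data.frame(
  price = round(runif(90, 0, 100)),
  smell = round(runif(90, 0, 100)),
  taste = round(runif(90, 0, 100)),
  type  = factor(rep(c("A", "B", "C"), each = 30))
)

x <- wine[, -4]   # every column except the fourth (the features)
y <- wine$type    # the class label

str(x)
table(y)
```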

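Now we train a Naive Bayes model with 10-fold cross-validation, using the train function from the ‘caret’ package (its ‘nb’ method also needs the ‘klaR’ package to be installed):

library(caret)
model <- train(x, y, method = "nb", trControl = trainControl(method = "cv", number = 10))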

We can print a summary of the model simply by typing “model”. We then use the predict function to predict the class of each wine:

predict(model$finalModel, x)

We can build a confusion matrix to see how many wines were misclassified:

table(predict(model$finalModel, x)$class, y)
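A table of this kind can be reduced to a single accuracy figure. The sketch below uses six made-up labels rather than the wine model's output, so it runs without any extra packages.

```r
# Made-up true and predicted labels for six wines
actual    <- factor(c("A", "A", "B", "B", "C", "C"), levels = c("A", "B", "C"))
predicted <- factor(c("A", "B", "B", "B", "C", "A"), levels = c("A", "B", "C"))

# Rows are predictions, columns are the true classes
confusion <- table(predicted, actual)
print(confusion)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

Here four of the six predictions match, so the accuracy is 4/6. Off-diagonal cells show which classes get confused with each other.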

This simple example predicts the wine type from the features price, smell and taste using the Naive Bayes classification algorithm, which is based on the principles of probability. It first calculates the prior probability of each class from the training values. It then calculates the likelihood of the observed features under each class and multiplies it by the prior to get a value proportional to the posterior probability. The class with the highest posterior is the predicted wine type.
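The prior, likelihood and posterior steps can be worked through by hand. This is a simplified sketch, not what the caret model does internally: it assumes a single feature, two classes, a normal (Gaussian) likelihood for each class, and made-up scores.

```r
# Tiny made-up training set: one feature (taste score) and two wine types
train_df <- data.frame(
  taste = c(80, 85, 90, 40, 45, 50),
  type  = factor(c("A", "A", "A", "B", "B", "B"))
)
new_taste <- 70  # the wine we want to classify

# Prior: share of each class in the training data
prior <- as.numeric(table(train_df$type)) / nrow(train_df)

# Likelihood: density of the new value under a normal curve fitted per class
class_mean <- as.numeric(tapply(train_df$taste, train_df$type, mean))
class_sd   <- as.numeric(tapply(train_df$taste, train_df$type, sd))
likelihood <- dnorm(new_taste, mean = class_mean, sd = class_sd)

# Posterior is proportional to prior * likelihood; normalise so it sums to 1
unnormalised <- prior * likelihood
posterior <- unnormalised / sum(unnormalised)
names(posterior) <- levels(train_df$type)

print(posterior)
print(names(which.max(posterior)))
```

With several features, Naive Bayes multiplies the per-feature likelihoods together, treating the features as independent given the class. That independence assumption is where the name "naive" comes from.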

Logistic Regression in R

Another classification algorithm is logistic regression. It is commonly used to predict a binary outcome (like true/false or yes/no) from a set of independent variables, and its coefficients are estimated by maximum likelihood. Let us look at an example to understand this algorithm better.
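Logistic regression produces a probability by passing a linear combination of the inputs through the logistic (sigmoid) function. The sketch below shows that mapping on its own; the intercept and slope are hypothetical values picked for illustration, not fitted coefficients.

```r
# Logistic (sigmoid) function: maps any real number to a value between 0 and 1
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, chosen only for illustration
intercept <- -4
slope <- 0.1  # change in log-odds per unit of the predictor

predictor <- c(20, 40, 60)
probability <- sigmoid(intercept + slope * predictor)
print(round(probability, 3))  # 0.119 0.500 0.881

# Base R provides the same function as plogis()
print(all.equal(probability, plogis(intercept + slope * predictor)))
```

A predictor value of 40 gives a linear score of exactly 0, which the sigmoid maps to a probability of 0.5. That is the natural cut-off between the two outcomes.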

Suppose we have data on 500 customers and their salaries. Based on this salary data, we have to find out whether a customer would buy a car. We can use logistic regression for this.

The first step is to load the data:

train <- read.csv("E:/Courses/ML/Salary.csv")

Then we need to split the given data into training and validation sets. For this we use the ‘caTools’ package:

install.packages("caTools")
library(caTools)
set.seed(88)
split <- sample.split(train$Recommended, SplitRatio = 0.75)

Now we get the training and test data as follows:

salarytrain <- subset(train, split == TRUE)
salarytest <- subset(train, split == FALSE)

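Now we construct our logistic regression model. The formula Recommended ~ . - ID uses every column except ID to predict the outcome column Recommended:

model <- glm(Recommended ~ . - ID, data = salarytrain, family = binomial)
summary(model)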

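Now we predict whether each customer in the test data would buy a car. The model returns a probability for each customer, which we convert to a yes/no prediction with a 0.5 threshold and compare with the actual outcome:

predicted <- predict(model, newdata = salarytest, type = "response")
table(salarytest$Recommended, predicted > 0.5)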

This example shows how logistic regression can be used to predict a binary outcome. In this case, we predicted whether a customer would buy a car based on their salary. The same approach extends to more complex cases with more predictors.
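Since the salary file is not available here, the whole workflow can be tried on simulated data using only base R. Everything in this sketch is an assumption for illustration: the column names, the salary range, and the "true" relationship used to generate the outcome. The split is a plain random one using sample() rather than caTools.

```r
set.seed(88)

# Made-up data: 500 customers, salary in thousands, and whether they bought a car
n <- 500
salary <- round(runif(n, 20, 120))
buy_prob <- plogis(-6 + 0.08 * salary)  # assumed true relationship
bought <- rbinom(n, size = 1, prob = buy_prob)
customers <- data.frame(salary = salary, bought = bought)

# 75/25 split into training and test rows
train_rows <- sample(n, size = 0.75 * n)
train_set <- customers[train_rows, ]
test_set  <- customers[-train_rows, ]

# Fit on the training rows, then score the test rows
fit <- glm(bought ~ salary, data = train_set, family = binomial)
test_prob <- predict(fit, newdata = test_set, type = "response")

# Turn probabilities into yes/no decisions at a 0.5 cut-off
test_pred <- as.integer(test_prob > 0.5)
print(table(actual = test_set$bought, predicted = test_pred))
print(mean(test_pred == test_set$bought))  # share of correct predictions
```

Raising or lowering the 0.5 cut-off trades one kind of error for the other, so the right threshold depends on which mistake is more costly.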

Conclusion

In this article, we first learned about machine learning and some of its algorithms. Then we learned a little about R and how to handle data in it. Finally, we saw how machine learning algorithms can be used in R with the help of two examples, which covered the basics of the Naive Bayes and logistic regression algorithms.

Machine learning is a vast and rapidly growing field with many applications, including financial market analysis, speech recognition, search engine optimization, natural language processing, bioinformatics and time series forecasting.

Apart from R, there are many other software tools that can be used to create machine learning models. However, R is one of the best for learning machine learning: it is easy to use, powerful and versatile.
