Machine learning involves processes used in data mining and predictive modeling. These processes search the data for patterns and adjust the program’s actions accordingly. When you shop online, have you noticed ads tailored to your interests and past purchases? The reason is machine learning: recommendation engines use it to personalize online ad delivery for each customer. Beyond this type of marketing, machine learning is widely used in network security, threat detection, spam filtering, fraud detection, predictive maintenance, and building news feeds.
For instance, Facebook’s News Feed is personalized for each user with the help of machine learning. When a user “likes” a particular friend’s posts or frequently stops scrolling to read them, the News Feed begins to display more of that friend’s recent activity. If you’re curious how this works, the software uses statistical analysis and predictive analytics to identify patterns in the user’s data and then uses those patterns to populate the News Feed. If the user later stops reading or liking that friend’s posts, the new pattern is recorded in the dataset and the News Feed adjusts accordingly.
Types of Machine Learning Algorithms
Supervised Learning Algorithm
How it works: A supervised algorithm has a target (or outcome, or dependent) variable that is predicted from a given set of predictors (independent variables). Using these variables, the algorithm learns a function that maps inputs to the desired outputs. Training continues until the model reaches the desired level of accuracy on the training data. Examples of supervised learning algorithms: Decision Tree, Random Forest, Regression, Logistic Regression, KNN, etc.
Unsupervised Learning Algorithm
How it works: In unsupervised learning, there is no target or outcome variable to predict. The usual purpose is to cluster a population into groups, for example when segmenting it for targeted interventions. Examples of unsupervised learning algorithms: K-means, Apriori algorithm.
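As an illustration of clustering without a target variable, here is a minimal sketch of K-means on one-dimensional data. The function name `kmeans_1d`, the sample values, and the starting centers are all hypothetical and chosen only to keep the example small. Real implementations usually pick starting centers more carefully.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning points to the nearest center
    and moving each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Index of the center closest to this value
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ended up empty
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical data: two obvious groups, no labels supplied
values = [1, 2, 3, 10, 11, 12]
centers, clusters = kmeans_1d(values, centers=[1.0, 2.0])
print(centers)   # one center per discovered group
print(clusters)  # the values assigned to each center
```

Note that the algorithm is never told which group a value belongs to. The grouping emerges from the distances alone.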
Here are the top 5 Machine Learning Algorithms for you to understand.
Linear regression estimates real values, such as total sales, house prices, or number of calls, from continuous variables. It establishes a relationship between the independent and dependent variables by fitting a best-fit line. This line is called the “regression line” and is represented by the equation Y = a * X + b.
To build intuition, think back to a childhood experience. Imagine a fifth-grade child is asked to arrange the other children in the class in ascending order of weight, without knowing their weights. The child will most likely size everyone up visually, by height and build, and arrange them on that basis. This is linear regression in spirit: the child has worked out that weight is related to height and build by a relationship that resembles the equation above.
In the equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
Here, a and b are the coefficients. They are derived by minimizing the sum of the squared differences between the data points and the regression line.
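To make the least-squares idea concrete, here is a minimal sketch that computes a and b for a single independent variable using the standard closed-form solution. The function name `fit_line` and the sample data are hypothetical. The data was constructed to lie exactly on Y = 2 * X + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b that minimize the sum of
    squared differences between the points and Y = a * X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of X and Y, and variance of X (both unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data lying exactly on Y = 2 * X + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)
```

With noisy real-world data the points will not sit on the line, and a and b will be the values that keep the total squared error as small as possible.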
Logistic regression is an algorithm that takes an input and classifies it. Take an example problem: suppose you want to classify emails coming into your inbox as “spam” or “not-spam.” You could train a logistic regression model to do this. This is an example of a simple “binary” classifier, where each input is assigned to one of two classes. Logistic regression takes a vector X as input and outputs a value Y, a probability between 0 and 1, that indicates the class to which X belongs.
There may also be more than two classes, as in handwritten digit recognition. In this problem, we are given an image containing a handwritten digit and need to classify it as one of the digits 0 through 9. A good dataset for this problem is the MNIST database. The idea is to flatten the image pixels into a vector X and then apply logistic regression over 10 classes (0 to 9).
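The binary case can be sketched in a few lines of plain Python. This is an illustrative sketch, not a production implementation: the function names, the learning rate, the number of epochs, and the single made-up feature (imagine a count of suspicious words per email) are all assumptions. It trains with batch gradient descent on the log loss.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability that input vector x belongs to class 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weights w and bias b by batch gradient descent."""
    w = [0.0] * len(xs[0])
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = predict_proba(w, b, x) - y  # gradient of the log loss
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical training set: one feature per email, label 1 = spam
xs = [[0], [1], [2], [6], [7], [8]]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
print(predict_proba(w, b, [1]))  # low probability of spam
print(predict_proba(w, b, [7]))  # high probability of spam
```

A threshold of 0.5 on the returned probability turns it into a class label. The 10-class digit problem uses the same idea with one score per class, which is not shown here.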
A Support Vector Machine (SVM) is easiest to understand through an idealized classifier that explains how SVM works in practice. The numeric input variables in the data form an n-dimensional space; for instance, two input variables form a 2-D space. In SVM, a hyperplane is chosen to best separate the points in this space by their class, either 0 or 1. In 2-D, the hyperplane can be visualized as a line, and we assume that all of the input points can be completely separated by this line.
For example: B0 + (B1 * X1) + (B2 * X2) = 0
The learning algorithm finds the intercept (B0) and the coefficients B1 and B2, which determine the slope of the line. This line is then used to make classifications: plugging input values into the left-hand side of the equation tells us whether a point lies above or below the line.
If the point is above the line, the equation returns a value greater than 0 and the point belongs to the first class (class 0).
If the point is below the line, the equation returns a value less than 0 and the point belongs to the second class (class 1). A value close to 0 means the point is close to the line and may be difficult to classify.
If the value is large in magnitude, the model can be more confident in its prediction.
The distance between the line and the closest data points is called the margin. The optimal line that best separates the two classes is the line with the largest margin, called the “Maximal-Margin Hyperplane.”
The margin is computed as the perpendicular distance from the line to only the closest points. Only these points are relevant in defining the line and constructing the classifier; they are called support vectors, because they support, or define, the hyperplane. An optimization procedure is used to find the coefficients that maximize the margin.
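The sketch below evaluates the line B0 + (B1 * X1) + (B2 * X2) = 0 for a few points and computes the margin as the smallest perpendicular distance. It does not train an SVM: the coefficients are hypothetical values picked by hand, and the function names are invented for illustration. The class convention (positive value means class 0) follows the description above.

```python
import math


def decision_value(b0, b1, b2, x1, x2):
    """Left-hand side of the line equation for a given point."""
    return b0 + b1 * x1 + b2 * x2


def classify(b0, b1, b2, x1, x2):
    """Positive value -> class 0, otherwise class 1 (the text's convention)."""
    return 0 if decision_value(b0, b1, b2, x1, x2) > 0 else 1


def distance_to_line(b0, b1, b2, x1, x2):
    """Perpendicular distance from the point to the line."""
    return abs(decision_value(b0, b1, b2, x1, x2)) / math.hypot(b1, b2)


def margin(b0, b1, b2, points):
    """Distance from the line to the closest point(s), the support vectors."""
    return min(distance_to_line(b0, b1, b2, x1, x2) for x1, x2 in points)


# Hypothetical coefficients describing the line X1 + X2 = 4
B0, B1, B2 = -4.0, 1.0, 1.0
points = [(3, 3), (4, 4), (1, 1), (0, 0)]

for x1, x2 in points:
    print((x1, x2), classify(B0, B1, B2, x1, x2),
          distance_to_line(B0, B1, B2, x1, x2))
print("margin:", margin(B0, B1, B2, points))
```

Here the points (3, 3) and (1, 1) are closest to the line, so they play the role of support vectors. A training procedure would search for the B0, B1, and B2 that make this margin as large as possible.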
Random forest is a supervised classification algorithm. Just as a forest looks more robust the more trees it has, a random forest classifier generally gives more accurate and more stable results as the number of trees grows, although the gains diminish beyond a certain point.
How the Random Forest Algorithm Works
First, let’s have a quick overview of the random forest algorithm before we go into the details.
Pseudocode for the algorithm is divided into 2 stages:
- Creation of random forest.
- Code to perform prediction using the created classifier.
Now we’ll discuss these stages in detail. First is the creation stage.
Random Forest pseudocode
- Randomly select “k” features from the total of “m” features, where k << m.
- Among the “k” features, calculate the node “d” using the best split point.
- Split the node into daughter nodes using the best split.
- Repeat steps 1 to 3 until “l” number of nodes has been reached.
- Build the forest by repeating steps 1 to 4 “n” times to create “n” trees.
The random forest algorithm begins by randomly selecting “k” features out of the total “m” features, then uses these “k” features to find the root node with the best split approach.
In the next stage, we use the same best split approach again, this time to calculate the daughter nodes. We continue until we obtain a tree with a root node and the target at the leaf nodes. Finally, we repeat these steps to generate “n” randomly built trees, which together constitute the random forest.
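The two stages, creation and prediction, can be sketched in plain Python as follows. Several details are assumptions made for this sketch and are not specified in the pseudocode: the best split is chosen by Gini impurity, a maximum depth stands in for the limit of “l” nodes, and every tree sees the full dataset (many implementations also resample the rows). All function names and the tiny dataset are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity: 0 when all labels in the node are the same."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Return the (feature, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, k, depth, rng):
    features = rng.sample(range(len(rows[0])), k)  # step 1: pick k of m features
    split = best_split(rows, labels, features) if depth > 0 else None  # step 2
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    # Step 3: split into daughter nodes, then repeat on each (step 4)
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], k, depth - 1, rng),
            build_tree([r for r, _ in right], [y for _, y in right], k, depth - 1, rng))


def build_forest(rows, labels, n_trees, k, max_depth, seed=0):
    rng = random.Random(seed)
    return [build_tree(rows, labels, k, max_depth, rng) for _ in range(n_trees)]  # step 5


def predict_tree(node, row):
    while isinstance(node, tuple):  # internal nodes are tuples, leaves are labels
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def predict_forest(forest, row):
    """Prediction stage: every tree votes, the majority class wins."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]


# Hypothetical dataset: m = 3 features, two classes
rows = [[1, 1, 10], [2, 1, 12], [1, 2, 11], [8, 9, 30], [9, 8, 32], [8, 8, 31]]
labels = [0, 0, 0, 1, 1, 1]
forest = build_forest(rows, labels, n_trees=5, k=2, max_depth=2)
print(predict_forest(forest, [1.5, 1.5, 11]))
print(predict_forest(forest, [8.5, 8.5, 31]))
```

Because each tree looks at a different random subset of features, the trees disagree on harder data, and the majority vote smooths out their individual mistakes.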
In machine learning and statistics, dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
Feature selection – This approach tries to find a subset of the original variables (also called features or attributes). There are three strategies: the filter strategy (for example, information gain), the wrapper strategy (for example, search guided by accuracy), and the embedded strategy, in which features are added or removed while building the model, based on the prediction errors. See also combinatorial optimization problems.
In some cases, data analysis such as regression or classification can be done more accurately in the reduced space than in the original one.
Feature extraction – This approach transforms data in a high-dimensional space into a space of fewer dimensions. The transformation may be linear, as in principal component analysis (PCA), but nonlinear techniques also exist. For multidimensional data, a tensor representation can be used for dimensionality reduction through multilinear subspace learning.
Examples of dimensionality reduction:
Principal Component Analysis (PCA)
Principal component analysis is the main linear technique for dimensionality reduction. It performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the covariance (or sometimes the correlation) matrix of the data is constructed and its eigenvectors are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can then be used to reconstruct a large fraction of the variance of the original data. The first few eigenvectors can often be interpreted in terms of the large-scale physical behavior of the system. The original space, whose dimension equals the number of variables, is thus reduced to the space spanned by a few eigenvectors, with some loss of data but with the most important variance retained.
Principal component analysis can also be applied in a nonlinear way by means of the kernel trick. The resulting technique, called kernel PCA, is capable of constructing nonlinear mappings that maximize the variance in the data.
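For two-dimensional data, the covariance matrix and its largest eigenvector can be worked out by hand, which makes for a compact sketch of linear PCA. The closed-form eigenvalue formula below applies only to a symmetric 2 x 2 matrix. The function names and the five sample points are hypothetical, and a real project would normally use a numerical library instead.

```python
import math


def covariance_2d(points):
    """Return the mean and the entries (sxx, sxy, syy) of the covariance matrix."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return (mean_x, mean_y), (sxx, sxy, syy)


def principal_axis(sxx, sxy, syy):
    """Largest eigenvalue and unit eigenvector of [[sxx, sxy], [sxy, syy]]."""
    largest = (sxx + syy) / 2 + math.hypot((sxx - syy) / 2, sxy)
    if sxy != 0:
        vx, vy = sxy, largest - sxx
    elif sxx >= syy:
        vx, vy = 1.0, 0.0  # already axis-aligned, X has more variance
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return largest, (vx / norm, vy / norm)


# Hypothetical 2-D data that mostly varies along the diagonal
points = [(2, 1), (3, 3), (4, 2), (5, 5), (6, 4)]
mean, (sxx, sxy, syy) = covariance_2d(points)
largest, axis = principal_axis(sxx, sxy, syy)

# Fraction of the total variance kept by the first principal component
explained = largest / (sxx + syy)

# Reduce each 2-D point to a single coordinate along the principal axis
scores = [(x - mean[0]) * axis[0] + (y - mean[1]) * axis[1] for x, y in points]
print(axis, explained)
print(scores)
```

The `scores` list is the reduced, one-dimensional version of the data. The `explained` value shows how much of the original variance survives the reduction.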
Graph-Based Kernel PCA
Other prominent nonlinear techniques include manifold learning techniques such as Isomap, locally linear embedding (LLE), Hessian LLE, Laplacian eigenmaps, and local tangent space alignment (LTSA). These techniques construct a low-dimensional data representation using a cost function that retains local properties of the data, and they can be viewed as defining a graph-based kernel for kernel PCA.
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a generalization of Fisher’s linear discriminant, a method used in statistics, pattern recognition, and machine learning. Its objective is to find a linear combination of features that characterizes or separates two or more classes of objects or events.
Generalized Discriminant Analysis (GDA)
GDA performs nonlinear discriminant analysis using a kernel function operator. The underlying theory is close to that of Support Vector Machines (SVM), in that GDA maps the input vectors into a high-dimensional feature space. Like LDA, GDA aims to find a projection of the features into a lower-dimensional space by maximizing the ratio of between-class scatter to within-class scatter.
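The scatter ratio is easiest to see in the plain linear, two-class case. The sketch below projects two small classes onto a candidate direction and computes the ratio of between-class scatter to within-class scatter for that direction. It does not search for the best direction and does not use a kernel. The function name `scatter_ratio` and the data are hypothetical.

```python
def scatter_ratio(class_a, class_b, w):
    """Between-class scatter divided by within-class scatter
    after projecting 2-D points onto the direction w."""
    proj_a = [x * w[0] + y * w[1] for x, y in class_a]
    proj_b = [x * w[0] + y * w[1] for x, y in class_b]
    mean_a = sum(proj_a) / len(proj_a)
    mean_b = sum(proj_b) / len(proj_b)
    between = (mean_a - mean_b) ** 2
    within = (sum((v - mean_a) ** 2 for v in proj_a)
              + sum((v - mean_b) ** 2 for v in proj_b))
    return between / within


# Hypothetical classes that differ along the first feature only
class_a = [(1, 2), (2, 1), (2, 3), (3, 2)]
class_b = [(6, 2), (7, 1), (7, 3), (8, 2)]

ratio_x = scatter_ratio(class_a, class_b, (1, 0))  # project onto feature 1
ratio_y = scatter_ratio(class_a, class_b, (0, 1))  # project onto feature 2
print(ratio_x, ratio_y)
```

Projecting onto the first feature keeps the classes far apart relative to their spread, while projecting onto the second feature makes them overlap completely. A discriminant analysis looks for the direction with the highest ratio.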
In this blog post, you learned about 5 machine learning algorithms. Stay tuned to the Magoosh Data Science blog for more on data science!