Suppose you’re working on a text classification problem. You’ve refined your training set and maybe even tried it out with Naive Bayes. You’re feeling confident about your dataset and want to take the next step. Enter Support Vector Machines (SVM): a fast and dependable classification algorithm that performs well with a limited amount of data.

Perhaps you’ve dug a bit deeper by now and come across terms like *kernel trick, linearly separable and kernel functions*. Don’t be scared off by them! The idea behind the SVM algorithm is simple, and applying it to natural language classification doesn’t have to be complicated.

Let’s get started!

## How does SVM work?

The best way to understand the basics of Support Vector Machines is with a simple example. Imagine we have two categories, red and blue, and our data has two features, x and y. We want a classifier that, given a pair of (x, y) coordinates, outputs whether the point is red or blue. Below is a plot of our already-labelled training data on a plane:

A support vector machine takes these data points and outputs the hyperplane (which in two dimensions is simply a line) that best separates the categories. This line is the ‘decision boundary’: anything that falls on one side of it is classified as red, and anything on the other side as blue.

But the question remains: what exactly is the *best* hyperplane? For SVM, it’s the one that maximizes the margins from both categories. In other words, it’s the hyperplane (remember that in this case it’s a line) whose distance to the nearest element of each category is the largest.
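As a minimal sketch of this idea, here is a maximum-margin classifier fitted on a handful of made-up 2D points with scikit-learn’s `SVC` (the data and labels are invented for illustration):

```python
# Fit a linear SVM on toy 2D points and classify new coordinates.
from sklearn.svm import SVC

# Two "red" points and two "blue" points in the (x, y) plane.
X = [[0, 0], [1, 1], [4, 4], [5, 5]]
y = ["red", "red", "blue", "blue"]

clf = SVC(kernel="linear")
clf.fit(X, y)

# Points on either side of the learned decision boundary get
# different labels.
print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))  # → ['red' 'blue']
```

The maximum-margin line ends up halfway between the closest red point (1, 1) and the closest blue point (4, 4).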

## Nonlinear data

This was an easy example because the data was clearly linearly separable: we could draw a single straight line to divide red from blue. Sadly, it’s usually not that simple. Take a look at this case:

It’s clear that there is no linear decision boundary here. However, the vectors are still clearly distinguishable, and it looks like it should be easy to separate them.

So here’s what we’ll do: we add a third dimension. Up until now we had two dimensions, x and y. We create a new dimension, z, and decide it should be calculated in a way that’s convenient for us: z = x² + y² (you’ll notice that’s the equation for a circle).

This gives us a three-dimensional space. Taking a slice of that space, it looks like this:

What can SVM do with this? Let’s see:

That’s great! Note that since we are in three dimensions now, the hyperplane is a plane parallel to the xy-plane at a fixed z (say, z = 1).

What’s left is mapping it back to two dimensions:

And there we go! Our decision boundary is a circle of radius 1, and SVM has successfully separated both categories.
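The lift to a third dimension can be sketched numerically (the points below are made up): after computing z = x² + y², the inner and outer circles become separable by a simple threshold on z.

```python
# Show that z = x**2 + y**2 makes concentric circles linearly separable.
import numpy as np

inner = np.array([[0.5, 0.0], [0.0, -0.5], [-0.4, 0.3]])  # inside radius 1
outer = np.array([[2.0, 0.0], [0.0, 2.5], [-1.5, 1.5]])   # outside radius 1

z_inner = (inner ** 2).sum(axis=1)  # all values below 1
z_outer = (outer ** 2).sum(axis=1)  # all values above 1

# A plane at z = 1 separates the two groups.
print(z_inner.max() < 1 < z_outer.min())  # → True
```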

## The kernel trick

In the example above, we found a way to classify nonlinear data by cleverly mapping our space to a higher dimension. However, computing this transformation can get quite expensive: there can be a lot of new dimensions, each possibly involving a complicated calculation. Doing this for every vector in the dataset can be a lot of work, so it would be great to find a cheaper solution.

And here’s the trick: SVM doesn’t need the actual vectors to work its magic; it can get by with only the dot products between them. This means we can sidestep the expensive calculations of the new dimensions entirely! This is what we do instead:

- Imagine the new space we want:

  z = x² + y²

- Figure out what the dot product in that space looks like:

  a · b = xa · xb + ya · yb + za · zb

- Substitute z = x² + y²:

  a · b = xa · xb + ya · yb + (xa² + ya²) · (xb² + yb²)

- Instruct SVM to do its thing, but using the new dot product. We call this a *kernel function*.

That’s the **kernel trick**, which lets us sidestep a lot of expensive calculations. Normally, the kernel is linear, and we get a linear classifier. However, by using a nonlinear kernel (like the one above) we can get a nonlinear classifier without transforming the data at all: we only change the dot product to that of the space we want, and SVM happily chugs along.
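We can check this equivalence numerically with a couple of made-up vectors: the kernel from the list above gives exactly the same value as explicitly lifting the points into three dimensions and taking the dot product there.

```python
# Verify that the kernel equals the dot product in the lifted space.
import numpy as np

def lift(p):
    """Explicitly map a 2D point (x, y) to (x, y, x**2 + y**2)."""
    x, y = p
    return np.array([x, y, x**2 + y**2])

def kernel(a, b):
    """Same dot product, computed without ever leaving 2D."""
    return a @ b + (a @ a) * (b @ b)

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = lift(a) @ lift(b)  # dot product in 3D: 3 - 2 + 50 = 51
implicit = kernel(a, b)       # kernel in 2D:      1 + 5 * 10 = 51
print(np.isclose(explicit, implicit))  # → True
```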

Note that the kernel trick isn’t actually part of SVM; it can be used with other linear classifiers, such as logistic regression, too. A support vector machine only takes care of finding the decision boundary.

## How to use SVM with natural language classification?

So we can classify vectors in multidimensional space. Great! Now we want to apply this algorithm to text classification, and the first thing we need is a way to turn our pieces of text into vectors of numbers so that SVM can run on them. In other words: which features do we use to classify texts with SVM?

The most common answer is word frequencies: we treat every text as a bag of words, and for each word that appears in the bag we have a feature. The value of that feature is how frequently the word appears in the text.

In simple terms, this boils down to counting how many times each word appears in the text and dividing by the total number of words. As a more advanced alternative for calculating frequencies, we can use TF-IDF.
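As a sketch of this vectorization step (the sample sentences are invented), scikit-learn’s `CountVectorizer` produces raw word counts, and `TfidfVectorizer` is the weighted alternative mentioned above:

```python
# Turn texts into word-count and TF-IDF feature vectors.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["the cat sat on the mat", "the dog chased the cat"]

counts = CountVectorizer().fit_transform(texts)
print(counts.toarray())  # one row per text, one column per word

tfidf = TfidfVectorizer().fit_transform(texts)
print(tfidf.shape)       # same shape, with frequency-weighted values
```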

Once we’ve done that, our dataset consists of vectors with thousands of dimensions, one per text, with each dimension representing the frequency of one of the words. Perfect! This is what we feed to SVM for training. We can improve this with preprocessing techniques, such as removing stopwords, stemming, and using n-grams.

## Choosing a kernel function

Now that we have the feature vectors, the only thing left to do is choose a kernel function for our model. Every problem is different, and the right kernel function depends on what the data looks like. In our example, the data was arranged in concentric circles, so we chose a kernel that matched those data points.

With that in mind, what’s best for natural language processing? Do we need a nonlinear classifier? Or is the data linearly separable? It turns out that it’s best to stick to a linear kernel. Why?

In our example we had only two features. Real uses of SVM can have tens or even hundreds of features. NLP classifiers go further still: they use thousands of features, since they can have up to one feature for every word that appears in the training set. This changes the problem a little: while nonlinear kernels may be a good idea in other cases, with this many features they are likely to overfit the data. Therefore, it’s best to stick to the good old linear kernel, which tends to be the best performer in these cases.

## Putting it all together

Now the only thing left is training! We take our set of labelled texts, convert them to vectors using word frequencies, and feed them to the algorithm, which uses our chosen kernel function to produce a model. Then, when we have a new, unlabelled text that needs classifying, we convert it to a vector and give it to the model, which outputs the category of that text.
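The whole flow can be sketched end to end with a scikit-learn pipeline. The tiny dataset and labels below are invented for illustration; a real classifier would need far more samples:

```python
# Vectorize labelled texts with TF-IDF, train a linear SVM,
# then classify a new, unlabelled text.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = [
    "great movie, loved the acting",
    "wonderful film with a touching story",
    "terrible plot and awful pacing",
    "boring, a complete waste of time",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

print(model.predict(["what a wonderful touching film"]))
```

The pipeline applies the same vectorization to new texts that it learned during training, which keeps the feature spaces consistent.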

## Final words

And that’s the basics of Support Vector Machines!

To sum up:

- A support vector machine allows you to classify data that are linearly separable.
- If it isn’t linearly separable, you can use the kernel trick to make it work.
- However, for text classification it’s better to just stick to a linear kernel.

Compared to newer algorithms like neural networks, SVMs have two main advantages: higher speed and better performance with a limited number of samples (in the thousands). This makes the algorithm very suitable for text classification problems, where it’s common to have access to a dataset of at most a couple of thousand tagged samples.
