**Cross Validation** is a technique that data scientists use widely. The problem with machine learning models is that you cannot know how well a model performs until you test it on an independent data set (a data set that was not used for training the model).

Cross Validation comes to the rescue here and helps you estimate the performance of your model. One type of cross validation is the **K-Fold Cross Validation**. Keep reading to learn more!

## What is Cross Validation?

**Cross Validation** is a very useful technique for assessing the performance of machine learning models. It helps you find out how a machine learning model would generalize to an independent data set. You use this technique to estimate how accurate your model's predictions will be in practice.

When you are given a machine learning problem, you will be given two types of data sets: known data (**training data set**) and unknown data (**test data set**). By using cross validation, you are “testing” your machine learning model in the “training” phase to check for overfitting and to get an idea of how your model will generalize to independent data, which is the test data set given in the problem.

In one round of cross validation, you will have to divide your original training data set into two parts:

- Cross validation training set
- Cross validation testing set or Validation set

You will train your machine learning model on the cross validation training set and test the model’s predictions against the validation set. You will get to know how accurate your machine learning model’s predictions are when you compare the model’s predictions on the validation set and the actual labels of the data points in the validation set.
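
A single round like this can be sketched in plain Python. This is only a minimal illustration: the helper names (`holdout_split`, `predict_1nn`, `validation_accuracy`), the toy data, and the tiny 1-nearest-neighbour "model" are all assumptions made for the example, not part of any library.

```python
import random

def holdout_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle the rows and split them into (training, validation) lists."""
    shuffled = rows[:]                       # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]

def predict_1nn(training_rows, x):
    """Hypothetical toy model: return the label of the closest training point."""
    nearest = min(training_rows, key=lambda row: abs(row[0] - x))
    return nearest[1]

def validation_accuracy(training_rows, validation_rows):
    """Fraction of validation points whose predicted label matches the actual label."""
    correct = sum(predict_1nn(training_rows, x) == label for x, label in validation_rows)
    return correct / len(validation_rows)

# Toy (feature, label) pairs standing in for the original training data set.
data = [(float(i), "low") for i in range(10)] + [(float(i), "high") for i in range(20, 30)]

train_rows, validation_rows = holdout_split(data)
print(len(train_rows), len(validation_rows))
print(validation_accuracy(train_rows, validation_rows))
```

The validation rows never take part in "training" here, so the printed accuracy is an estimate of how the toy model behaves on data it has not seen.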

To reduce the variance of this estimate, several rounds of cross validation are performed using different cross validation training sets and validation sets. The results from all the rounds are averaged to estimate the accuracy of the machine learning model.

## K-Fold Cross Validation

**K-Fold Cross Validation** is a common type of cross validation that is widely used in machine learning.

K-fold cross validation is performed as per the following steps:

- Partition the original training data set into *k* equal subsets. Each subset is called a **fold**. Let the folds be named f_1, f_2, …, f_k.
- For i = 1 to i = k:
  - Keep the fold f_i as the validation set and keep all the remaining *k-1* folds in the cross validation training set.
  - Train your machine learning model using the cross validation training set and calculate the accuracy of your model by validating the predicted results against the validation set.
- Estimate the accuracy of your machine learning model by averaging the accuracies derived in all the *k* cases of cross validation.

In the k-fold cross validation method, all the entries in the original training data set are used for both training as well as validation. Also, each entry is used for validation just once.
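
The steps above can be sketched in plain Python as follows. The function names `k_fold_indices` and `cross_validate`, and the `evaluate` callback, are hypothetical names chosen for this illustration. The sketch assumes the data has already been shuffled, and it lets fold sizes differ by one when the data set size is not divisible by *k*.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # When n_samples is not divisible by k, the first folds get one extra entry.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]             # fold f_i
        training = indices[:start] + indices[start + size:]  # the remaining k-1 folds
        yield training, validation
        start += size

def cross_validate(evaluate, n_samples, k):
    """Average the accuracies returned by evaluate(training, validation) over all folds.

    `evaluate` is a hypothetical callback: it should train a model on the
    training indices and return its accuracy on the validation indices.
    """
    scores = [evaluate(training, validation)
              for training, validation in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# Check the property described above: every entry is validated exactly once.
folds = list(k_fold_indices(n_samples=10, k=5))
validation_counts = Counter(i for _, validation in folds for i in validation)
print(validation_counts)
```

Every index shows up in exactly one validation fold and in the training set of the other folds, which is the property stated above.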

Generally, the value of *k* is taken to be 10, but this is not a strict rule: *k* can be any integer from 2 up to the number of entries in the training data set.

## Applications of Cross Validation

The cross validation technique can be used to **compare the performance of different machine learning models on the same data set**. To understand this point better, consider the following example.

Suppose you want to make a classifier for the MNIST data set, which consists of hand-written numerals from 0 to 9. You are considering using either *K Nearest Neighbours (KNN)* or *Support Vector Machine (SVM)*. To compare the performance of the two machine learning models on the given data set, you can use cross validation. This will help you determine which predictive model to work with for the MNIST data set.
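
A comparison along these lines might look like the sketch below. It assumes scikit-learn is installed and, to stay small and offline, uses scikit-learn's bundled 8x8 digits data set as a stand-in for MNIST. Both models are left at their default settings, which is a simplification.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Assumption: the small bundled digits data set stands in for MNIST.
X, y = load_digits(return_X_y=True)

candidates = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
}

mean_accuracy = {}
for name, model in candidates.items():
    # 10-fold cross validation; the default score for classifiers is accuracy.
    scores = cross_val_score(model, X, y, cv=10)
    mean_accuracy[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Because both models are scored on the same folds of the same data, the mean accuracies can be compared directly.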

Cross validation can also be used for **selecting suitable parameters**. The example mentioned below will illustrate this point well.

Suppose you have to build a *K Nearest Neighbours (KNN)* classifier for the MNIST data set. To use this classifier, you must provide an appropriate value of the parameter *k*, the number of neighbours (not to be confused with the number of folds in k-fold cross validation). Choosing the value of *k* intuitively is not a good idea (beware of overfitting!). You can try different values of *k* and use cross validation to estimate the performance of the predictive model corresponding to each one. Finally, go ahead with the value of *k* that gives the best performance on the given data set.
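
As a rough sketch of this search, assuming scikit-learn is installed and again using its bundled digits data set as a stand-in for MNIST, you could loop over a few candidate values. The candidate list below is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the small bundled digits data set stands in for MNIST.
X, y = load_digits(return_X_y=True)

results = {}
for n_neighbors in [1, 3, 5, 7, 9]:          # illustrative candidate values of k
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    # Mean accuracy over 10 folds for this value of k.
    results[n_neighbors] = cross_val_score(model, X, y, cv=10).mean()

best_n_neighbors = max(results, key=results.get)
print(results)
print("best k:", best_n_neighbors)
```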

For the *K Nearest Neighbours (KNN)* classifier, you can even choose different distance metrics (the default is ‘minkowski’ if you use ‘KNeighborsClassifier’ from sklearn). You can use cross validation to determine which metric works best for the data set you have.
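
One way to try this, sketched under the same assumptions (scikit-learn installed, bundled digits data as a stand-in for MNIST), is to let `GridSearchCV` run the cross validation for each metric. The three metrics listed are just examples.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the small bundled digits data set stands in for MNIST.
X, y = load_digits(return_X_y=True)

metrics = ["minkowski", "manhattan", "chebyshev"]   # illustrative choices

# GridSearchCV runs 10-fold cross validation once per candidate metric.
search = GridSearchCV(KNeighborsClassifier(), param_grid={"metric": metrics}, cv=10)
search.fit(X, y)

print("best metric:", search.best_params_["metric"])
print("mean cross validation accuracy:", round(search.best_score_, 3))
```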

## Limitations of Cross Validation

For cross validation to give some meaningful results, the training set and the validation set are required to be drawn from the same population. Also, human biases need to be controlled, or else cross validation will not be fruitful.

The aim of this blog post was to introduce you to cross validation and to help you understand it better. In machine learning, it is always a good idea to play around with different predictive models and their parameters to arrive at the best choice. Fine-tuning your machine learning model is helpful in achieving good results, and of course, cross validation helps you know if you are on the right track to get a good predictive model!
