What Is Overfitting?

There is one mistake that many machine learning beginners make without realizing it. This single mistake alone can ruin an entire machine learning model, no matter how much effort was put in, and I’m not exaggerating! Any guesses what the culprit could be?

Well, it’s overfitting.

In this blog post, you will get to know more about overfitting (and how to avoid it!). Keep reading!

Signal and Noise

Let us begin by explaining two terms that are quite important in data science: signal and noise.

A “signal” is the underlying pattern that your machine learning model aims to learn from the data. “Noise” refers to the random, irrelevant variation in the dataset that is not part of that pattern.

If the dataset is large, the machine learning model will generally learn the signal better than it would from a small dataset. Let me give you an example to help you understand it better.

Suppose you want to model weight versus age of school-going children. If you sample a large number of students for this study, you will get a clearer signal than if you model data for, say, 20 students. In a small sample of 20 students, it is quite possible that, for example, 7 students are overweight; in a large sample, those same students would stand out as outliers instead of making up a third of the data. Hence, a small data sample is generally affected more by noise than a big dataset.
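To make the sample-size point concrete, here is a minimal simulation sketch. It simplifies the example to estimating a single average weight, and every number in it (a 40 kg "signal", noise with a standard deviation of 8 kg, 100 repeated samples) is an assumption chosen only for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

def sample_mean_weight(n):
    """Mean weight (kg) of n simulated students: a 40 kg signal plus random noise."""
    return statistics.mean([40 + rng.gauss(0, 8) for _ in range(n)])

def spread_of_estimates(n, repeats=100):
    """Standard deviation of the sample mean across repeated samples of size n."""
    return statistics.stdev([sample_mean_weight(n) for _ in range(repeats)])

spread_small = spread_of_estimates(20)
spread_large = spread_of_estimates(2000)
print(f"20 students:   estimate varies by about {spread_small:.2f} kg")
print(f"2000 students: estimate varies by about {spread_large:.2f} kg")
```

The estimate from 20 students should jump around far more from sample to sample than the estimate from 2000 students, which is the sense in which a small sample is noisier.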

Noise interferes with signal, which makes it difficult for the machine learning model to identify the underlying signal.

A good machine learning model is the one that is able to identify the correct signal for various datasets. It should be able to distinguish signal from the noise present in the dataset.

Overfitting and Underfitting

Given a dataset and a machine learning model, the goodness of fit refers to how close the predicted values of the machine learning model are to the actual values in the dataset. If the machine learning model also learns the noise along with the relevant data, then the model is said to be an “overfitted model.”

Overfitting typically occurs when the machine learning model is too complex for the amount of data available. In such a case, the model learns the noise in the training data and performs very well on it. However, when you use the model on other datasets, it doesn’t perform well and gives high errors. An overfit model has high variance.

Underfitting, on the other hand, occurs when the machine learning model is too simple. It may happen when the model has too few features or is regularized far more than needed. An underfit machine learning model has high bias and low variance in its predictions, which leads to large errors on both the training data and new data.

A good machine learning model is one that is not too biased and also does not have a lot of variance.
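The contrast between an overfit and an underfit model can be sketched with a tiny k-nearest-neighbours regressor. This is only an illustration under assumed settings: the data is synthetic (signal y = 2x plus Gaussian noise), and the model and the values of k are my own choices, not something prescribed by this post.

```python
import random

rng = random.Random(0)

def make_data(n):
    """Signal y = 2x on [0, 1], plus Gaussian noise with standard deviation 0.3."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 0.3)) for x in xs]

def knn_predict(train, x, k):
    """Predict y as the average target of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, points, k):
    """Mean squared error of the k-NN model on the given points."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in points) / len(points)

train, test = make_data(200), make_data(500)
for k in (1, 15, 200):  # too flexible, moderate, too rigid
    print(f"k={k:3d}  train MSE={mse(train, train, k):.3f}  "
          f"test MSE={mse(train, test, k):.3f}")
```

With k = 1 the model memorizes the training points, so its training error is zero while its test error is noticeably higher (overfitting). With k equal to the size of the training set it predicts the same average everywhere, so both errors are large (underfitting). A moderate k should land in between.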

How can we detect Overfitting?

We cannot be sure of the performance of our machine learning model until we test it on data it has not seen during training. So we need to have a training dataset and a test dataset.

We can make a training dataset and a test dataset from our initial dataset. Just split the initial dataset in a ratio, say 80:20. Keep the 80% sample for training the machine learning model and keep the remaining 20% of the initial dataset for testing purposes.
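An 80:20 split like this can be sketched in a few lines. The function name `train_test_split` and the use of a list of integers as stand-in data are assumptions for the example; shuffling before splitting is a common precaution so that any ordering in the original dataset does not leak into the split.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))        # stand-in for 100 labelled examples
train, test = train_test_split(data)
print(len(train), len(test))   # 80 20
```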

While training and tuning your machine learning model, you can use the technique of Cross-Validation. Do not touch the testing dataset at all till you are done with training your machine learning model. Once your model is trained, you can use your test dataset to test the actual performance of your model.

If your model performs far better on the training dataset than on the test dataset, your model is most likely overfitted. If you see an accuracy of 99% on a training dataset but 52% accuracy on a test dataset, that is an overfitting alert!
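This check can be written down as a simple rule of thumb. The helper name and the 10-point threshold below are assumptions for illustration only; how large a gap is acceptable depends on the problem.

```python
def looks_overfitted(train_accuracy, test_accuracy, max_gap=0.10):
    """Flag a model whose training accuracy exceeds test accuracy by more than max_gap."""
    return (train_accuracy - test_accuracy) > max_gap

print(looks_overfitted(0.99, 0.52))  # True
print(looks_overfitted(0.91, 0.89))  # False
```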

Ways to prevent Overfitting

There are several ways to prevent overfitting of your machine learning model, some of which are mentioned below.

Use more data for training

Using a big training dataset generally helps the machine learning model to pick up the signal more reliably. However, this technique may not work every time. If we add lots of noisy data and the relevant data is sparse, even having a huge amount of total data won’t help the model in accurately predicting values.

Cross-Validation technique

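Cross-validation is a useful technique for evaluating and tuning a machine learning model using only the training data. It does not stop a model from overfitting by itself, but it gives a more reliable estimate of how the model performs on unseen data, which helps you pick models and settings that overfit less.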

The k-fold cross-validation technique works as follows:

  1. Divide the training dataset into k subsets (called folds).
  2. For i = 1 to i = k, treat the ith fold as the validation dataset and train the model using the remaining k - 1 folds.
  3. Average the k validation scores to estimate how the model performs on unseen data.
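The steps above can be sketched as follows. The helpers `fit_mean` and `score_mse` are hypothetical placeholders for a real model and metric, and the sketch averages the per-fold scores at the end, which is the usual way to summarize them. It also assumes the data has already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def cross_validate(xs, ys, k, fit, score):
    """Train on k - 1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(scores) / len(scores)

# Hypothetical placeholder model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

# Hypothetical placeholder metric: mean squared error of that constant prediction.
def score_mse(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)

for _, val_idx in k_fold_indices(10, 3):
    print(val_idx)  # [0, 1, 2, 3] then [4, 5, 6] then [7, 8, 9]

xs = list(range(10))
ys = [2.0 * x for x in xs]
print(cross_validate(xs, ys, k=5, fit=fit_mean, score=score_mse))
```

Every example is used for validation exactly once, so the averaged score does not depend on one lucky or unlucky split.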

Stopping Early

When you train a machine learning model iteratively, you will observe that up to a certain number of iterations, the performance of the model improves on both the training data and held-out data. After that point, further iterations keep improving performance on the training dataset, but the model starts to overfit and performs worse on data it has not seen.

Thus, you should stop the training iterations once the performance on a held-out validation set stops improving, before the model overfits.
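One common way to decide when to stop is to watch the loss on a held-out validation set and quit once it has not improved for a few iterations. The sketch below assumes that approach; the `patience` parameter, the function names, and the synthetic U-shaped loss curve are all illustrative assumptions, not a real training run.

```python
def train_with_early_stopping(step, validation_loss, max_iters=1000, patience=5):
    """Run step() repeatedly; stop once validation loss has not improved for
    `patience` consecutive iterations. Returns (best_iteration, best_loss)."""
    best_loss = float("inf")
    best_iter = 0
    for iteration in range(1, max_iters + 1):
        step()                       # one training iteration
        loss = validation_loss()     # measured on held-out data, not training data
        if loss < best_loss:
            best_loss, best_iter = loss, iteration
        elif iteration - best_iter >= patience:
            break                    # no improvement for `patience` iterations
    return best_iter, best_loss

# Synthetic stand-in for a real model: validation loss falls until
# iteration 30, then rises again as the "model" starts to overfit.
state = {"iteration": 0}

def step():
    state["iteration"] += 1

def validation_loss():
    return (state["iteration"] - 30) ** 2

best_iter, best_loss = train_with_early_stopping(step, validation_loss)
print(best_iter, best_loss, state["iteration"])  # 30 0 35
```

In a real setup you would also save a copy of the model at the best iteration, since training continues for a few more steps before the stop is triggered.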

Regularization

Regularization is done to make the machine learning model simpler. It covers many methods, and the technique used depends on the type of the model. For example, if the model is a decision tree, regularization could be pruning the tree. If the model is a regression, you could add a penalty on large coefficients to the cost function.
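As a small illustration of the regression case, the sketch below fits a single slope with a squared (ridge-style) penalty. The choice of this particular penalty, the no-intercept model, and the toy numbers are assumptions for the example. Minimizing sum((y - w*x)^2) + penalty * w^2 gives the closed form w = sum(x*y) / (sum(x*x) + penalty).

```python
def fit_ridge_slope(xs, ys, penalty):
    """Slope w minimizing sum((y - w*x)**2) + penalty * w**2 (no intercept)."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + penalty)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # exactly y = 2x
print(fit_ridge_slope(xs, ys, penalty=0.0))   # 2.0
print(fit_ridge_slope(xs, ys, penalty=14.0))  # 1.0
```

A larger penalty pulls the slope toward zero, which makes the model simpler. Too large a penalty, as in the second call, pushes it into underfitting.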

Machine learning beginners frequently fall prey to overfitting of their machine learning models. After reading this post, I hope you will be aware of it and cautious when you design your next model 🙂
