Understanding Overfitting and Underfitting for Data Science

Overfitting and Underfitting

Whenever a data scientist works on a prediction or classification problem, they first check accuracy by applying the trained model to the training set and then to the test set. If both the training and testing accuracy are good, the model is considered for further development. But sometimes models give poor results. A good machine learning model aims to generalize well from the training data to any data from the same domain. So why does poor performance happen? The two major causes are overfitting and underfitting. In this article, we walk through what overfitting and underfitting are, observe their effects through Python code, and finally look at some techniques to overcome them.

The terms overfitting and underfitting tell us how well a model generalizes what it has learned from the training data to new, unseen data.

Brief information about Overfitting and Underfitting

Let’s clearly understand overfitted, underfitted, and well-fitted models.


Source: Machine Learning Cheat Sheet

From the three graphs shown above, one can see that the line in the leftmost figure does not follow the data points, so the model is under-fitted. In this case, the model has failed to learn the underlying pattern, leading to poor performance on testing. An under-fitted model is easy to spot because it gives very high errors on both training and testing data. Common causes are high bias (the model is too simple), a noisy or unclean dataset, and too little training data.

When it comes to overfitting, the rightmost graph shows the model passing through every data point exactly, and you might think this is a perfect fit. But it is not! The model learns too many details from the dataset, including the noise, and not every detail learned during training applies to new data points. This negatively affects performance on the testing or validation dataset. Such a model has trained itself in an overly complex manner and has high variance.



The best-fit model is shown by the middle graph, where both training and testing (validation) loss are at a minimum; equivalently, training and testing accuracy are close to each other and both high in value.
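The three regimes above can be sketched with polynomial fits of different degrees on noisy data (a minimal illustration, not the article's diabetes example): a low-degree fit underfits, a very high-degree fit drives training error toward zero while chasing noise, and a moderate degree sits in between.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a smooth underlying curve
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

# Hold out every third point as a test set
test = np.arange(x.size) % 3 == 0
x_tr, y_tr, x_te, y_te = x[~test], y[~test], x[test], y[test]

def mse(degree):
    """Fit a polynomial of this degree; return (train MSE, test MSE)."""
    c = np.polyfit(x_tr, y_tr, degree)
    return (np.mean((np.polyval(c, x_tr) - y_tr) ** 2),
            np.mean((np.polyval(c, x_te) - y_te) ** 2))

for d in (1, 4, 15):
    tr, te = mse(d)
    print(f"degree {d:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

Training error always shrinks as the degree grows, but only the moderate-degree model keeps train and test error close together, which is exactly the "best fit" regime described above.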

Observing the effect of Overfitting and Underfitting practically: 

We will check the accuracy and error of two regression models: a Decision Tree regressor and Linear Regression.

Sklearn's inbuilt Diabetes dataset is used for the modeling.

Import necessary libraries.

 import matplotlib.pyplot as plt  
 import pandas as pd
 import numpy as np
 from sklearn.datasets import load_diabetes
 from sklearn.metrics import mean_absolute_error
 from sklearn.model_selection import KFold
 from sklearn.linear_model import LinearRegression
 from sklearn.tree import DecisionTreeRegressor 

Loading the dataset and selecting the input and output features. There is no need to preprocess the data as it is already preprocessed. 

 load_data = load_diabetes()  # load the dataset
 x =           # selecting input features
 y =         # target variable
 pd.DataFrame(x, columns=load_data.feature_names).head()  # see the distribution of data

Here is what the data looks like:

To split the data into training and testing sets, I have used K-Fold cross-validation, which gives K pairs of train and test data and lets us track accuracy and error for each subset.
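As a quick sanity check on how `KFold` behaves (on dummy data, not the diabetes set), the snippet below shows that 20 splits yield 20 train/test pairs and that every sample appears in exactly one test fold:

```python
import numpy as np
from sklearn.model_selection import KFold

x = np.arange(100).reshape(-1, 1)   # 100 dummy samples

kf = KFold(n_splits=20, shuffle=True, random_state=0)
splits = list(kf.split(x))

print(len(splits))                    # 20 train/test index pairs
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))  # 95 and 5

# Every sample lands in exactly one test fold
all_test = np.concatenate([te for _, te in splits])
print(np.array_equal(np.sort(all_test), np.arange(100)))  # True
```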

The approach I have followed is pretty simple. Twenty folds are created; for each fold, the training and testing accuracy are computed and stored in a list, and the same is done for the mean absolute error. Finally, four graphs are plotted, for the train and test error and the train and test accuracy, which give a clear picture of the results.

The code for Linear Regression and the Decision Tree is exactly the same; only the estimator, i.e. the algorithm where our model is defined, is changed. That is why only the code for Linear Regression is shown here.

 kf = KFold(n_splits=20, shuffle=True)  # defining fold parameters
 # create empty lists to append scores and errors
 training_error = []
 training_accuracy = []
 testing_error = []
 testing_accuracy = []
 for train_index, test_index in kf.split(x):
     # divide the data into train and test
     x_train, x_test = x[train_index], x[test_index]
     y_train, y_test = y[train_index], y[test_index]
     # fit the Linear Regression model
     model = LinearRegression()
     model.fit(x_train, y_train)
     # get the predictions for train and test data
     train_data_pred = model.predict(x_train)
     test_data_pred = model.predict(x_test)
     # append the mean absolute errors to the lists
     training_error.append(mean_absolute_error(y_train, train_data_pred))
     testing_error.append(mean_absolute_error(y_test, test_data_pred))
     # append the accuracy (the R^2 score returned by model.score) to the lists
     training_accuracy.append(model.score(x_train, y_train))
     testing_accuracy.append(model.score(x_test, y_test))

Code to show the plots of accuracy and errors of train and test data.

plt.plot(training_error)
plt.xlabel('No of folds')
plt.title('Training error across folds')
plt.show()

plt.plot(testing_error)
plt.xlabel('No of folds')
plt.title('Testing error across folds')
plt.show()

plt.plot(training_accuracy)
plt.xlabel('No of folds')
plt.title('Training accuracy across folds')
plt.show()

plt.plot(testing_accuracy)
plt.xlabel('No of folds')
plt.title('Testing accuracy across folds')
plt.show()

Output plots for Linear Regression:

Output plots for the Decision Tree:


Comparing the two algorithms: for Linear Regression, the 1st and 2nd plots show that the training and testing errors are nearly the same, but both are significantly high, and the 3rd and 4th plots show that the train and test accuracy are also nearly the same, but significantly low. From this, can you guess what the problem with the model is? If you guessed underfitting, you are right. The Linear Regression model fails to learn the patterns in the training data and therefore also fails to generalize to the testing set.

From the Decision Tree plots, looking at the 1st and 2nd error plots (training and testing respectively), the error is, surprisingly, literally zero on the training set, while the testing error is huge. The same pattern appears in the accuracy plots: training accuracy is 100%, while testing accuracy even drops below zero, below −80% for fold number 16. A negative accuracy here means a negative R² score: the model's predictions fit the data worse than simply predicting the mean, so the regression is not following the trend of the data at all. So you might have guessed it: this is an overfitting problem. As explained earlier, while training, the Decision Tree learns too much detail from the data, which is why it fails on the testing dataset.
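The "negative accuracy" is a property of the R² score that `model.score` returns for regressors: it is 0 when the model does no better than always predicting the mean of the targets, and negative when it does worse. A tiny illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Predicting the mean everywhere gives R^2 = 0
print(r2_score(y_true, np.full(4, y_true.mean())))       # 0.0

# Predictions worse than the mean give a negative R^2
print(r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0])))  # -3.0
```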

Here is the Colab notebook with the above implementation.

End Points:

So is there any way to deal with these problems? Overfitting is the more commonly encountered problem, and it is also more important to detect whether a model is overfitted than underfitted, because performance on the training set can be far from the actual results we care about. There are various techniques to deal with overfitting (e.g. regularization, pruning, cross-validation, gathering more data) and underfitting (e.g. using a more expressive model or better features).
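As one concrete example (a sketch, not the notebook's code): simply capping the Decision Tree's `max_depth` restricts how much detail it can memorize, and on the same diabetes data the constrained tree generalizes noticeably better under cross-validation:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

x, y = load_diabetes(return_X_y=True)

# Unconstrained tree: can grow until it memorizes the training data
deep = DecisionTreeRegressor(random_state=0)
# Constrained tree: max_depth limits model complexity
shallow = DecisionTreeRegressor(max_depth=3, random_state=0)

for name, model in (("unconstrained", deep), ("max_depth=3", shallow)):
    scores = cross_val_score(model, x, y, cv=5)  # 5-fold CV R^2 scores
    print(f"{name}: mean CV R^2 = {scores.mean():.3f}")
```

The exact choice of `max_depth=3` here is illustrative; in practice the depth (or other pruning parameters such as `min_samples_leaf`) would itself be tuned via cross-validation.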

Again, there is some trial and error involved: you have to train and evaluate your model with various algorithms and settings to limit these problems.

Vijaysinh Lendave
Vijaysinh is an enthusiast in machine learning and deep learning. He is skilled in ML algorithms, data manipulation, handling and visualization, model building.
