Whenever a data scientist builds a model for prediction or classification, they first measure its accuracy by applying the trained model to the training set and then to the test set. If both the training and testing accuracy are good, the model is considered for further development. A good machine learning model aims to generalize well from the training data to any data from the same domain. But sometimes models give poor results. So why does this happen? The major causes of poor performance in machine learning models are overfitting and underfitting. Here we walk through what overfitting and underfitting are, observe their effect with Python code, and lastly point to some techniques to overcome them.
The terms overfitting and underfitting describe whether a model succeeds in learning from the training data and generalizing to data it has not seen.
Brief information about Overfitting and Underfitting
Let’s clearly understand overfitting, underfitting and perfectly fit models.
From the three graphs shown above, the line in the leftmost figure does not follow the data points, so we can say that the model is under-fitted. In this case, the model has failed to learn the pattern in the training data and so cannot generalize it to a new dataset, leading to poor performance on testing. An under-fitted model is easy to spot because it gives very high errors on both training and testing data. Typical causes are a model with high bias (too simple for the pattern), a dataset that is not clean and contains noise, or training data that is too small.
When it comes to overfitting, as shown in the rightmost graph, the model passes through all the data points, and you might think this is a perfect fit. But actually, no, it is not a good fit! Because the model learns too many details from the dataset, it also learns the noise. This hurts it on new data: not every detail that the model learned during training also applies to new data points, which gives poor performance on the testing or validation dataset. This happens because the model is too complex for the data and has high variance.
The best-fit model is shown by the middle graph, where both training and testing (validation) loss are low; in other words, training and testing accuracy are close to each other and both high.
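The three cases described above can be reproduced on a toy problem. The following is a minimal sketch, not part of the original experiment: it assumes synthetic data drawn from a noisy sine curve and uses polynomial degree (1, 4 and 15, chosen only for illustration) as a stand-in for model complexity. The helper names `make_data` and `mse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def make_data(n):
    """Assumed toy data: a sine curve plus Gaussian noise."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.2, n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(20)   # small training set
x_test, y_test = make_data(200)    # larger set of unseen points

results = {}
for degree in (1, 4, 15):
    # Least-squares polynomial fit; a higher degree means a more flexible model.
    # NumPy may warn that the degree-15 fit is poorly conditioned, which is
    # itself a sign that the model is too flexible for 20 points.
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

Training error can only go down as the degree grows, so it says little on its own; the comparison with the error on unseen points is what separates a too-simple model from a too-flexible one.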
Observing the effect of Overfitting and Underfitting practically:
We will check the accuracy and error of two regression models: a Decision Tree regressor and Linear Regression.
Sklearn's built-in Diabetes dataset is used for the modeling.
Import necessary libraries.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
Loading the dataset and selecting the input and output features. There is no need to preprocess the data as it is already preprocessed.
load_data = load_diabetes()  # load the dataset
x = load_data.data           # input features
y = load_data.target         # target variable
pd.DataFrame(x, columns=load_data.feature_names).head()  # preview the first rows
Here is what the data looks like:
To split the data into training and testing sets, I have used K-Fold cross-validation, which produces K train/test splits and so lets us look at accuracy and error for each split.
The approach that I have followed is pretty simple. Twenty folds are created, and for each fold the training and testing accuracy are stored in lists; the same is done for the mean absolute error. Here "accuracy" means the value returned by model.score, which for a regressor is the R² score. Lastly, four graphs are plotted for train and test error and train and test accuracy, which give a clear picture of this test.
The code used for Linear Regression and Decision Tree is exactly the same; the only change is the estimator, i.e. the algorithm used where the model is defined. That's why only the code for Linear Regression is shown here.
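To make the splitting concrete, here is a small sketch of what twenty folds look like in terms of sample counts. It assumes a placeholder array with 442 rows (the size of the Diabetes dataset) instead of the real features, and a fixed `random_state` for repeatability; the names `x_demo`, `kf_demo` and `fold_sizes` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 442  # number of rows in the Diabetes dataset
x_demo = np.arange(n_samples).reshape(-1, 1)  # placeholder features

# random_state is an assumption added here; the article's code leaves it unset
kf_demo = KFold(n_splits=20, shuffle=True, random_state=0)

# Each split yields index arrays for the training part and the held-out part
fold_sizes = [(len(train_idx), len(test_idx))
              for train_idx, test_idx in kf_demo.split(x_demo)]

for fold, (n_train, n_test) in enumerate(fold_sizes, start=1):
    print(f"fold {fold:2d}: {n_train} training rows, {n_test} testing rows")
```

Every row ends up in the testing part of exactly one fold, so each test fold holds only 22 or 23 rows. With test sets this small, the per-fold test scores can swing noticeably from fold to fold.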
kf = KFold(n_splits=20, shuffle=True)  # defining fold parameter

# empty lists to collect the score and error for each fold
training_error = []
training_accuracy = []
testing_error = []
testing_accuracy = []

for train_index, test_index in kf.split(x):
    # divide the data into train and test
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # load the Linear Regression model
    model = LinearRegression()
    model.fit(x_train, y_train)

    # get the prediction for train and test data
    train_data_pred = model.predict(x_train)
    test_data_pred = model.predict(x_test)

    # appending the errors to the list
    training_error.append(mean_absolute_error(y_train, train_data_pred))
    testing_error.append(mean_absolute_error(y_test, test_data_pred))

    # appending the accuracy (R² score) to the list
    training_accuracy.append(model.score(x_train, y_train))
    testing_accuracy.append(model.score(x_test, y_test))
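Since only the estimator changes between the two experiments, the loop can also be wrapped in a function and called once per model. This is a sketch of one way to do that, not the code from the notebook: the function name `evaluate_across_folds`, its `make_model` and `seed` parameters, and the fixed `random_state` values are all assumptions added for repeatability.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

def evaluate_across_folds(make_model, x, y, n_splits=20, seed=0):
    """Fit a fresh model on every fold and collect train/test MAE and R²."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {"train_mae": [], "test_mae": [], "train_r2": [], "test_r2": []}
    for train_index, test_index in kf.split(x):
        x_train, x_test = x[train_index], x[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model = make_model()  # new, unfitted estimator for each fold
        model.fit(x_train, y_train)
        scores["train_mae"].append(mean_absolute_error(y_train, model.predict(x_train)))
        scores["test_mae"].append(mean_absolute_error(y_test, model.predict(x_test)))
        scores["train_r2"].append(model.score(x_train, y_train))
        scores["test_r2"].append(model.score(x_test, y_test))
    return {name: np.array(values) for name, values in scores.items()}

data = load_diabetes()
x, y = data.data, data.target

linear = evaluate_across_folds(LinearRegression, x, y)
tree = evaluate_across_folds(lambda: DecisionTreeRegressor(random_state=0), x, y)

for name, result in (("Linear Regression", linear), ("Decision Tree", tree)):
    print(f"{name}: mean train MAE = {result['train_mae'].mean():.2f}, "
          f"mean test MAE = {result['test_mae'].mean():.2f}")
```

Passing a callable rather than a fitted model keeps the folds independent, because every fold starts from an untrained estimator.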
Code to show the plots of accuracy and errors of train and test data.
plt.figure(figsize=(10, 10))
folds = range(1, kf.get_n_splits() + 1)

plt.subplot(2, 2, 1)
plt.plot(folds, np.array(training_error).ravel(), 'o-')
plt.xlabel('Fold number')
plt.ylabel('Error')
plt.title('Training error across folds')

plt.subplot(2, 2, 2)
plt.plot(folds, np.array(testing_error).ravel(), 'o-')
plt.xlabel('Fold number')
plt.ylabel('Error')
plt.title('Testing error across folds')

plt.subplot(2, 2, 3)
plt.plot(folds, np.array(training_accuracy).ravel(), 'o-')
plt.xlabel('Fold number')
plt.ylabel('Accuracy')
plt.title('Training accuracy across folds')  # fixed: this panel shows training accuracy

plt.subplot(2, 2, 4)
plt.plot(folds, np.array(testing_accuracy).ravel(), 'o-')
plt.xlabel('Fold number')
plt.ylabel('Accuracy')
plt.title('Testing accuracy across folds')

plt.tight_layout()
plt.show()
Output plot of Linear Regression:
Output plot of Decision Tree:
Let us compare the two algorithms. For Linear Regression, the 1st and 2nd plots show that the training and testing errors are nearly the same, but both are significantly high. The accuracy in the 3rd and 4th plots tells the same story: train and test accuracy are nearly the same, but both are significantly low. So from the above explanation, can you guess what the problem with that model is? If you have guessed underfitting, then yes, you are right. The Linear Regression model fails to learn the patterns in the training data set and therefore also fails to generalize to the testing set.
In the plots for the Decision Tree, look at the 1st and 2nd error plots for training and testing respectively: surprisingly, the error is literally zero for the training set, while for testing it is huge. The same thing is observed in the accuracy plots, where training accuracy is 100% while testing accuracy drops to roughly -80%. The accuracy here is the R² score, so a negative value means that for fold number 16 the model's predictions are worse than simply predicting the mean of the target; the model is not following the trend of the test data at all. So you might have guessed it is an overfitting problem. As explained earlier, while training a Decision Tree, the algorithm learns too much from the data, including its noise. That's why it failed on the testing dataset.
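A negative accuracy can look odd, so here is a short worked example of how the R² score behaves. It is a sketch with made-up numbers; the helper `r2` is hypothetical and simply applies the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

def r2(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect_pred = y_true.copy()               # every point predicted exactly
mean_pred = np.full(4, y_true.mean())      # always predict the mean (2.5)
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])  # trend predicted the wrong way round

print(r2(y_true, perfect_pred))  # 1.0
print(r2(y_true, mean_pred))     # 0.0
print(r2(y_true, bad_pred))      # -3.0
```

A score of 1 is a perfect fit, 0 is no better than predicting the mean, and anything below 0 is worse than that baseline, which is the situation the Decision Tree ends up in on some test folds.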
Here is the Colab Notebook for the above implementation of code
So is there any way to deal with these problems? Overfitting is the more commonly encountered of the two, and it is also more important to know whether a model is overfitted than underfitted, because an overfitted model can look excellent during training and still be far off on the real data that we care about most. Here you can find various techniques used to deal with Overfitting problems and Underfitting problems.
Again, there is no fixed recipe: you have to train and test your model with various algorithms to limit these problems.
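As one illustration of such a technique, limiting how deep a Decision Tree may grow is a common way to reduce its variance. The sketch below is an assumption-laden example and not taken from the original notebook: the depth of 3, the 5-fold setup and the variable names `deep`, `shallow`, `deep_gap` and `shallow_gap` are chosen only for demonstration.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

data = load_diabetes()
x, y = data.data, data.target

def train_test_r2(model):
    """Mean train and test R² over 5 folds (the default score for regressors)."""
    result = cross_validate(model, x, y, cv=5, return_train_score=True)
    return result["train_score"].mean(), result["test_score"].mean()

# Unrestricted tree: free to memorize the training data
deep = train_test_r2(DecisionTreeRegressor(random_state=0))
# Depth-limited tree: max_depth=3 is an illustrative value, not a tuned one
shallow = train_test_r2(DecisionTreeRegressor(max_depth=3, random_state=0))

deep_gap = deep[0] - deep[1]
shallow_gap = shallow[0] - shallow[1]

print(f"unrestricted: train R2 = {deep[0]:.2f}, test R2 = {deep[1]:.2f}, gap = {deep_gap:.2f}")
print(f"max_depth=3 : train R2 = {shallow[0]:.2f}, test R2 = {shallow[1]:.2f}, gap = {shallow_gap:.2f}")
```

The gap between train and test score is the quantity to watch: a smaller gap suggests less overfitting, although a model restricted too far would drift back towards underfitting.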