Tackling Underfitting And Overfitting Problems In Data Science


One of the major challenges in data science, especially in machine learning, is how well a model fits its training data. Underfitting and overfitting are the terms most often used when discussing this problem.

For the uninitiated: in data science, overfitting means the learning model depends far too heavily on its training data, while underfitting means the model captures too little of the relationship in the training data. Ideally, neither should be present in a model, but both are usually hard to eliminate entirely.

Overcoming Overfitting

ML experts and statisticians use a range of techniques to reduce overfitting in ML models. The most popular are cross-validation and regularisation, both of which have proven effective at detecting and curbing overfitting. Beyond these, there are other ways to reduce overfitting in models.
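To make regularisation concrete, here is a minimal sketch of ridge (L2) regression in plain NumPy, using the closed-form solution on hypothetical synthetic data. The penalty term `lam * I` shrinks the fitted weights, which discourages the overly flexible fits that drive overfitting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic regression problem: 20 samples, 5 features (illustrative only).
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=20)

def fit_linear(X, y, lam=0.0):
    """Closed-form least squares; lam > 0 adds an L2 (ridge) penalty."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_ols = fit_linear(X, y)             # plain least squares, no penalty
w_ridge = fit_linear(X, y, lam=5.0)  # penalised fit with shrunken weights

print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))  # True
```

The shrinkage is guaranteed: in the eigenbasis of `X.T @ X`, each ridge coefficient is the corresponding least-squares coefficient scaled by a factor strictly less than one.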


Generalisation: Make sure the model learns to generalise rather than merely memorise the training data. One way to do this is to feed the model more data; more data generally improves accuracy. However, it also makes the model more computation- and memory-intensive.

Data Augmentation: This is where data augmentation comes into the picture. Instead of collecting large amounts of new data, reworking and transforming the existing data can go a long way towards reducing overfitting.
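A minimal sketch of the idea, assuming image-like data: from one sample, cheap transformations such as flips and small pixel noise yield several extra training variants. The 8x8 array below is a stand-in for real data.

```python
import numpy as np

rng = np.random.default_rng(42)

# A stand-in 8x8 grayscale "image"; in practice this would be real training data.
image = rng.random((8, 8))

def augment(img, rng):
    """Create cheap variants of one sample: flips plus small pixel noise."""
    flipped_lr = np.fliplr(img)   # mirror left-right
    flipped_ud = np.flipud(img)   # mirror top-bottom
    noisy = np.clip(img + rng.normal(scale=0.05, size=img.shape), 0.0, 1.0)
    return [flipped_lr, flipped_ud, noisy]

augmented = augment(image, rng)
print(len(augmented))  # 3 extra samples derived from a single original
```

Each variant preserves the underlying label while presenting the model with a slightly different input, which is what blunts the memorisation that causes overfitting.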

Example: Neural networks, widely used in pattern-recognition tasks, are prone to overfitting. The larger the network, the more complex the functions it can create. Hence, finding the optimum size for the right statistical fit is key. This can be done through a number of methods; one of the simplest is retraining the network at different sizes, since it does not involve tweaking many other parameters.

Overfitting generally occurs in nonlinear ML models, since many variables are at play in deciding the relationships within the data. A better way to address the problem is k-fold cross-validation: the model is trained and tested k times on different subsets of the data, so its performance on unseen data can be checked. Any overfitting shows up as a gap between training and validation performance and can then be reduced.
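The fold mechanics can be sketched in a few lines of NumPy. The `fit` and `score` callables below are deliberately trivial placeholders (a model that predicts the training mean, scored by mean squared error), not a real learner:

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_validate(X, y, k, fit, score):
    """Train on k-1 folds and evaluate on the held-out fold, k times over."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(score(model, X[test_idx], y[test_idx]))
    return scores

X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * X[:, 0] + 1

# Placeholder "model": predict the training mean everywhere.
fit = lambda X_tr, y_tr: y_tr.mean()
score = lambda mean, X_te, y_te: float(np.mean((y_te - mean) ** 2))

mse_per_fold = cross_validate(X, y, k=5, fit=fit, score=score)
print(len(mse_per_fold))  # 5 held-out error estimates
```

The spread of the k held-out scores is what reveals whether the model generalises or merely memorises its training folds.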

Lately, ensemble methods such as Bayesian averaging, boosting and bagging have also indirectly helped curb overfitting. How? By combining many models, ensembles average out the overfitting tendencies of the individual models. Of the three, boosting and bagging are used far more often than Bayesian averaging.
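A small sketch of the bagging idea, under assumed synthetic data: each member of the ensemble is a deliberately flexible polynomial fitted on a bootstrap resample, and averaging the members' predictions smooths out the noise each one overfits.

```python
import numpy as np

rng = np.random.default_rng(7)

# Noisy 1-D regression data (illustrative only).
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

def bagged_predict(x, y, x_eval, n_models=50, degree=6):
    """Bagging: fit each polynomial on a bootstrap resample, then average."""
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, x.size, size=x.size)  # sample with replacement
        coeffs = np.polyfit(x[idx], y[idx], degree)  # flexible base model
        preds.append(np.polyval(coeffs, x_eval))
    return np.mean(preds, axis=0)

y_bagged = bagged_predict(x, y, x)
```

Any single degree-6 fit chases the noise in its own resample; the average prediction varies far less, which is precisely the variance reduction bagging is meant to deliver.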

Eliminating Underfitting

Although underfitting is observed less often in ML models than overfitting, it should not be overlooked. The general symptom is a poor relationship between the data and the model: either the model is too simple to establish a stable learning pattern, or it performs poorly even on the training data.

Experts suggest that this problem can be alleviated by simply using more (good!) data for the project. In addition, the following ways can also be used to tackle underfitting.

  • Increase the size or number of parameters in the ML model.
  • Increase the complexity or type of the model.
  • Increase the training time until the cost function is minimised.

Example: Transforming a linear model’s input features into non-linear ones, for instance by adding polynomial terms. This transformation gives the model the flexibility to capture patterns, in both the training data and new data, that a purely linear fit would miss.
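A minimal sketch of that feature transformation, assuming a clearly non-linear synthetic target: a straight-line fit underfits, while simply appending an x² column lets the same least-squares machinery capture the curve.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-2.0, 2.0, 50)
y = x ** 2 + rng.normal(scale=0.1, size=x.size)  # clearly non-linear target

def train_mse(features, y):
    """Least-squares fit, then mean squared error on the training data."""
    coeffs, *_ = np.linalg.lstsq(features, y, rcond=None)
    return float(np.mean((features @ coeffs - y) ** 2))

linear = np.column_stack([np.ones_like(x), x])             # intercept + x
quadratic = np.column_stack([np.ones_like(x), x, x ** 2])  # adds an x^2 term

print(train_mse(quadratic, y) < train_mse(linear, y))  # True
```

Adding a feature column can never increase least-squares training error, and here the x² term is exactly what the underfitting linear model was missing.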


Both overfitting and underfitting should be reduced as far as possible. As ML expert Jason Brownlee puts it, a statistically “good fit” is what matters when choosing an ML model. This can only be achieved by repeatedly testing the model on different data and seeing where it falls between overfitting and underfitting.

Furthermore, before building an ML model to solve a problem, it is also worth taking a hard look at the data itself. After all, the problem may lie in the type of data being fed to the model.

Abhishek Sharma
I research and cover latest happenings in data science. My fervent interests are in latest technology and humor/comedy (an odd combination!). When I'm not busy reading on these subjects, you'll find me watching movies or playing badminton.
