One of the major challenges in data science, especially in machine learning, is how closely a model fits its training data. Underfitting and overfitting are the two terms that come up most often when discussing this problem.
For the uninitiated, overfitting means that the model has learned the training data too closely, noise included, and so performs poorly on new data. Underfitting means that the model fails to capture the underlying pattern even in the training data. Ideally, a model should show neither, but both are usually hard to eliminate entirely.
ML practitioners and statisticians use a range of techniques to reduce overfitting in ML models. The most popular are cross-validation and regularisation: cross-validation helps detect overfitting, and regularisation limits it by penalising model complexity. Apart from these techniques, there are other ways to reduce overfitting in models.
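To make regularisation a little more concrete, here is a minimal sketch of an L2 (ridge) penalty applied to a one-feature linear model with no intercept. The function name `ridge_slope`, the penalty values and the toy data are illustrative assumptions, not taken from any particular library or from the sources cited in this article.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form slope w that minimises sum((y - w*x)**2) + lam * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Toy data, made up for illustration: roughly y = 2x with a little noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A larger penalty pulls the slope towards zero, i.e. towards a simpler model.
for lam in (0.0, 1.0, 10.0):
    print(f"lambda={lam:>4}: slope={ridge_slope(xs, ys, lam):.3f}")
```

With `lam=0.0` this is ordinary least squares; increasing `lam` trades some fit on the training data for a smaller, more conservative coefficient.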
Generalisation: Make sure the model learns patterns that generalise rather than memorising the training set. One way to do this is to feed more data to the model. More data often improves accuracy as well. However, it also makes training more computation- and memory-intensive.
Data Augmentation: This is where data augmentation comes into the picture. Instead of collecting large amounts of new data, transforming and reworking the existing data (for example, flipping images or adding noise to them) can go a long way in reducing overfitting.
Example: Neural networks, which are widely used in pattern recognition tasks, are prone to overfitting. The larger the network, the more complex the functions it can represent. Hence, choosing a network size that gives the right statistical fit is key. This can be done through a number of methods. Arguably the best among them is retraining the neural network, since it is comparatively simple and does not involve tweaking many parameters.
Generally, overfitting occurs in nonlinear ML models, since they have many parameters available to describe relationships in the data. This flexibility lets the model fit noise as well as signal. A better way to address this problem is k-fold cross-validation. Here, the data is split into k subsets and the model is trained k times, each time holding out a different subset for testing, to check how it performs on unseen data. Overfitting then shows up as a gap between training and held-out performance, and can be addressed.
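As a rough illustration of data augmentation, the sketch below expands a tiny image dataset with mirrored and slightly noisy copies. The helper names (`horizontal_flip`, `add_noise`, `augment`), the noise scale and the toy image are all hypothetical, and the sketch assumes that flipping or jittering an image does not change its label, which is not true for every task.

```python
import random

def horizontal_flip(image):
    """Mirror each row of a 2-D list of pixel values."""
    return [row[::-1] for row in image]

def add_noise(image, scale, rng):
    """Add small Gaussian jitter to every pixel."""
    return [[px + rng.gauss(0.0, scale) for px in row] for row in image]

def augment(dataset, rng, noise_scale=0.05):
    """Return the original (image, label) pairs plus flipped and noisy copies."""
    out = []
    for image, label in dataset:
        out.append((image, label))
        out.append((horizontal_flip(image), label))  # assumes label is flip-invariant
        out.append((add_noise(image, noise_scale, rng), label))
    return out

rng = random.Random(0)
tiny_image = [[0.0, 0.5, 1.0],
              [0.2, 0.4, 0.6]]
dataset = [(tiny_image, "cat")]
augmented = augment(dataset, rng)
print(len(dataset), "->", len(augmented), "examples")
```

Real projects would normally rely on an image library for these transformations; the point here is only that each original example yields several plausible variants.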
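The cross-validation procedure can be sketched in a few lines. This is a minimal, standard-library-only version that assumes a one-feature least-squares line as the model; `fit_line`, `mse`, `k_fold_scores` and the synthetic data are illustrative names and values, not part of any library.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def mse(model, xs, ys):
    """Mean squared error of the line (a, b) on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_scores(xs, ys, k, seed=0):
    """Return the held-out MSE for each of k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of similar size
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(model, [xs[i] for i in held_out],
                          [ys[i] for i in held_out]))
    return scores

# Synthetic data (an assumption for the demo): y = 3x + 1 plus Gaussian noise.
rng = random.Random(1)
xs = [i / 2 for i in range(20)]
ys = [3.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in xs]

scores = k_fold_scores(xs, ys, k=5)
print("per-fold MSE:", [round(s, 3) for s in scores])
print("mean MSE:", round(sum(scores) / len(scores), 3))
```

Comparing the mean held-out error with the error on the training data gives a rough signal: a held-out error that is much larger than the training error suggests overfitting.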
Lately, ensemble methods such as Bayesian averaging, boosting and bagging have also helped in reducing overfitting. How? Ensemble methods combine the predictions of several models, so the errors of individual models, including those caused by overfitting, tend to average out. Boosting and bagging are more widely used than Bayesian averaging.
Although underfitting is observed less often in ML models, it should not be overlooked. Here, the model fails to capture the relationship in the data. This means that the model is either too simple to establish a stable learning pattern or performs very poorly even on the training data.
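Of the ensemble methods mentioned, bagging is the easiest to sketch. The example below is a simplified illustration under stated assumptions: the base model is a 1-nearest-neighbour regressor (chosen only because it is a high-variance model), and the function names, the number of models and the synthetic data are all hypothetical.

```python
import random

def nearest_neighbour_predict(train, x):
    """Predict with the y of the closest training x (a high-variance model)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def bagged_predict(data, x, n_models, rng):
    """Average 1-NN predictions over bootstrap resamples of the data."""
    preds = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # sample with replacement
        preds.append(nearest_neighbour_predict(sample, x))
    return sum(preds) / len(preds)

rng = random.Random(0)
# Synthetic data (an assumption for the demo): y = x plus Gaussian noise.
data = [(i / 10, i / 10 + rng.gauss(0.0, 0.3)) for i in range(50)]

query = 2.5
single = nearest_neighbour_predict(data, query)
bagged = bagged_predict(data, query, n_models=200, rng=rng)
print(f"single 1-NN: {single:.3f}, bagged: {bagged:.3f}, noise-free value: {query}")
```

A single 1-NN model simply repeats the noisy label of the nearest point, whereas the bagged prediction averages over many resampled neighbourhoods, which tends to smooth out that noise.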
Experts suggest that this problem can be alleviated by simply using more (good!) data for the project. In addition, the following ways can also be used to tackle underfitting.
- Increase the size or number of parameters in the ML model.
- Increase the complexity of the model, or switch to a more expressive type of model.
- Increase the training time until the cost function is minimised.
Example: Applying a non-linear transformation to the inputs of a linear model. In this case, the transformed model becomes more flexible and can capture patterns, in the training data as well as in new data, that a straight line would miss.
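The effect of such a transformation can be shown with a small sketch. It assumes a target that is exactly quadratic and uses a hypothetical `fit_line` helper for one-feature least squares; the data and names are made up for illustration.

```python
def fit_line(zs, ys):
    """Ordinary least squares for y = a*z + b on a single feature z."""
    n = len(zs)
    mz, my = sum(zs) / n, sum(ys) / n
    szz = sum((z - mz) ** 2 for z in zs)
    szy = sum((z - mz) * (y - my) for z, y in zip(zs, ys))
    a = szy / szz
    return a, my - a * mz

def mse(model, zs, ys):
    """Mean squared error of the line (a, b) on the given points."""
    a, b = model
    return sum((a * z + b - y) ** 2 for z, y in zip(zs, ys)) / len(zs)

# Assumed toy data: the true relationship is y = x**2, with no noise.
xs = [i / 2 for i in range(-6, 7)]
ys = [x * x for x in xs]

# A straight line on the raw input underfits the curve.
linear_err = mse(fit_line(xs, ys), xs, ys)

# The same linear model on the transformed feature z = x**2 fits it well.
zs = [x * x for x in xs]
quad_err = mse(fit_line(zs, ys), zs, ys)

print(f"raw input MSE: {linear_err:.3f}, transformed input MSE: {quad_err:.3g}")
```

The model itself is still linear in its parameters; only the input has been made non-linear, which is often enough to remove underfitting of this kind.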
Both overfitting and underfitting should be reduced as far as possible. As ML expert Jason Brownlee puts it, a statistically “good fit” is what matters when it comes to choosing an ML model. This can only be achieved by repeatedly testing the model on different data and seeing where it falls between overfitting and underfitting.
Furthermore, before starting with an ML model to solve a problem, it is also worth taking a hard look at the data. After all, the type of data used may itself be a poor match for the model.