A machine learning model is an algorithm whose parameters are learned from a collection of data. The nature of the data and the target defines the problem, such as regression, classification or language generation. However well a model performs during training and evaluation, it is likely to struggle when it encounters unforeseen data, i.e. data that follows a different pattern. Since future data cannot be known in advance, one cannot account for every unforeseen pattern while developing a model. Many algorithms can be used to model a specific problem, and a particular algorithm can also be tuned with different hyperparameters to arrive at different models. Each model generalizes in its own way. In other words, each model can grasp certain key patterns that other models may miss. Thus, one model may fail on unforeseen data where another performs better, and when yet another new pattern arrives in the data, the former may outperform the latter. Therefore, one can never guarantee that a specific model or algorithm will perform well on every instance of a specific kind of problem.
In this scenario, how can one face the issue of unforeseen patterns in future data? Ensemble learning offers a robust answer. Ensemble learning is the process of combining more than one machine learning model for a task in a principled, mathematical way to obtain better performance. If a few different models are developed, each capturing a different expected pattern, their ensemble can handle a wider range of patterns than any one of them. As a result, an ensemble often outperforms each of the individual models it contains.
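To see why combining models can help, consider an idealized case that the discussion above does not state explicitly: several classifiers that are each correct with the same probability and make their errors independently. The sketch below (the function name `majority_vote_accuracy` is illustrative) computes the probability that a majority vote among them is correct. Real models are rarely fully independent, so this is an optimistic illustration rather than a guarantee.

```python
from math import comb


def majority_vote_accuracy(n_models, p):
    """Probability that more than half of n independent voters are correct.

    Assumes an odd n_models so that ties cannot occur, and that every model
    is correct with the same probability p, independently of the others.
    """
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Accuracy of the vote for growing ensembles of 60%-accurate models.
for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the vote becomes more reliable as models are added, as long as each model is better than chance.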
Different Ensemble Approaches
Ensemble methods are classified based on the mathematical way they are formulated. The most popular ensemble approaches are bagging, boosting and stacking.
Bagging is the general approach of training several individual models on random subsets of the training data and combining their individual predictions into a final prediction. Each model is trained independently of the others, so the models can be trained in parallel. Hence bagging is also called parallel ensemble learning. Each model is trained on a subset of the training data sampled with replacement, and the selection is completely random. A given training example may therefore be used to train none of the models, one model, or more than one model. This sampling approach is called bootstrap sampling, which leads to the name Bootstrap Aggregating, or in short, Bagging. Combining models trained on different bootstrap samples reduces the variance of the predictions, so bagging gives a robust ensemble when the individual models have high variance. It is customary to form the individual models from the same algorithm, possibly with different configurations. The predictions of the individual models are combined by majority vote (classification) or by averaging (regression) to arrive at the final prediction. Decision trees are well-known high-variance models and are therefore a popular choice for bagging. A randomized ensemble of decision trees, which also randomizes the features considered at each split, is called a Random Forest. Random Forests are well known for their performance in both classification and regression problems.
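The bagging procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the individual model is a one-feature threshold classifier (a decision stump) chosen for brevity, and the names `bootstrap_sample`, `fit_stump`, `bagging_fit` and `bagging_predict` are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples from data with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample):
    """Fit a one-feature threshold classifier (a decision stump).

    Each example is (x, label) with label 0 or 1. The stump predicts 1 when
    x > threshold; the threshold is chosen to minimise training errors.
    """
    candidates = sorted({x for x, _ in sample})
    best_threshold, best_errors = candidates[0], len(sample) + 1
    for t in candidates:
        errors = sum((x > t) != bool(y) for x, y in sample)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold


def bagging_fit(data, n_models, seed=0):
    """Train each stump independently on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(thresholds, x):
    """Aggregate the individual predictions by majority vote."""
    votes = [int(x > t) for t in thresholds]
    return Counter(votes).most_common(1)[0][0]


# Toy data: larger x values tend to be class 1, with two noisy examples.
data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
models = bagging_fit(data, n_models=25)
print(bagging_predict(models, 2), bagging_predict(models, 8))
```

An odd number of models is used so that a two-class vote cannot tie. For regression, the vote would be replaced by an average of the individual predictions.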
Boosting is a more systematic approach than bagging. Boosting is a sequential ensemble learning approach: a first model is trained on the training data, a second model is then trained to reduce the error made by the first, and each successor is trained to reduce the error left by its predecessors. Thus each individual model depends on the models before it. Rather than splitting the data into disjoint subsets, the widely used boosting algorithms show every model the training data (or a random sample of it) and change what the model focuses on: AdaBoost increases the weights of examples that were previously misclassified, while gradient boosting fits each new model to the residual errors of the current ensemble. When a random sample is used, it should be representative of the overall training data, since every step is meant to reduce the remaining error. The individual models may be formed from the same algorithm or from different algorithms. However, the individual models should be weak enough not to overfit, so that the ensemble converges to a good solution by the end of training. Boosting is a good approach when the individual models have high bias, i.e. when they underfit. Although boosting is preferred for reducing bias, it can also help mitigate the effects of variance. Gradient Boosting, AdaBoost and Extreme Gradient Boosting (XGBoost) are the most popular boosting algorithms in use.
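The sequential idea, where each new model is trained to reduce the error left by its predecessors, can be sketched as follows. This is one possible reading rather than a definitive algorithm: it assumes a squared-error regression task, trains every weak model on all examples, and fits each one to the current residuals (the gradient boosting flavour). The names `fit_stump`, `boost` and `mse`, and the `learning_rate` parameter, are illustrative.

```python
def fit_stump(xs, targets):
    """Fit a regression stump: one threshold, one constant on each side."""
    best = None
    for t in sorted(set(xs))[:-1]:  # a split must leave points on both sides
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * stump(x) for x, p in zip(xs, predictions)
        ]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data that a single stump cannot fit well.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.0, 0.8, 0.9, 0.1, -0.8, -1.0, -0.3, 0.7]

one_round = boost(xs, ys, n_rounds=1)
many_rounds = boost(xs, ys, n_rounds=20)
print(mse(one_round, xs, ys), mse(many_rounds, xs, ys))
```

Each stump on its own is a weak, high-bias model. The training error shrinks only because later stumps correct what earlier ones left behind.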
Stacking is an ensemble learning approach that stands between bagging and boosting. It uses a set of individual models, each trained independently on the entire training dataset, and a final model (often called a meta-model) that is trained on the outputs of the individual models to predict the final output. Thus, one final model learns how to combine the outputs of the independent models. In contrast to bagging, which trains its individual models on bootstrap samples, the individual models in stacking are trained on all the training data. This can be useful when the data is limited.
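A minimal stacking sketch, under several assumptions, is shown below. The two individual models (a least-squares line and a nearest-neighbour predictor) and the meta-model (a single learned blending weight) are stand-ins chosen for brevity, and all function names are hypothetical. The meta-model is fitted on a small held-out set here, which is one common way to keep it from simply trusting whichever individual model memorised the training data.

```python
def fit_linear(points):
    """Least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    return lambda x: mean_y + slope * (x - mean_x)


def fit_nearest_neighbour(points):
    """Predict the y value of the closest training x."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def fit_meta(base_models, holdout):
    """Learn w so that w * model_a + (1 - w) * model_b fits the held-out set.

    This is the closed-form least-squares solution for a single weight.
    """
    model_a, model_b = base_models
    num = den = 0.0
    for x, y in holdout:
        a, b = model_a(x), model_b(x)
        num += (a - b) * (y - b)
        den += (a - b) ** 2
    w = num / den if den else 0.5
    return lambda x: w * model_a(x) + (1 - w) * model_b(x)


def mse(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


train = [(0, 0.1), (1, 1.2), (2, 1.8), (3, 3.3), (4, 3.9), (5, 5.2)]
holdout = [(0.5, 0.4), (1.5, 1.6), (2.5, 2.4), (3.5, 3.6), (4.5, 4.4)]

# Every individual model sees the whole training set.
base_models = (fit_linear(train), fit_nearest_neighbour(train))
# The final model is trained on the individual models' outputs.
stacked = fit_meta(base_models, holdout)

print([round(mse(m, holdout), 4) for m in base_models], round(mse(stacked, holdout), 4))
```

Because a weight of 0 or 1 reproduces either individual model, the blended model can never do worse than the better of the two on the data used to fit the weight. Performance on new data still has to be checked separately.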
Ensemble Methods in Practical Usage
Decisions in finance, medical diagnosis, e-commerce, weather forecasting, space science, remote sensing, computer security and anomaly detection are very hard to make because the data is noisy, models fitted to it tend to suffer from high bias and/or high variance, and patterns change over time. Ensemble methods help build models that generalize better on such data and yield better results. Boosting and bagging are the most commonly preferred approaches in such scenarios.
Bagging vs Boosting
Debates on the best ensemble method come up often. Tree-based models are flexible but have high variance, which makes them good candidates for ensembling. Bagging and boosting methods that use tree-based individual models have become quite popular in numerous applications. The most widely preferred bagging method with tree-based individual models is Random Forest, and the most widely preferred boosting method with tree-based individual models is Extreme Gradient Boosting (XGBoost). Both methods can be applied to regression problems and classification problems. However, the right choice depends on the training data at hand.
For regression: RandomForestRegressor (scikit-learn), XGBRegressor (XGBoost)
For classification: RandomForestClassifier (scikit-learn), XGBClassifier (XGBoost)
One of the earliest and most influential boosting methods is Adaptive Boosting, commonly known as AdaBoost. The ideas behind AdaBoost helped shape numerous later boosting methods, including Extreme Gradient Boosting. Like most other boosting methods, AdaBoost supports both regression and classification problems.
By default, AdaBoost employs decision trees (classifier or regressor) as its individual models. However, users can opt for any suitable algorithm as the individual model.
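To make the adaptive part concrete, here is a minimal sketch of the classic two-class AdaBoost weight update in plain Python. It assumes labels of +1 and -1 and uses depth-one decision trees (stumps) on a single feature as the individual models. The helper names are hypothetical and this is not any library's implementation.

```python
import math


def fit_weighted_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for t in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x > t else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n  # start with uniform example weights
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(n_rounds):
        err, t, polarity = fit_weighted_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this stump's say in the vote
        ensemble.append((alpha, t, polarity))
        # Raise the weight of misclassified examples, lower the rest.
        weights = [
            w * math.exp(-alpha * y * (polarity if x > t else -polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (p if x > t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


def accuracy(ensemble, xs, ys):
    return sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)


# No single threshold separates these labels.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]

single = adaboost(xs, ys, n_rounds=1)
boosted = adaboost(xs, ys, n_rounds=3)
print(accuracy(single, xs, ys), accuracy(boosted, xs, ys))
```

The examples a stump gets wrong carry more weight in the next round, so the following stump is pushed toward the region its predecessor missed.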
How to Use Different Algorithms in an Ensemble?
We have seen that ensemble methods such as Random Forest, AdaBoost and Extreme Gradient Boosting employ decision trees as individual models by default. But we may wish to try different algorithms together and let them decide jointly. Voting is a suitable ensemble approach: like bagging, it trains its individual models independently, and it selects the final output by majority voting. For instance, we may build several models for a classification task, such as logistic regression models, decision tree classifiers, support vector machines, a K-nearest neighbours classifier and a Naive Bayes classifier. Each classifier is independent of the others in training and in making predictions. The final class is determined by majority voting among the individual results. This approach is generally called hybrid ensemble learning.
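Majority voting across different kinds of classifiers can be sketched as below. To keep the example self-contained, three very simple classifiers stand in for the algorithms named above: a nearest-centroid rule, a K-nearest neighbours rule and a one-feature threshold rule. All function names are illustrative, and the threshold rule assumes exactly two classes.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def fit_nearest_centroid(points, labels):
    """Predict the class whose mean point is closest."""
    grouped = {}
    for point, label in zip(points, labels):
        grouped.setdefault(label, []).append(point)
    centroids = {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in grouped.items()
    }
    return lambda p: min(centroids, key=lambda c: squared_distance(p, centroids[c]))


def fit_k_nearest(points, labels, k=3):
    """Predict the most common class among the k closest training points."""
    def predict(p):
        order = sorted(
            range(len(points)), key=lambda i: squared_distance(p, points[i])
        )
        return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]
    return predict


def fit_threshold_rule(points, labels):
    """Split on the first feature, halfway between the two class means."""
    means = {}
    for label in set(labels):
        values = [p[0] for p, y in zip(points, labels) if y == label]
        means[label] = sum(values) / len(values)
    low, high = sorted(means, key=means.get)
    threshold = (means[low] + means[high]) / 2
    return lambda p: high if p[0] > threshold else low


def hard_vote(classifiers, p):
    """Return the class predicted by most classifiers."""
    votes = [clf(p) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]


points = [(1, 1), (1, 2), (2, 1), (2, 2), (5, 1), (6, 6), (6, 7), (7, 6), (7, 7)]
labels = ["a"] * 5 + ["b"] * 4

# Each classifier is trained independently on the same data.
classifiers = [
    fit_nearest_centroid(points, labels),
    fit_k_nearest(points, labels, k=3),
    fit_threshold_rule(points, labels),
]

query = (5, 2)
print([clf(query) for clf in classifiers], hard_vote(classifiers, query))
```

On the query point the threshold rule disagrees with the other two classifiers, and the vote follows the majority. Using an odd number of classifiers avoids ties in a two-class problem.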
Handling Imbalanced Data via Ensemble Learning
Class imbalance is a major problem in classification applications. When most of the data belongs to one class and only a little belongs to another, a classifier struggles to generalize across classes, because it sees plenty of majority-class data and learns that class well, but fails to pick up the patterns of the minority class. An ensemble is a collection of weak classifiers, none of which locks onto the majority-class pattern the way a strong individual classifier does. Thus a strong individual classifier can perform poorly on imbalanced classes, while an ensemble of weak classifiers often performs much better.
Ensemble of Deep Neural Networks
The use of ensemble methods began with regression and classification problems solved with classical machine learning models. However, with today's improved compute power, ensemble methods have become a gift to deep learning as well. A common complaint about deep learning architectures is their lack of explainability: it is hard to understand and explain what exactly happens in the hidden layers. Without a clear understanding, it becomes difficult to tune hyperparameters or alter a deep learning model. But an ensemble of deep learning models with average performance (i.e., not overfitting) gives better results in many cases than a carefully tuned single model.