Ensembling is a mechanism in machine learning that underlies a variety of powerful algorithms. In general, ensembling is a learning technique in which many models are combined to solve a problem. In deep learning, ensembles are useful where the problem is highly nonlinear and models built on a single architecture tend to perform poorly. In this article, we discuss what ensemble learning is and how it can be applied to deep learning using the AdaNet framework. The main points to be covered are listed below.
Table of Contents
- What is Ensemble Learning?
- Types of Ensemble Learning
- Ensembling Neural Nets using AdaNet
- Ensemble Mechanism
- Adaptive Search Mechanism
Let’s start the discussion by understanding what an ensemble is.
What is Ensemble Learning?
In machine learning, ensemble approaches combine multiple learners, often weak ones, to achieve better predictive performance than any of the constituent learning algorithms could achieve alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of a concrete, finite set of alternative models, but it typically allows a much more flexible structure to exist among those alternatives.
Supervised learning algorithms search a hypothesis space for a hypothesis that will make accurate predictions for a particular problem. Even when the hypothesis space contains hypotheses that are well suited to the problem, finding a good one can be difficult. Ensembles combine multiple hypotheses to form a new (hopefully better) hypothesis.
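To make the "hopefully better" part concrete, here is a minimal sketch of how a majority vote behaves under a strong simplifying assumption that is not from the article: every model is correct with the same probability and the models make their errors independently. The function name `majority_vote_accuracy` is purely illustrative.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes every model is right with probability p_correct and that errors
    are independent, which real ensembles only approximate.
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


# Each model is only slightly better than chance (60% accurate).
for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the vote becomes more reliable as models are added. In practice the errors of real models are correlated, so the gain is smaller, which is why ensemble methods work hard to make their members diverse.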
The term “ensemble” is usually reserved for approaches that generate several hypotheses using the same base learner. “Multiple classifier systems” is a broader term that also covers hybridization of hypotheses that are not induced by the same base learner.
Types of Ensemble Learning
While there is a nearly unlimited number of ways to combine models, three classes of ensemble learning techniques are most commonly discussed and used in practice: bagging, stacking, and boosting. Their popularity is due to their ease of use and their ability to handle a wide variety of predictive modelling problems. We’ll go over each one briefly now.
The first step in bagging (bootstrap aggregating) is to generate so-called bootstrapped datasets. Each bootstrapped set contains the same number of elements as the original training dataset, but the elements are drawn at random with replacement.
As a result, a given sample from the original training set may appear zero, one, or multiple times in a given bootstrapped set. The samples that do not appear in a particular bootstrapped set form its out-of-bag set, another byproduct of the process. Bootstrapping itself is a statistical technique for estimating a quantity from a small data sample: drawing several distinct bootstrap samples, estimating the quantity on each, and averaging the estimates can give a better overall estimate of the desired quantity.
In the same way, multiple training datasets can be prepared, a model can be fitted to each, and the resulting predictions can be combined. The combined predictions of many models are typically better than the predictions of a single model fitted directly to the training dataset.
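The bootstrapping step can be sketched in a few lines of standard-library Python. The helper names (`bootstrap_sample`, `bootstrap_mean`) and the toy data are assumptions made for illustration only.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) indices with replacement; also return the out-of-bag indices."""
    n = len(data)
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def bootstrap_mean(data, n_rounds=1000, seed=0):
    """Average the sample mean over many bootstrap samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_rounds):
        in_bag, _ = bootstrap_sample(data, rng)
        estimates.append(mean(data[i] for i in in_bag))
    return mean(estimates)


data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0, 14.0]  # toy sample
in_bag, out_of_bag = bootstrap_sample(data, random.Random(42))
print("in-bag indices:    ", in_bag)       # some indices repeat
print("out-of-bag indices:", out_of_bag)   # indices that were never drawn
print("bootstrap estimate of the mean:", bootstrap_mean(data))
```

The out-of-bag indices are exactly the samples that a model trained on this bootstrapped set never saw, so they can serve as a small validation set for that model.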
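A minimal bagging sketch, assuming a one-dimensional binary classification task and a decision stump (a single threshold) as the base learner. Both choices, and all function names, are illustrative rather than taken from any library.

```python
import random


def fit_stump(xs, ys):
    """Pick the threshold and polarity with the fewest training errors on 1-D data."""
    best = None
    for t in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= t else -polarity for x in xs]
            errors = sum(p != y for p, y in zip(preds, ys))
            if best is None or errors < best[0]:
                best = (errors, t, polarity)
    _, t, polarity = best
    return lambda x: polarity if x >= t else -polarity


def fit_bagging(xs, ys, n_models=25, seed=0):
    """Train one stump per bootstrapped dataset and combine them by majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        votes = sum(model(x) for model in models)  # each vote is +1 or -1
        return 1 if votes >= 0 else -1

    return predict


# Toy data: label is -1 below 10 and +1 from 10 upwards, with two noisy labels.
xs = [float(i) for i in range(20)]
ys = [-1 if x < 10 else 1 for x in xs]
ys[4], ys[14] = 1, -1

bagged = fit_bagging(xs, ys)
print([bagged(x) for x in (1.0, 9.0, 11.0, 18.0)])
```

An odd number of models is used so that the vote can never tie. Each stump sees a slightly different dataset, so the noisy labels affect individual stumps differently and tend to be outvoted.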
Stacking (also known as stacked generalization) trains a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained on the available data; then a combiner algorithm is trained to make the final prediction, using the predictions of the other algorithms as additional inputs.
With an arbitrary combiner algorithm, stacking can in theory represent any of the other ensemble approaches; in practice, however, the combiner is often a logistic regression model. Stacking typically performs better than any single one of the trained models. It has been used successfully on supervised learning tasks (regression, classification, and distance learning) as well as in unsupervised learning.
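The two-stage procedure can be sketched as follows. Everything specific here is an assumption for illustration: the toy task (label 1 when x1 + x2 > 1), the base learners (a least-squares line on a single feature each), and the hand-written logistic regression combiner. The combiner is trained on data the base learners did not see, a common precaution against leaking training labels.

```python
import math
import random


def make_data(n, rng):
    """Toy task: two features in [0, 1), label 1 when their sum exceeds 1."""
    xs = [(rng.random(), rng.random()) for _ in range(n)]
    ys = [1 if a + b > 1 else 0 for a, b in xs]
    return xs, ys


def fit_linear_on_feature(xs, ys, j):
    """Base learner: least-squares line y ~ a + b * x[j], using one feature only."""
    col = [x[j] for x in xs]
    mx, my = sum(col) / len(col), sum(ys) / len(ys)
    var = sum((c - mx) ** 2 for c in col)
    b = sum((c - mx) * (y - my) for c, y in zip(col, ys)) / var
    a = my - b * mx
    return lambda x: a + b * x[j]


def fit_logistic(zs, ys, lr=1.0, epochs=2000):
    """Combiner: logistic regression fitted by full-batch gradient descent."""
    w, bias, n = [0.0] * len(zs[0]), 0.0, len(zs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, y in zip(zs, ys):
            logit = sum(wi * zi for wi, zi in zip(w, z)) + bias
            p = 1.0 / (1.0 + math.exp(-logit))
            for i, zi in enumerate(z):
                grad_w[i] += (p - y) * zi
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        bias -= lr * grad_b / n
    return lambda z: 1 if sum(wi * zi for wi, zi in zip(w, z)) + bias > 0 else 0


def accuracy(predict, xs, ys):
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(ys)


rng = random.Random(0)
base_x, base_y = make_data(200, rng)   # trains the base learners
meta_x, meta_y = make_data(200, rng)   # trains the combiner
test_x, test_y = make_data(200, rng)   # held out for evaluation

base_models = [fit_linear_on_feature(base_x, base_y, j) for j in (0, 1)]
combiner = fit_logistic([[m(x) for m in base_models] for x in meta_x], meta_y)


def stacked(x):
    """Final prediction: the combiner applied to the base learners' outputs."""
    return combiner([m(x) for m in base_models])


base_acc = [accuracy(lambda x, m=m: 1 if m(x) > 0.5 else 0, test_x, test_y)
            for m in base_models]
stacked_acc = accuracy(stacked, test_x, test_y)
print("base accuracies:", base_acc, "stacked accuracy:", stacked_acc)
```

Each base learner sees only half of the information needed for this task, so neither can do well alone; the combiner recovers the decision boundary from their joint outputs.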
Boosting builds an ensemble incrementally by training each new model instance to emphasize the training examples that previous models misclassified. Boosting has been shown to be more accurate than bagging in some cases, but it is also more prone to overfitting the training data. Although some newer algorithms are reported to achieve better results, AdaBoost is by far the most common implementation of boosting.
In the first round of boosting, every training example is assigned the same weight (a uniform probability distribution). The data is then passed to a base learner (say L1). The examples misclassified by L1 are given a larger weight than the correctly classified ones, and the weights are renormalized so that they still form a probability distribution. This reweighted data is then passed to the second base learner (say L2), and so on. Finally, the learners’ outputs are combined by voting.
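The rounds described above can be sketched with a small AdaBoost-style loop. This is a simplified illustration that assumes labels in {-1, +1}, one-dimensional inputs, and decision stumps as base learners; the function names are hypothetical.

```python
import math


def stump_predict(x, t, polarity):
    return polarity if x >= t else -polarity


def fit_weighted_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best 1-D stump."""
    best = None
    for t in sorted(set(xs)) + [max(xs) + 1]:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def adaboost(xs, ys, n_rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n               # first round: uniform distribution
    ensemble = []                         # (alpha, threshold, polarity) per round
    for _ in range(n_rounds):
        err, t, polarity = fit_weighted_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this learner's vote strength
        ensemble.append((alpha, t, polarity))
        # Misclassified examples are scaled up, correct ones scaled down ...
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        # ... and the weights are renormalized to remain a distribution.
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all base learners."""
    score = sum(alpha * stump_predict(x, t, polarity)
                for alpha, t, polarity in ensemble)
    return 1 if score >= 0 else -1


# An interval pattern that no single stump can fit.
xs = list(range(10))
ys = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]
ensemble = adaboost(xs, ys)
print([predict(ensemble, x) for x in xs])
```

No single threshold separates this data, but three reweighted stumps voting together classify every training point correctly.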
Ensembling Neural Nets using AdaNet
AdaNet is a lightweight TensorFlow-based framework for automatically learning high-quality models with minimal expert intervention. It provides a general framework for learning not only a neural network architecture but also how to ensemble models to obtain even better results.
AdaNet is easy to use and produces high-quality models. It implements an adaptive algorithm for learning a neural architecture as an ensemble of subnetworks, which saves ML practitioners the time normally spent selecting optimal neural network architectures. AdaNet can build a diverse ensemble by adding subnetworks of different depths and widths, and it can trade off performance gains against the number of parameters.
AdaNet offers a unique adaptive computation graph that can be used to build models that add and remove operations and variables over time while retaining the optimizations and scalability of TensorFlow’s graph model. Users can use this adaptive graph to create progressively growing models (for example, boosting-style models), architecture search algorithms, and hyper-parameter tuning without having to manage an external for-loop.
Next, we discuss the two AdaNet mechanisms that are responsible for ensembling.
Ensemble Mechanism
Ensembles are first-class objects in AdaNet. Every model you train will be part of some form of ensemble. An ensemble is made up of one or more subnetworks whose outputs are combined by an ensembler (shown in the figure below).
Because ensembles are model-independent, a subnetwork can be as complex as a deep neural network or as simple as an if-statement. All that matters is that the ensembler can combine the subnetworks’ outputs to form a single prediction for a given input tensor.
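The idea can be sketched in plain Python. This is a conceptual illustration only and does not use the AdaNet API: the subnetworks are ordinary functions, the "ensembler" is a weighted sum, and the weights and coefficients are made-up values.

```python
from typing import Callable, Sequence

# In this sketch a subnetwork is anything that maps an input to a number.
Subnetwork = Callable[[Sequence[float]], float]


def rule_subnetwork(x: Sequence[float]) -> float:
    """A subnetwork that is as simple as an if-statement."""
    return 1.0 if x[0] > 0.5 else 0.0


def linear_subnetwork(x: Sequence[float]) -> float:
    """Stand-in for a trained model; the coefficients are hypothetical."""
    return 0.8 * x[0] + 0.2 * x[1]


def make_ensemble(subnetworks: Sequence[Subnetwork],
                  mixture_weights: Sequence[float]) -> Subnetwork:
    """Ensembler: combine the subnetwork outputs into a single prediction."""
    if len(subnetworks) != len(mixture_weights):
        raise ValueError("need exactly one weight per subnetwork")

    def predict(x: Sequence[float]) -> float:
        return sum(w * sub(x) for w, sub in zip(mixture_weights, subnetworks))

    return predict


ensemble = make_ensemble([rule_subnetwork, linear_subnetwork], [0.5, 0.5])
print(ensemble([0.9, 0.1]))
```

The ensembler never inspects how a subnetwork works; it only needs each one to return an output for the same input, which is what makes the ensemble model-independent.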
Adaptive Search Mechanism
To construct an ensemble of subnetworks, the AdaNet algorithm iteratively performs the following architecture search (see the animation shown below):
- Produces a set of candidate subnetworks.
- Trains the subnetworks in any way the user specifies.
- Evaluates the performance of the subnetworks as part of the ensemble, which is a one-network ensemble in the first iteration.
- Adds the subnetwork that improves the ensemble performance the most to the ensemble for the following iteration.
- Prunes the graph’s other subnetworks.
- Adapts the subnetwork search space based on the results of the current iteration.
- Moves on to the next iteration.
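The loop above can be imitated in a few dozen lines of plain Python. This sketch does not use AdaNet or TensorFlow and is not AdaNet's actual objective. It assumes a toy regression task, piecewise-constant "subnetworks" fitted to the current residuals, a candidate pool of two sizes per iteration, and a hypothetical `penalty` per bin that stands in for the trade-off between performance gain and model size.

```python
def fit_bins(xs, residuals, n_bins):
    """Train one candidate: a piecewise-constant fit of the residuals on [0, 1)."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, r in zip(xs, residuals):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += r
        counts[b] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]


def ensemble_predict(ensemble, x):
    return sum(sub(x) for sub in ensemble)


def mse(ensemble, xs, ys):
    return sum((ensemble_predict(ensemble, x) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)


def adaptive_search(xs, ys, n_iterations=4, penalty=1e-3):
    ensemble, n_bins, history = [], 1, []
    for _ in range(n_iterations):
        residuals = [y - ensemble_predict(ensemble, x) for x, y in zip(xs, ys)]
        # Generate candidates: the last winner's size and one twice as large.
        sizes = [n_bins, 2 * n_bins]
        # Train each candidate (here: on the current residuals).
        candidates = [(k, fit_bins(xs, residuals, k)) for k in sizes]
        # Evaluate each candidate as part of the ensemble, penalizing size.
        scored = [(mse(ensemble + [sub], xs, ys) + penalty * k, k, sub)
                  for k, sub in candidates]
        # Keep the best candidate; the others are discarded (pruned).
        loss, k, sub = min(scored, key=lambda item: item[0])
        ensemble.append(sub)
        # Adapt the search space: the next pool is built around the winner.
        n_bins = k
        history.append((k, loss))
    return ensemble, history


xs = [(i + 0.5) / 64 for i in range(64)]
ys = [x * x for x in xs]
ensemble, history = adaptive_search(xs, ys)
print("chosen sizes per iteration:", [k for k, _ in history])
print("final training MSE:", mse(ensemble, xs, ys))
```

Because the penalty grows with the candidate's size, the search keeps choosing larger candidates only while the reduction in error justifies the extra parameters.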
In this article, we discussed what ensemble learning is and looked at its most widely used types. Ensemble learning is normally associated with classical statistical ML algorithms, such as tree-based, linear, and probabilistic models. Deep ensemble learning models likewise combine the benefits of deep learning and ensemble learning, giving the final model better generalization performance. To see how this works in practice, we looked at a framework called AdaNet.