Between October and December 2016, Kaggle organised a competition with over 3,000 participants, competing to predict the loss value associated with American insurance company Allstate. In 2017, Google scholar Alexy Noskov won second position in the competition. In a blog post on Kaggle, Noskov walked readers through his work. The primary models he employed were neural networks and XGBoost, a variant of Gradient Boosting Machines (GBM).
A method of machine learning boosting, Gradient boosting, combines various simple models with limited performance levels (like weak models or weak learners) into a single composite one. In 1988, Michael Kearns said that the goal of boosting was to make ‘an efficient algorithm for converting relatively poor hypotheses into very good hypotheses’.
Two widely employed boosting algorithms are Adaptive Boosting, also known as AdaBoost, and Gradient Boosting.
The origin: AdaBoost
AdaBoost, which is short for Adaptive Boosting, was the first successful boosting ensemble algorithm. The most commonly used weak learners here are decision trees with the depth of one, due to which they are known as decision stumps. AdaBoost works by weighing observations and training a set of weak learners sequentially. Here, more weight is given to samples fitted worse in previous steps. Eventually, the complete set of learners are combined as a single complex classifier.
In 1998, Leo Breiman formulated AdaBoost as a gradient descent with a particular loss function. Taking this further, Jerome Friedman, in 1999, came up with the generalisation of boosting algorithms, and thus, a new method: Gradient Boosting Machines.
One can find Friedman’s paper detailing Gradient Boosting here.
Today, AdaBoost is regarded as a particular case of Gradient Boosting in terms of loss functions.
Gradient Boosting is a machine learning algorithm made up of Gradient descent and Boosting. Gradient Boosting has three primary components: additive model, loss function, and a weak learner; it differs from Adaboost in some ways.
As mentioned earlier, the first of these is in terms of the loss function. Boosting utilises various loss functions. AdaBoost minimises the exponential loss function that makes the algorithm vulnerable to outliers. Gradient Boosting, on the other hand, allows any differentiable loss function to be used. This makes Gradient Boosting more stable than AdaBoost upon meeting outliers.
Secondly, Gradient Boosting predicts the error left by the previous model. It is done through a loss function optimisation, conducted through gradient descent.
But before that, we look at decision trees.
Gradient Boosting typically uses short, less complex-decision trees instead of decision stumps. Here, many weak learners, which make up the individual decision trees, form one strong learner. Every tree is connected in series, and each particular tree attempts to minimise the error of the previous tree (that is, the residuals). Doing this makes boosting algorithms slow to learn but highly precise.
A loss function is used to determine the residuals from each step. For example, a user could use Mean Squared Error for a regression task and Logarithmic Loss for classification tasks.
Let’s see how this would look mathematically:
An output model y, when fit into a single decision tree, is given by:
y= A1+( B1+x)+e1, where e1 is the residual term from this particular decision tree.
Gradient Boosting fits consecutive decision trees on the residual from previous ones. Keeping this in mind, the consecutive decision trees would be:
e1= A2+( B2+x)+e2
e2= A3+( B3+x)+e3
And so on. Assuming this specific Gradient Boosting model only uses three decision trees, the final model of the decision tree would be:
Gradient Boosting is also prone to some overfitting, which can decrease its performance. One way to go about this is through Stochastic Gradient Boosting. It involves sub-sampling the training dataset and using this to train the individual learners. Doing so reduces the correlation between results from individual learners, leading to a more accurate result.
A second method to improve performance is by a technique called shrinkage. Here, the predictions of each tree are weighed to slow down the algorithm’s learning process instead of adding them together sequentially. However, since lower learning rates require more iterations, this comes at the cost of computational time.
A final method to improve performance is through placing constraints on trees. Adding excessive-decision trees further contributes to overfitting. Other parameters here include tree depth, where shorter trees usually lead to better results. The number of observations per split limits the amount of training data at a step before undergoing a split.
It took more than ten years after introducing GBM for it to become vital components of data science. Since then, however, Gradient Boosting is becoming increasingly popular. A primary reason for this is using GBM’s implementations, such as the Kaggle-popular XGBoost, in various machine learning competitions. XGBoost further uses tricks that make it faster and more accurate than traditional means of Gradient Boosting. Through this, one can easily see Jerome Friedman’s 1999 innovation’s usefulness and hope for it to evolve and fit future data science applications.