# Story of Gradient Boosting: How It Evolved Over the Years

Between October and December 2016, Kaggle organised a competition with over 3,000 participants, who competed to predict the severity of claims for the American insurance company Allstate. In 2017, Alexey Noskov finished second in the competition, and in a blog post on Kaggle he walked readers through his work. The primary models he employed were neural networks and XGBoost, a variant of Gradient Boosting Machines (GBM).

A boosting method in machine learning, gradient boosting combines several simple models of limited individual performance (known as weak models or weak learners) into a single composite model. In 1988, Michael Kearns described the goal of boosting as building ‘an efficient algorithm for converting relatively poor hypotheses into very good hypotheses’.


AdaBoost, short for Adaptive Boosting, was the first successful boosting ensemble algorithm. The most commonly used weak learners here are decision trees with a depth of one, known as decision stumps. AdaBoost works by weighting observations and training a set of weak learners sequentially: at each step, more weight is given to the samples that previous learners fitted worst. Eventually, the complete set of learners is combined into a single complex classifier.
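As a minimal sketch of this idea, scikit-learn's `AdaBoostClassifier` uses exactly such depth-1 stumps as its default weak learner (the toy dataset and parameter values below are illustrative, not from the article):

```python
# Sketch: AdaBoost combining 50 sequentially weighted decision stumps.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Toy binary classification data (illustrative only).
X, y = make_classification(n_samples=200, random_state=0)

# The default base estimator is a depth-1 decision tree (a "decision stump").
model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

print(model.score(X, y))  # accuracy of the combined classifier on the data
```

Each stump on its own is barely better than chance; the sequential re-weighting is what turns the collection into a strong classifier.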

In 1998, Leo Breiman formulated AdaBoost as gradient descent with a particular loss function. Taking this further, Jerome Friedman generalised boosting algorithms in 1999 and thus arrived at a new method: Gradient Boosting Machines.

Friedman detailed Gradient Boosting in his paper ‘Greedy Function Approximation: A Gradient Boosting Machine’.

Today, AdaBoost is regarded as a particular case of Gradient Boosting in terms of loss functions.

Gradient Boosting is a machine learning algorithm that combines gradient descent and boosting. It has three primary components: an additive model, a loss function, and a weak learner, and it differs from AdaBoost in a few ways.

As mentioned earlier, the first of these is the loss function. Boosting can utilise various loss functions: AdaBoost minimises the exponential loss, which makes the algorithm vulnerable to outliers, whereas Gradient Boosting allows any differentiable loss function to be used. This makes Gradient Boosting more robust than AdaBoost in the presence of outliers.

Secondly, Gradient Boosting fits each new model to the error left by the previous one. It does so by optimising the loss function through gradient descent.

But before that, we look at decision trees.

### Decision trees

Gradient Boosting typically uses short, less complex decision trees rather than decision stumps. Many such weak learners, the individual decision trees, together form one strong learner. The trees are connected in series, and each tree attempts to minimise the error of the previous tree (that is, the residuals). This makes boosting algorithms slow to learn but highly precise.

A loss function is used to determine the residuals at each step. For example, one could use Mean Squared Error for a regression task and Logarithmic Loss for a classification task.
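The connection between the loss function and the residuals can be made concrete: for squared error, the residuals that each new tree fits are exactly the negative gradient of the loss with respect to the current predictions. A small numpy check (the numbers are illustrative):

```python
# For L = 0.5 * (y_true - y_pred)**2, the gradient w.r.t. y_pred is
# (y_pred - y_true), so the negative gradient equals the residual.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5,  0.0, 2.0])   # current model's predictions

residuals = y_true - y_pred           # what the next tree would be fit on
neg_gradient = -(y_pred - y_true)     # negative gradient of the squared loss

print(np.allclose(residuals, neg_gradient))  # True
```

This is why the method is called *gradient* boosting: fitting trees to residuals is a special case of descending the gradient of a chosen differentiable loss.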

Let’s see how this would look mathematically:

The output y of a single decision tree fit to an input x is given by:

y = A1 + B1x + e1, where e1 is the residual term from this particular decision tree.

Gradient Boosting fits consecutive decision trees on the residual from previous ones. Keeping this in mind, the consecutive decision trees would be:

e1 = A2 + B2x + e2

e2 = A3 + B3x + e3

And so on. Assuming this specific Gradient Boosting model uses only three decision trees, the final model would be:

y = A1 + A2 + A3 + B1x + B2x + B3x + e3
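The three-stage scheme above can be sketched directly in code: each tree is fit to the residuals of the running prediction, and the final model is the sum of all three trees' outputs (the dataset and tree depth below are illustrative assumptions):

```python
# Sketch: three boosting stages, each tree fitting the previous residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)

prediction = np.zeros_like(y)
trees = []
for _ in range(3):                       # three boosting stages
    residual = y - prediction            # e1, e2, e3 in turn
    tree = DecisionTreeRegressor(max_depth=2).fit(x, residual)
    prediction += tree.predict(x)        # add this stage's correction
    trees.append(tree)

# The combined model is the sum of the three trees' outputs;
# the remaining training error is the final residual e3.
print(np.mean((y - prediction) ** 2))
```

Stopping at three trees keeps the sketch aligned with the equations; a practical model would use many more stages, usually scaled by a learning rate as discussed below.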

### Improving performance

Gradient Boosting is also prone to overfitting, which can decrease its performance. One way to address this is Stochastic Gradient Boosting, which sub-samples the training dataset and trains each individual learner on the sample. Doing so reduces the correlation between the individual learners' results, leading to a more accurate ensemble.

A second method to improve performance is a technique called shrinkage. Here, each tree's prediction is scaled down by a learning rate before being added, slowing the algorithm's learning process. However, since lower learning rates require more iterations, this comes at the cost of computation time.

A final method to improve performance is to place constraints on the trees. Adding too many decision trees contributes to overfitting, so the number of trees can be capped. Other parameters include tree depth, where shorter trees usually lead to better results, and the minimum number of observations per split, which limits how little training data a node may hold before it is split.
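All three levers, sub-sampling, shrinkage, and tree constraints, are exposed as parameters in scikit-learn's `GradientBoostingRegressor`; the parameter values below are illustrative, not tuned recommendations:

```python
# Sketch: the regularisation levers described above, on a toy dataset.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, noise=10.0, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=200,       # constraint: cap the number of trees
    max_depth=3,            # constraint: keep individual trees short
    min_samples_split=10,   # constraint: observations needed to split a node
    learning_rate=0.05,     # shrinkage: scale down each tree's contribution
    subsample=0.8,          # stochastic boosting: each tree sees 80% of rows
    random_state=0,
)
model.fit(X, y)
print(model.score(X, y))    # R^2 on the training data
```

In practice these parameters interact (a smaller learning rate usually calls for more estimators), so they are typically tuned jointly via cross-validation.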

It took more than ten years after GBM's introduction for it to become a vital component of data science. Since then, however, Gradient Boosting has grown increasingly popular. A primary reason is the use of GBM implementations, such as the Kaggle-popular XGBoost, in various machine learning competitions. XGBoost adds further tricks that make it faster and more accurate than traditional Gradient Boosting implementations. Jerome Friedman's 1999 innovation has clearly proved its usefulness, and one can hope it continues to evolve to fit future data science applications.
