Understanding XGBoost Algorithm In Detail

In this article, I’ll be discussing how XGBoost works internally to make decision trees and deduce predictions.
XGBoost is All You Need

XGBoost, or eXtreme Gradient Boosting, is a well-known gradient boosting (ensemble) technique that offers enhanced performance and speed among tree-based (sequential decision tree) machine learning algorithms. XGBoost was created by Tianqi Chen and initially maintained by the Distributed (Deep) Machine Learning Community (DMLC) group. It is one of the most common algorithms used in applied machine learning competitions and has gained popularity through winning solutions on structured and tabular data. It is open-source software. Initially only Python and R packages were built for XGBoost, but it has since been extended to Java, Scala, Julia and other languages as well.

To understand XGBoost, a clear understanding of decision trees and ensemble learning algorithms is needed first.

Differences between tree-based techniques:

XGBoost falls under the category of boosting techniques in ensemble learning. Ensemble learning combines multiple models (predictors) to achieve better prediction accuracy. In the boosting technique, each successive model tries to correct the errors made by the previous models, typically by giving more weight to the poorly predicted examples.

Basic Boosting Architecture:

Unlike other boosting algorithms, where the weights of misclassified observations are increased, gradient boosting algorithms optimise a loss function directly. XGBoost is an advanced implementation of gradient boosting along with some regularisation factors.


Features of XGBoost:

  1. Can be run on both single and distributed systems (Hadoop, Spark).
  2. XGBoost is used in supervised learning (regression and classification problems).
  3. Supports parallel processing.
  4. Cache optimization.
  5. Efficient memory management for large datasets exceeding RAM.
  6. Has a variety of regularizations which help in reducing overfitting.
  7. Auto tree pruning – the decision tree will not grow beyond certain internal limits.
  8. Can handle missing values.
  9. Has inbuilt cross-validation.
  10. Takes care of outliers to some extent.
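
To make this concrete, below is a minimal sketch of training XGBoost through its scikit-learn-style API, assuming the xgboost Python package is installed. The toy data and parameter values are purely illustrative (they mirror the IQ example used later in this article).

```python
# A minimal, illustrative sketch of XGBoost regression via its scikit-learn-style API.
# Assumes the xgboost package is installed; data and parameter values are toy examples.
import numpy as np
from xgboost import XGBRegressor

# Toy data: predict a child's IQ from age (same numbers as the worked example below).
X = np.array([[10], [15], [16]])
y = np.array([20, 34, 38])

model = XGBRegressor(
    n_estimators=10,     # number of boosting rounds (sequential trees)
    learning_rate=0.3,   # eta: shrinkage applied to each tree's output
    reg_lambda=1.0,      # lambda: L2 regularization on leaf weights
    gamma=0.0,           # minimum gain required to keep a split (pruning)
    max_depth=2,
)
model.fit(X, y)
print(model.predict(X))
```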

XGBoost Algorithm

Let’s look at how XGBoost works with an example. Here I’ll try to predict a child’s IQ based on age. As a basic initial assumption for such data, we can take the average IQ as the prediction and measure how much variance (loss) is present. Residuals are the losses incurred, and they are recalculated after each model’s predictions.

CHILD'S AGE    CHILD'S IQ    RESIDUALS
10             20            -10
15             34            4
16             38            8

The average of 20, 34 and 38 is 30.67; for simplicity, let’s take it as 30. If we plot a graph with IQ on the y-axis and age on the x-axis, we can see how far the points vary from the average mark.
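
As a quick check of this arithmetic, here is a short sketch in plain Python/NumPy that computes the base prediction and the residuals from the table above.

```python
import numpy as np

iq = np.array([20, 34, 38])
base_prediction = iq.mean()   # 30.67, rounded down to 30 in the text
residuals = iq - 30           # array([-10, 4, 8])
print(base_prediction, residuals)
```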

At first, our base model (M0) will give a prediction of 30. From the graph, we know this model suffers a loss, which the next model (M1) will try to reduce. Model M1 will take age (the independent feature) as input and the loss suffered (the residuals) by M0 as its target. Up to this point, it is the same as the gradient boosting technique.

For XGBoost, some new terms are introduced:

λ (lambda) -> regularization parameter
γ (gamma) -> controls automatic tree pruning
eta -> the learning rate, which controls how quickly the model converges
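
For reference, these symbols correspond to booster parameters of the xgboost Python package; a sketch of how they might be passed to the native training API is shown below (the values are only illustrative).

```python
# How the symbols above map onto xgboost booster parameters
# (as passed to xgboost.train); values here are illustrative.
params = {
    "lambda": 1.0,  # L2 regularization on leaf weights (alias: reg_lambda)
    "gamma": 0.0,   # minimum loss reduction (gain) required to keep a split
    "eta": 0.3,     # learning rate / shrinkage applied to each tree's output
}
```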

Now we calculate the similarity score:

Similarity Score (S.S.) = (S.R.)^2 / (N + λ)
Here, S.R. is the sum of the residuals,
and N is the number of residuals.

To start, let’s put λ = 0. Then Similarity Score = (-10 + 4 + 8)^2 / (3 + 0) = 4/3 ≈ 1.33.
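
To verify this number, here is a small plain-Python sketch; the similarity_score helper is my own illustrative function, not part of the xgboost API.

```python
def similarity_score(residuals, lam=0.0):
    # (sum of residuals)^2 / (number of residuals + lambda)
    return sum(residuals) ** 2 / (len(residuals) + lam)

print(similarity_score([-10, 4, 8]))  # 4 / 3 = 1.33
```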

Let’s build the decision tree using these residuals and similarity scores. I’ve set the tree-splitting criterion as Age > 10.

Again, for the two resulting leaves we calculate the similarity scores: (-10)^2 / 1 = 100 for the leaf with Age = 10, and (4 + 8)^2 / 2 = 72 for the leaf with Ages 15 and 16.

After this, we calculate the gain:

Gain = (sum of the S.S. of the leaves after the split) – (S.S. of the root before the split)

Gain = (100 + 72) – 1.33 ≈ 170.67

Now we set our γ, a value provided to the model at the start and used during splitting. If Gain > γ, the split happens; otherwise it does not. Let’s assume that γ for this problem is 130. Since the gain (about 170.67) is greater than γ, the split is kept. This is how automatic tree pruning is achieved: the greater the γ value, the more pruning is done.
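
The whole split decision can be checked with the same illustrative helper as before:

```python
def similarity_score(residuals, lam=0.0):
    # (sum of residuals)^2 / (number of residuals + lambda)
    return sum(residuals) ** 2 / (len(residuals) + lam)

root = similarity_score([-10, 4, 8])   # 1.33 (before the split)
left = similarity_score([-10])         # 100.0, leaf with Age = 10
right = similarity_score([4, 8])       # 72.0, leaf with Ages 15 and 16

gain = (left + right) - root           # about 170.67
gamma = 130
print(gain > gamma)                    # True, so the split is kept (no pruning)
```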

For regularisation and to prevent overfitting, we can increase λ, which was initially set to 0. But this should be done carefully: the greater the λ value, the smaller the similarity scores, the smaller the gain, and the more pruning occurs.

New prediction = Previous Prediction + Learning rate * Output

XGBoost calls the learning rate eta, and its default value is 0.3.

For the 2nd reading (Age = 15), the new prediction = 30 + (0.3 * 6) = 31.8.

The output of 6 is the average of the residuals 4 and 8 in that leaf.

New Residual = 34 – 31.8 = 2.2

Age    IQ    Residual
10     20    -7
15     34    2.2
16     38    6.2
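
Here is a short sketch that reproduces this table, assuming λ = 0 so that each leaf's output is simply the average of its residuals (-10 for the leaf with Age = 10, and 6 for the leaf shared by Ages 15 and 16).

```python
eta = 0.3            # learning rate
base = 30.0          # prediction of the base model M0
leaf_output = {10: -10.0, 15: 6.0, 16: 6.0}   # average residual of each example's leaf
iq = {10: 20, 15: 34, 16: 38}

for age, actual in iq.items():
    new_prediction = base + eta * leaf_output[age]
    new_residual = actual - new_prediction
    print(age, new_prediction, round(new_residual, 1))
# 10 27.0 -7.0
# 15 31.8 2.2
# 16 31.8 6.2
```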

In this way, model M1 is trained and the residuals keep decreasing, which means the loss is optimised further with each successive model.

Conclusion

XGBoost has proven to be a highly efficient, scalable tree boosting method. It has shown outstanding results across different use cases such as motion detection, stock sales prediction, malware classification, customer behaviour analysis and many more. The system runs considerably faster on a single machine than many other implementations thanks to efficient data and memory handling, and its optimisation techniques improve performance while using fewer resources.

Jayita Bhattacharyya
Machine learning and data science enthusiast. Eager to learn new technology advances. A self-taught techie who loves to do cool stuff using technology for fun and worthwhile.
