 Implementing Bayesian Optimization On XGBoost: A Beginner’s Guide  Probability is an integral part of Machine Learning algorithms. We use it to predict the outcome of regression or classification problems. We apply what’s known as conditional probability or Bayes Theorem along with Gaussian Distribution to predict the probability of a class or a value, given a condition. The pair is also used in optimising hyperparameters for an ML model and the process is known as Bayesian Optimization.

In this article, we will learn to implement Bayesian Optimization to find optimal parameters for any machine learning model.

Bayesian Optimization Simplified

In one of our previous articles, we learned about Grid Search which is a popular parameter-tuning algorithm that selects the best parameter list from a given set of specified parameters. Considering the fact that we initially have no clue on what value to begin with for the parameters, it can only be as good as or slightly better than random search.

For computationally intensive tasks, grid search and random search can be painfully time-consuming with less luck of finding optimal parameters. These methods hardly rely on any information that the model learned during the previous optimizations. Bayesian Optimization on the other hand constantly learns from previous optimizations to find a best-optimized parameter list and also requires fewer samples to learn or derive the best values.

Two common terms that you will come across when reading any material on Bayesian optimization are :

1. Surrogate model and
2. Acquisition function.

The Gaussian process is a popular surrogate model for Bayesian Optimization. What it does is that it defines a prior function that can be used to learn from previous predictions or believes about the objective function.

Acquisition function, on the other hand, is responsible for predicting the sampling points in the search space. Acquisition function follows the exploration and exploitation principle. It is a function that allows the optimizer to exploit an optimal region until a better value is obtained. The goal is to maximise the acquisition function to determine the next sampling point. The terms exploration and exploitation might seem familiar to you if you have learned about Thompson Sampling or Upper Confidence Bound which revolves around the same principle. These are also used as acquisition functions.

A Shallow Dive Into The Math Behind It

The Optimization algorithm The Acquisition Function For a deeper understanding of the math behind Bayesian Optimization check out this link

Implementing Bayesian Optimization For XGBoost

Without further ado let’s perform a Hyperparameter tuning on XGBClassifier.

Given below is the parameter list of XGBClassifier with default values from it’s official documentation

(max_depth=3, learning_rate=0.1, n_estimators=100, verbosity=1, silent=None, objective=’binary:logistic’, booster=’gbtree’, n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None)

We will try to find optimal values for max_depth, learning_rate, n_estimators and gamma using Bayesian optimization and we will compare the effects on a model built with default parameters.

Installing Bayesian Optimization

On the terminal type and execute the following command :

pip install bayesian-optimization

If you are using the Anaconda distribution use the following command:

conda install -c conda-forge bayesian-optimization

Dataset: Here we will use XGBClassifier on a preprocessed subset of the training dataset from MachineHack’s Whose Line Is It Anyway: Identify The Author Hackathon.

First, we will use XGBClassifier with default parameters to later compare it with the result of tuned parameters.

XGBClassifier with Default Parameters

#Initializing an XGBClassifier with default parameters and fitting the training data
from xgboost import XGBClassifier
classifier1 = XGBClassifier().fit(text_tfidf, clean_data_train['author'])

• In the above code block, text_tfidf is the TF_IDF transformed texts of the training dataset.

#Predicting for training set
train_p1 = classifier1.predict(text_tfidf)

#Printing the classification report
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(train_p1, clean_data_train['author']))

Output: #Accuracy obtained on the training set
cm = confusion_matrix(train_p1, clean_data_train['author'])
acc = cm.diagonal().sum()/cm.sum()
print(acc)

Output:

0.8910786741845392

XGBClassifier With Tuned Parameters

#Importing necessary libraries
from bayes_opt import BayesianOptimization
import xgboost as xgb
from sklearn.metrics import mean_squared_error

#Converting the dataframe into XGBoost’s Dmatrix object
dtrain = xgb.DMatrix(text_tfidf, label=clean_data_val['author'])

#Bayesian Optimization function for xgboost
#specify the parameters you want to tune as keyword arguments
def bo_tune_xgb(max_depth, gamma, n_estimators ,learning_rate):
params = {'max_depth': int(max_depth),
'gamma': gamma,
'n_estimators': int(n_estimators),
'learning_rate':learning_rate,
'subsample': 0.8,
'eta': 0.1,
'eval_metric': 'rmse'}
#Cross validating with the specified parameters in 5 folds and 70 iterations
cv_result = xgb.cv(params, dtrain, num_boost_round=70, nfold=5)
#Return the negative RMSE
return -1.0 * cv_result['test-rmse-mean'].iloc[-1]

#Invoking the Bayesian Optimizer with the specified parameters to tune
xgb_bo = BayesianOptimization(bo_tune_xgb, {'max_depth': (3, 10),
'gamma': (0, 1),
'learning_rate':(0,1),
'n_estimators':(100,120)
})

#performing Bayesian optimization for 5 iterations with 8 steps of random exploration with an #acquisition function of expected improvement
xgb_bo.maximize(n_iter=5, init_points=8, acq='ei')

Output: Note:

The the process iterates for 8(init_points) + 5(n_iter) = 13 times

#Extracting the best parameters
params = xgb_bo.max['params']
print(params)

#Converting the max_depth and n_estimator values from float to int
params['max_depth']= int(params['max_depth'])
params['n_estimators']= int(params['n_estimators'])

#Initialize an XGBClassifier with the tuned parameters and fit the training data
from xgboost import XGBClassifier
classifier2 = XGBClassifier(**params).fit(text_tfidf, clean_data_train['author'])

#predicting for training set
train_p2 = classifier2.predict(text_tfidf)

#Looking at the classification report
print(classification_report(train_p2, clean_data_train['author']))

Output: #Attained prediction accuracy on the training set
cm = confusion_matrix(train_p2, clean_data_train['author'])
acc = cm.diagonal().sum()/cm.sum()
print(acc)

Output:

0.9990514833746114

We can see that the prediction for the training set is all exact which even though is practically overfitting, we can see the effect of the optimized parameters on the training set.

The accuracy of prediction with default parameters was around 89% which on tuning the hyperparameters with Bayesian Optimization yielded an impossible accuracy of almost 100%. Bayesian Optimization is a very effective strategy for tuning any ML model.

More Great AIM Stories

Is Depth In Neural Networks Always Preferable? This Research Says the Contrary A Computer Science Engineer turned Data Scientist who is passionate about AI and all related technologies. Contact: amal.nair@analyticsindiamag.com

OUR UPCOMING EVENTS

8th April | In-person Conference | Hotel Radisson Blue, Bangalore

Organized by Analytics India Magazine

View Event >>

30th Apr | Virtual conference

Organized by Analytics India Magazine

View Event >>

MORE FROM AIM  A Guide to Hyperparameter Optimization using HpBandSter

HpBandSter is a python package framework that can be used for distributed hyperparameter optimization. Using this framework we can perform a variety of cutting-edge hyperparameter algorithms including the HyperBand  All you need to know about Bayesian marketing mix modeling

Traditional Market Mix Models are not much eligible to equip the hard data with prior knowledge. The simple models are defined with the parameters which are independent of each other. Bayesian Market Mix Models can be eligible to deal with such hard data.  Neural Network Hyperparameter Tuning using Bayesian Optimization

The Bayesian statistics can be used for parameter tuning and also it can make the process faster especially in the case of neural networks. we can say performing Bayesian statistics is a process of optimization using which we can perform hyperparameter tuning.  Hands-On Guide To Algorithm Configuration Using SMAC

Many algorithms belong to the family of tree and ensemble, which are hard for computational problems and exposed to many hyperparameters that can be modified to improve the performance.  However, manually exploring those parameters and setting those for optimized solutions is a rigorous task and often leads to unsatisfactory results.  Guide to Bayesian Optimization Using BoTorch

BoTorch is a library built on top of PyTorch for Bayesian Optimization. It combines Monte-Carlo (MC) acquisition functions, a novel sample average approximation optimization approach, auto-differentiation, and variance reduction techniques.