 ###### Guide To GPBoost: A Library To Combine Tree-Boosting With Gaussian Process And Mixed-Effects Models # Guide To GPBoost: A Library To Combine Tree-Boosting With Gaussian Process And Mixed-Effects Models GPBoost is an approach and a software library aimed at combining tree-boosting with mixed-effects models and Gaussian Process (GP); hence the name ‘GP + Tree-Boosting’. It was introduced by Fabio Sigrist, a professor from Lucerne University of Applied Sciences and Arts in December 2020 (research paper).

Before going into the details of GPBoost, let us have an overview of the terms ‘Gaussian Process’, ‘Tree-Boosting’ and ‘Mixed-effects Models’ mean.

Gaussian Process:

Gaussian process (GP) is a collection of some random variables such that each finite linear combination of those variables has a normal distribution. It is a probabilistic distribution over functions possible for resolving uncertainty in Machine Learning tasks such as regression and classification. Visit this page for a detailed description of GP.

Tree-Boosting:

Tree-boosting or boosting in decision trees refers to creation of an ensemble of decision trees for improving the accuracy of a single tree classifier or regressor. In tree-boosting, each of the trees in the collection is dependent on its prior trees. As the algorithm proceeds, it learns from the residual of the preceding trees.

Mixed-effects Models:

Mixed-effects models are statistical models which contain random effects (model parameters are random variables) and fixed effects (model parameters are fixed quantities). Read in detail about mixed-effects models here.

## Overview of GPBoost

Originally written in C++, the GPBoost library has a C-language API. Though it combines tree-boosting with GP and mixed-effects models, it also allows us to independently perform tree-boosting as well as using GP and mixed-effects models.

Tree-boosting and GP, two techniques are achieving state-of-the-art accuracy for making predictions have the following advantages, which can be combinedly leveraged using GPBoost.

Pros of GP and mixed-effects models:

• Enable making probabilistic predictions for quantifying uncertainties.
• Allows dependency modelling, i.e. finding a model which can describe dependencies among variables.

Pros of tree-boosting:

• Can handle missing values on its own while making predictions.
• Tree boosting provides scale-invariance (i.e. universality) for uniform transformations of the feature variables used for prediction.
• Can automatically model discontinuities, non-linearities and complex interactions.
• Robust to multicollinearity among variables used for prediction as well as outliers.

GPBoost algorithm

The label/response variable for GPBoost algorithm is assumed to be of the form:

`y = F(X) + Zb + xi `              …(i)

where,

X: covariates/ features/ predictor variables

F: non-linear mean function (predictor function)

Zb: random effects which can comprise Gaussian process, grouped random effects or a sum of both

xi: independent error term

Training the GPBoost algorithm refers to learning the hyperparameters (called covariance parameters) of the random effects and F(X) using an ensemble of decision trees. In simple terms, the GPBoost algorithm is a boosting algorithm that iteratively learns the hyperparameters using natural gradient descent or Nesterov accelerated gradient descent and adds a decision tree to the ensemble using Newton and/or gradient boosting. Trees then learn using the library LightGBM.

## Practical implementation

Here’s a demonstration of combining Tree-Boosting with GP models using GPBoost Python library. The code has been implemented using Google Colab with Python 3.7.10, shap 0.39.0 and gpboost 0.5.1 versions. Step-wise explanation of the code is as follows:

1. Install GPBoost library

`!pip install gpboost`

1. Install SHAP (SHapely Additive exPlanations) library for explaining output of the GP model.

`!pip install shap`

1. Import required libraries
``` import numpy as np
import gpboost as gp
import shap ```
1. Define parameters for simulating a Gaussian process
``` sigma2_1 = 0.35  # marginal variance of GP
"""
range parameter which controls how fast the functions sampled from the
Gaussian process oscillates
"""
rho = 0.1
sigma2 = 0.1  # error variance
num_train = 200   # number of training samples
# number of grid points on each axis for simulating the GP on a grid for visualization
num_grid_pts = 50   ```
1. Define locations for training points (excluding upper-right rectangle).
``` #numpy.coulmn_stack() stacks 1D arrays as columns of a 2D array.
coordinates = np.column_stack( (np.random.uniform(size=1)/2,
np.random.uniform(size=1)/2))
“””
numpy.random.uniform() draws samples from a uniform distribution. size=1 means one sample will be drawn.
“””
#While the number of coordinates is less than that of training samples
while coordinates.shape < num_train:
#Draw 2 random samples from uniform distribution
coordinate_i = np.random.uniform(size=2)
#If atleast one of those 2 coordinates is less than 0.6
if not (coordinate_i >= 0.6 and coordinate_i >= 0.6):
#stack the coordinates row-wise using numpy.vstack()
coordinates = np.vstack((coordinates,coordinate_i)) ```
1. Define test points’ locations on a rectangular grid
``` “””
Initialize 2 arrays s1 and s2 (of number of grid points * number of grid points dimension) with ones
“””
s1 = np.ones(num_grid_pts * num_grid_pts)
s2 = np.ones(num_grid_pts * num_grid_pts)
#Update the s1 and s2 arrays
for i in range(num_grid_pts):
for j in range(num_grid_pts):
s1[j * num_grid_pts + i] = (i + 1) / num_grid_pts
s2[i * num_grid_pts + j] = (i + 1) / num_grid_pts
#Stack the arrays s1 and s2 as test coordinates
coordinates_test = np.column_stack((s1, s2)) ```
1. Compute total number of data points as (number of grid points^2) + (number of training samples)

`num_total = num_grid_pts**2 + num_train`

Compute total number of grid coordinates

`coordinates_total = np.vstack((coordinates_test,coordinates))`
1. Create distance matrix
``` #Initialize the matrix (of num_total * num_total dimension) with zeroes
D = np.zeros((num_total, num_total))
#Update the distance matrix
for i in range(0, num_total):
for j in range(i + 1, num_total):
D[i, j] = np.linalg.norm(coordinates_total[i, :] - coordinates_total[j, :])
D[j, i] = D[i, j] ```
1. Compute noise standard deviation:
``` Sigma = sigma2_1 * np.exp(-D / rho) + np.diag(np.zeros(num_total) + 1e-10)
C = np.linalg.cholesky(Sigma) ```

Compute random samples from a normal distribution (as many as total number of data

###### Guide To GPyTorch: A Python Library For Gaussian Process Models

points) and perform its dot product with parameter C.

`b_total = C.dot(np.random.normal(size=num_total))`

Prepare training data GP

`b = b_total[(num_grid_pts*num_grid_pts):num_total] `

1. Define mean function
``` Define features set
X = np.random.rand(num_train, 2)
Define non-linear mean function F(X)
F_X = f1d(X[:, 0])  ```

Compute independent error term

`xi = np.sqrt(sigma2) * np.random.normal(size=num_train)`

Compute the response variable (known as ‘label’) using the equation (i).

`y = F_X + b + xi`

1. Prepare test data
``` “””
Select evenly spaced numbers (as many as square of number of grid points) in the range [0,1]
“””
x = np.linspace(0,1,num_grid_pts**2)
x[x==0.5] = 0.5 + 1e-10
#Test set features
X_test = np.column_stack((x,np.zeros(num_grid_pts**2)))
#Test set labels
y_test = f1d(X_test[:, 0]) + b_total[0:(num_grid_pts**2)] + np.sqrt(sigma2) * np.random.normal(size=(num_grid_pts**2)) ```
1. Model training
``` # Create Gaussian process model
gpmod = gb.GPModel(gp_coords=coordinates, cov_function="exponential")
“””
cov_function denoted the covariance function for GP. ‘exponential’, ‘matern’, ‘gaussian’ and ‘powered_exponential’ are the possible values; ‘exponential’ being the default one
“””
#Create dataset for GP using features set X and labels y
train_data = gb.Dataset(X, y)
#Define a dictionary for parameters of the GP
parameters = { 'objective': 'regression_l2', 'learning_rate': 0.01,
'max_depth': 3, 'min_data_in_leaf': 10, 'num_leaves': 2**10, 'verbose': 0 }
#Train the GP model with provided parameters
model_train = gb.train(params=parameters, train_set=train_data,
gp_model=gpmod, num_boost_round=247)
#num_boost_round denotes the number of boosting iterations
#Print the covariance parameters estimated by the GP model
print("Estimated covariance parameters:")
gpmod.summary() ```

Output:

``` Estimated covariance parameters:
Covariance parameters
['Error_term', 'GP_var', 'GP_range']
[1.28340739e-268 2.90711171e-001 5.47936824e-002] ```
1. Make predictions bypassing the GP features (i.e. prediction locations/coordinates here) and predictor variables for the tree ensemble to the predict() function. The function separately returns predictions for the tree ensemble and the GP. Add both of them to get a single point prediction.
``` prediction = model_train.predict(data=X_test, gp_coords_pred=coordinates_test, predict_var=True)
“””
gp_coords_pred denotes the features for GP.predict_var=True means predictive variances will also be computed in addition to predictive mean
“””
#Add the predictions for GP and tree ensemble
y_pred = prediction['fixed_effect'] + prediction['random_effect_mean']
#Compute and print the mean-squared error
print("Mean square error (MSE): " + str(np.mean((y_pred-y_test)**2))) ```

Output:

`Mean square error (MSE): 0.367071629709704`

1. Interpret the trained model using SHAP library
``` shap_values = shap.TreeExplainer(model_train).shap_values(X)
#Display the summary of gpmod model
shap.summary_plot(shap_values, X) ```

Output:

``` #Display shap dependence plot
shap.dependence_plot("Feature 0", shap_values, X) ```

Output:

1. Trees used as base learners for boosting have several parameters. Here, we tune those parameters
``` # Create random effects model
gp_mod = gb.GPModel(gp_coords=coordinates, cov_function="exponential")
#Set parameters for estimating the GP covariance parameters
#’optimizer_cov’ denotes optimizer to be used for estimation
#Create dataset for GP
train_set = gb.Dataset(X, y)
# Define parameter grid
parameter_grid = {'learning_rate': [0.1,0.05,0.01], 'min_data_in_leaf': [5,10,20,50],
'max_depth': [1,3,5,10,20]}
# Other parameters to be tuned
parameters = { 'objective': 'regression_l2', 'verbose': 0, 'num_leaves': 2**17 }
“””
grid_search_tune_parameters() randomly chooses tuning parameters from the defined grid using cross-validation
“””
opt_parameters = gb.grid_search_tune_parameters(
param_grid=parameter_grid,
params=parameters,
num_try_random=20,
"""
grid search will be run 20 times, each time using a different combination of tuning parameters
"""
nfold=4,  #value of ‘k’ for k-fold cross-validation
gp_model=gp_mod,
use_gp_model_for_validation=True,
train_set=train_set,
verbose_eval=1,
num_boost_round=1000,  #number of boosting iterations
early_stopping_rounds=5,  #number of early stopping iterations
seed=1,
metrics='l2')
"""
Best parameters’s combination will be returned in each of the 20 rounds. Display them.
"""
print("Best number of iterations: " + str(opt_parameters['best_iter']))
print("Best score: " + str(opt_parameters['best_score']))
print("Best parameters: " + str(opt_parameters['best_params'])) ```

Output: