GPBoost is an approach and a software library aimed at combining tree-boosting with mixed-effects models and Gaussian Process (GP); hence the name ‘**GP **+ Tree-**Boost**ing’. It was introduced by Fabio Sigrist, a professor from Lucerne University of Applied Sciences and Arts in December 2020 (research paper).

Before going into the details of GPBoost, let us have an overview of the terms ‘Gaussian Process’, ‘Tree-Boosting’ and ‘Mixed-effects Models’ mean.

#### THE BELAMY

##### Sign up for your weekly dose of what's up in emerging technology.

**Gaussian Process:**

Gaussian process (GP) is a collection of some random variables such that each finite linear combination of those variables has a normal distribution. It is a probabilistic distribution over functions possible for resolving uncertainty in Machine Learning tasks such as regression and classification. Visit this page for a detailed description of GP.

**Tree-Boosting:**

Tree-boosting or boosting in decision trees refers to creation of an ensemble of decision trees for improving the accuracy of a single tree classifier or regressor. In tree-boosting, each of the trees in the collection is dependent on its prior trees. As the algorithm proceeds, it learns from the residual of the preceding trees.

**Mixed-effects Models:**

Mixed-effects models are statistical models which contain random effects (model parameters are random variables) and fixed effects (model parameters are fixed quantities). Read in detail about mixed-effects models here.

## Overview of GPBoost

Originally written in C++, the GPBoost library has a C-language API. Though it combines tree-boosting with GP and mixed-effects models, it also allows us to independently perform tree-boosting as well as using GP and mixed-effects models.

Tree-boosting and GP, two techniques are achieving state-of-the-art accuracy for making predictions have the following advantages, which can be combinedly leveraged using GPBoost.

**Pros of GP and mixed-effects models:**

- Enable making probabilistic predictions for quantifying uncertainties.
- Allows dependency modelling, i.e. finding a model which can describe dependencies among variables.

**Pros of tree-boosting:**

- Can handle missing values on its own while making predictions.
- Tree boosting provides scale-invariance (i.e. universality) for uniform transformations of the feature variables used for prediction.
- Can automatically model discontinuities, non-linearities and complex interactions.
- Robust to multicollinearity among variables used for prediction as well as outliers.

**GPBoost algorithm**

The label/response variable for GPBoost algorithm is assumed to be of the form:

`y = F(X) + Zb + xi `

…(i)

where,

X: covariates/ features/ predictor variables

F: non-linear mean function (predictor function)

Zb: random effects which can comprise Gaussian process, grouped random effects or a sum of both

xi: independent error term

Training the GPBoost algorithm refers to learning the hyperparameters (called covariance parameters) of the random effects and F(X) using an ensemble of decision trees. In simple terms, the GPBoost algorithm is a boosting algorithm that iteratively learns the hyperparameters using natural gradient descent or Nesterov accelerated gradient descent and adds a decision tree to the ensemble using Newton and/or gradient boosting. Trees then learn using the library LightGBM.

## Practical implementation

Here’s a demonstration of combining Tree-Boosting with GP models using GPBoost Python library. The code has been implemented using Google Colab with **Python 3.7.10, shap 0.39.0** and **gpboost 0.5.1** versions. Step-wise explanation of the code is as follows:

- Install GPBoost library

`!pip install gpboost`

- Install SHAP (
**SH**apely**A**dditive**P**lanations) library for explaining output of the GP model.

`!pip install shap`

- Import required libraries

import numpy as np import gpboost as gp import shap

- Define parameters for simulating a Gaussian process

sigma2_1 = 0.35 # marginal variance of GP """ range parameter which controls how fast the functions sampled from the Gaussian process oscillates """ rho = 0.1 sigma2 = 0.1 # error variance num_train = 200 # number of training samples # number of grid points on each axis for simulating the GP on a grid for visualization num_grid_pts = 50

- Define locations for training points (excluding upper-right rectangle).

#numpy.coulmn_stack() stacks 1D arrays as columns of a 2D array. coordinates = np.column_stack( (np.random.uniform(size=1)/2, np.random.uniform(size=1)/2)) “”” numpy.random.uniform() draws samples from a uniform distribution. size=1 means one sample will be drawn. “”” #While the number of coordinates is less than that of training samples while coordinates.shape[0] < num_train: #Draw 2 random samples from uniform distribution coordinate_i = np.random.uniform(size=2) #If atleast one of those 2 coordinates is less than 0.6 if not (coordinate_i[0] >= 0.6 and coordinate_i[1] >= 0.6): #stack the coordinates row-wise using numpy.vstack() coordinates = np.vstack((coordinates,coordinate_i))

- Define test points’ locations on a rectangular grid

“”” Initialize 2 arrays s1 and s2 (of number of grid points * number of grid points dimension) with ones “”” s1 = np.ones(num_grid_pts * num_grid_pts) s2 = np.ones(num_grid_pts * num_grid_pts) #Update the s1 and s2 arrays for i in range(num_grid_pts): for j in range(num_grid_pts): s1[j * num_grid_pts + i] = (i + 1) / num_grid_pts s2[i * num_grid_pts + j] = (i + 1) / num_grid_pts #Stack the arrays s1 and s2 as test coordinates coordinates_test = np.column_stack((s1, s2))

- Compute total number of data points as (number of grid points^2) + (number of training samples)

`num_total = num_grid_pts**2 + num_train`

Compute total number of grid coordinates

coordinates_total = np.vstack((coordinates_test,coordinates))

- Create distance matrix

#Initialize the matrix (of num_total * num_total dimension) with zeroes D = np.zeros((num_total, num_total)) #Update the distance matrix for i in range(0, num_total): for j in range(i + 1, num_total): D[i, j] = np.linalg.norm(coordinates_total[i, :] - coordinates_total[j, :]) D[j, i] = D[i, j]

- Compute noise standard deviation:

Sigma = sigma2_1 * np.exp(-D / rho) + np.diag(np.zeros(num_total) + 1e-10) C = np.linalg.cholesky(Sigma)

Compute random samples from a normal distribution (as many as total number of data

points) and perform its dot product with parameter C.

`b_total = C.dot(np.random.normal(size=num_total))`

Prepare training data GP

`b = b_total[(num_grid_pts*num_grid_pts):num_total] `

- Define mean function

Define features set X = np.random.rand(num_train, 2) Define non-linear mean function F(X) F_X = f1d(X[:, 0])

Compute independent error term

xi = np.sqrt(sigma2) * np.random.normal(size=num_train)

Compute the response variable (known as ‘label’) using the equation (i).

`y = F_X + b + xi`

- Prepare test data

“”” Select evenly spaced numbers (as many as square of number of grid points) in the range [0,1] “”” x = np.linspace(0,1,num_grid_pts**2) x[x==0.5] = 0.5 + 1e-10 #Test set features X_test = np.column_stack((x,np.zeros(num_grid_pts**2))) #Test set labels y_test = f1d(X_test[:, 0]) + b_total[0:(num_grid_pts**2)] + np.sqrt(sigma2) * np.random.normal(size=(num_grid_pts**2))

- Model training

# Create Gaussian process model gpmod = gb.GPModel(gp_coords=coordinates, cov_function="exponential") “”” cov_function denoted the covariance function for GP. ‘exponential’, ‘matern’, ‘gaussian’ and ‘powered_exponential’ are the possible values; ‘exponential’ being the default one “”” #Create dataset for GP using features set X and labels y train_data = gb.Dataset(X, y) #Define a dictionary for parameters of the GP parameters = { 'objective': 'regression_l2', 'learning_rate': 0.01, 'max_depth': 3, 'min_data_in_leaf': 10, 'num_leaves': 2**10, 'verbose': 0 } #Train the GP model with provided parameters model_train = gb.train(params=parameters, train_set=train_data, gp_model=gpmod, num_boost_round=247) #num_boost_round denotes the number of boosting iterations #Print the covariance parameters estimated by the GP model print("Estimated covariance parameters:") gpmod.summary()

**Output:**

Estimated covariance parameters: Covariance parameters ['Error_term', 'GP_var', 'GP_range'] [1.28340739e-268 2.90711171e-001 5.47936824e-002]

- Make predictions bypassing the GP features (i.e. prediction locations/coordinates here) and predictor variables for the tree ensemble to the predict() function. The function separately returns predictions for the tree ensemble and the GP. Add both of them to get a single point prediction.

prediction = model_train.predict(data=X_test, gp_coords_pred=coordinates_test, predict_var=True) “”” gp_coords_pred denotes the features for GP.predict_var=True means predictive variances will also be computed in addition to predictive mean “”” #Add the predictions for GP and tree ensemble y_pred = prediction['fixed_effect'] + prediction['random_effect_mean'] #Compute and print the mean-squared error print("Mean square error (MSE): " + str(np.mean((y_pred-y_test)**2)))

**Output:**

`Mean square error (MSE): 0.367071629709704`

- Interpret the trained model using SHAP library

shap_values = shap.TreeExplainer(model_train).shap_values(X) #Display the summary of gpmod model shap.summary_plot(shap_values, X)

**Output:**

#Display shap dependence plot shap.dependence_plot("Feature 0", shap_values, X)

**Output:**

- Trees used as base learners for boosting have several parameters. Here, we tune those parameters

# Create random effects model gp_mod = gb.GPModel(gp_coords=coordinates, cov_function="exponential") #Set parameters for estimating the GP covariance parameters gp_mod.set_optim_params(params={"optimizer_cov": "gradient_descent"}) #’optimizer_cov’ denotes optimizer to be used for estimation #Create dataset for GP train_set = gb.Dataset(X, y) # Define parameter grid parameter_grid = {'learning_rate': [0.1,0.05,0.01], 'min_data_in_leaf': [5,10,20,50], 'max_depth': [1,3,5,10,20]} # Other parameters to be tuned parameters = { 'objective': 'regression_l2', 'verbose': 0, 'num_leaves': 2**17 } “”” grid_search_tune_parameters() randomly chooses tuning parameters from the defined grid using cross-validation “”” opt_parameters = gb.grid_search_tune_parameters( param_grid=parameter_grid, params=parameters, num_try_random=20, """ grid search will be run 20 times, each time using a different combination of tuning parameters """ nfold=4, #value of ‘k’ for k-fold cross-validation gp_model=gp_mod, use_gp_model_for_validation=True, train_set=train_set, verbose_eval=1, num_boost_round=1000, #number of boosting iterations early_stopping_rounds=5, #number of early stopping iterations seed=1, metrics='l2') """ Best parameters’s combination will be returned in each of the 20 rounds. Display them. """ print("Best number of iterations: " + str(opt_parameters['best_iter'])) print("Best score: " + str(opt_parameters['best_score'])) print("Best parameters: " + str(opt_parameters['best_params']))

**Output:**

- Code source: GitHub
- Google colab notebook of the above implementation