Guide To GPBoost: A Library To Combine Tree-Boosting With Gaussian Process And Mixed-Effects Models

GPBoost is an approach and a software library aimed at combining tree-boosting with mixed-effects models and Gaussian Process (GP); hence the name ‘GP + Tree-Boosting’.

Share

Published on April 11, 2021

by Nikita Shiledarbaxi

GPBoost is an approach and a software library aimed at combining tree-boosting with mixed-effects models and Gaussian Process (GP); hence the name ‘GP + Tree-Boosting’. It was introduced by Fabio Sigrist, a professor from Lucerne University of Applied Sciences and Arts in December 2020 (research paper).

Before going into the details of GPBoost, let us have an overview of the terms ‘Gaussian Process’, ‘Tree-Boosting’ and ‘Mixed-effects Models’ mean.

Gaussian Process:

Gaussian process (GP) is a collection of some random variables such that each finite linear combination of those variables has a normal distribution. It is a probabilistic distribution over functions possible for resolving uncertainty in Machine Learning tasks such as regression and classification. Visit this page for a detailed description of GP.

Tree-Boosting:

Tree-boosting or boosting in decision trees refers to creation of an ensemble of decision trees for improving the accuracy of a single tree classifier or regressor. In tree-boosting, each of the trees in the collection is dependent on its prior trees. As the algorithm proceeds, it learns from the residual of the preceding trees.

Mixed-effects Models:

Mixed-effects models are statistical models which contain random effects (model parameters are random variables) and fixed effects (model parameters are fixed quantities). Read in detail about mixed-effects models here.

Overview of GPBoost

Originally written in C++, the GPBoost library has a C-language API. Though it combines tree-boosting with GP and mixed-effects models, it also allows us to independently perform tree-boosting as well as using GP and mixed-effects models.

Tree-boosting and GP, two techniques are achieving state-of-the-art accuracy for making predictions have the following advantages, which can be combinedly leveraged using GPBoost.

Pros of GP and mixed-effects models:

Enable making probabilistic predictions for quantifying uncertainties.
Allows dependency modelling, i.e. finding a model which can describe dependencies among variables.

Pros of tree-boosting:

Can handle missing values on its own while making predictions.
Tree boosting provides scale-invariance (i.e. universality) for uniform transformations of the feature variables used for prediction.
Can automatically model discontinuities, non-linearities and complex interactions.
Robust to multicollinearity among variables used for prediction as well as outliers.

GPBoost algorithm

The label/response variable for GPBoost algorithm is assumed to be of the form:

y = F(X) + Zb + xi …(i)

where,

X: covariates/ features/ predictor variables

F: non-linear mean function (predictor function)

Zb: random effects which can comprise Gaussian process, grouped random effects or a sum of both

xi: independent error term

Training the GPBoost algorithm refers to learning the hyperparameters (called covariance parameters) of the random effects and F(X) using an ensemble of decision trees. In simple terms, the GPBoost algorithm is a boosting algorithm that iteratively learns the hyperparameters using natural gradient descent or Nesterov accelerated gradient descent and adds a decision tree to the ensemble using Newton and/or gradient boosting. Trees then learn using the library LightGBM.

Practical implementation

Here’s a demonstration of combining Tree-Boosting with GP models using GPBoost Python library. The code has been implemented using Google Colab with Python 3.7.10, shap 0.39.0 and gpboost 0.5.1 versions. Step-wise explanation of the code is as follows:

Install GPBoost library

!pip install gpboost

Install SHAP (SHapely Additive exPlanations) library for explaining output of the GP model.

!pip install shap

Import required libraries

 import numpy as np
 import gpboost as gp
 import shap

Define parameters for simulating a Gaussian process

 sigma2_1 = 0.35  # marginal variance of GP
 """
 range parameter which controls how fast the functions sampled from the 
 Gaussian process oscillates
 """
 rho = 0.1      
 sigma2 = 0.1  # error variance
 num_train = 200   # number of training samples
 # number of grid points on each axis for simulating the GP on a grid for visualization
 num_grid_pts = 50

Define locations for training points (excluding upper-right rectangle).

 #numpy.coulmn_stack() stacks 1D arrays as columns of a 2D array.
 coordinates = np.column_stack( (np.random.uniform(size=1)/2, 
 np.random.uniform(size=1)/2))
 “””
 numpy.random.uniform() draws samples from a uniform distribution. size=1 means one sample will be drawn.
 “””
 #While the number of coordinates is less than that of training samples
 while coordinates.shape[0] < num_train:
    #Draw 2 random samples from uniform distribution
     coordinate_i = np.random.uniform(size=2)
    #If atleast one of those 2 coordinates is less than 0.6
     if not (coordinate_i[0] >= 0.6 and coordinate_i[1] >= 0.6):
      #stack the coordinates row-wise using numpy.vstack()
      coordinates = np.vstack((coordinates,coordinate_i))

Define test points’ locations on a rectangular grid

 “””
 Initialize 2 arrays s1 and s2 (of number of grid points * number of grid points dimension) with ones
 “””
 s1 = np.ones(num_grid_pts * num_grid_pts)
 s2 = np.ones(num_grid_pts * num_grid_pts)
 #Update the s1 and s2 arrays
 for i in range(num_grid_pts):
   for j in range(num_grid_pts):
     s1[j * num_grid_pts + i] = (i + 1) / num_grid_pts
     s2[i * num_grid_pts + j] = (i + 1) / num_grid_pts
 #Stack the arrays s1 and s2 as test coordinates
 coordinates_test = np.column_stack((s1, s2))

Compute total number of data points as (number of grid points^2) + (number of training samples)

num_total = num_grid_pts**2 + num_train

Compute total number of grid coordinates

coordinates_total = np.vstack((coordinates_test,coordinates))

Create distance matrix

 #Initialize the matrix (of num_total * num_total dimension) with zeroes
 D = np.zeros((num_total, num_total))  
 #Update the distance matrix
 for i in range(0, num_total):
   for j in range(i + 1, num_total):
     D[i, j] = np.linalg.norm(coordinates_total[i, :] - coordinates_total[j, :])
     D[j, i] = D[i, j]

Compute noise standard deviation:

 Sigma = sigma2_1 * np.exp(-D / rho) + np.diag(np.zeros(num_total) + 1e-10)
 C = np.linalg.cholesky(Sigma)

Compute random samples from a normal distribution (as many as total number of data

points) and perform its dot product with parameter C.

b_total = C.dot(np.random.normal(size=num_total))

Prepare training data GP

b = b_total[(num_grid_pts*num_grid_pts):num_total]

Define mean function

 Define features set
 X = np.random.rand(num_train, 2)
 Define non-linear mean function F(X) 
 F_X = f1d(X[:, 0])

Compute independent error term

xi = np.sqrt(sigma2) * np.random.normal(size=num_train)

Compute the response variable (known as ‘label’) using the equation (i).

y = F_X + b + xi

Prepare test data

 “””
 Select evenly spaced numbers (as many as square of number of grid points) in the range [0,1]
 “””
 x = np.linspace(0,1,num_grid_pts**2)
 x[x==0.5] = 0.5 + 1e-10
 #Test set features
 X_test = np.column_stack((x,np.zeros(num_grid_pts**2)))
 #Test set labels
 y_test = f1d(X_test[:, 0]) + b_total[0:(num_grid_pts**2)] + np.sqrt(sigma2) * np.random.normal(size=(num_grid_pts**2))

Model training

 # Create Gaussian process model 
 gpmod = gb.GPModel(gp_coords=coordinates, cov_function="exponential")
 “””
 cov_function denoted the covariance function for GP. ‘exponential’, ‘matern’, ‘gaussian’ and ‘powered_exponential’ are the possible values; ‘exponential’ being the default one
 “””
 #Create dataset for GP using features set X and labels y
 train_data = gb.Dataset(X, y)
 #Define a dictionary for parameters of the GP
 parameters = { 'objective': 'regression_l2', 'learning_rate': 0.01,
 'max_depth': 3, 'min_data_in_leaf': 10, 'num_leaves': 2**10, 'verbose': 0 }
 #Train the GP model with provided parameters
 model_train = gb.train(params=parameters, train_set=train_data,
 gp_model=gpmod, num_boost_round=247)
 #num_boost_round denotes the number of boosting iterations
 #Print the covariance parameters estimated by the GP model
 print("Estimated covariance parameters:")
 gpmod.summary()

Output:

 Estimated covariance parameters:
 Covariance parameters 
 ['Error_term', 'GP_var', 'GP_range']
 [1.28340739e-268 2.90711171e-001 5.47936824e-002]

Make predictions bypassing the GP features (i.e. prediction locations/coordinates here) and predictor variables for the tree ensemble to the predict() function. The function separately returns predictions for the tree ensemble and the GP. Add both of them to get a single point prediction.

 prediction = model_train.predict(data=X_test, gp_coords_pred=coordinates_test, predict_var=True)
 “””
 gp_coords_pred denotes the features for GP.predict_var=True means predictive variances will also be computed in addition to predictive mean
 “””
 #Add the predictions for GP and tree ensemble
 y_pred = prediction['fixed_effect'] + prediction['random_effect_mean']
 #Compute and print the mean-squared error
 print("Mean square error (MSE): " + str(np.mean((y_pred-y_test)**2)))

Output:

Mean square error (MSE): 0.367071629709704

Interpret the trained model using SHAP library

 shap_values = shap.TreeExplainer(model_train).shap_values(X)
 #Display the summary of gpmod model
 shap.summary_plot(shap_values, X)

Output:

 #Display shap dependence plot
 shap.dependence_plot("Feature 0", shap_values, X)

Output:

Trees used as base learners for boosting have several parameters. Here, we tune those parameters

 # Create random effects model 
 gp_mod = gb.GPModel(gp_coords=coordinates, cov_function="exponential")
 #Set parameters for estimating the GP covariance parameters
 gp_mod.set_optim_params(params={"optimizer_cov": "gradient_descent"})
 #’optimizer_cov’ denotes optimizer to be used for estimation
 #Create dataset for GP
 train_set = gb.Dataset(X, y)
 # Define parameter grid
 parameter_grid = {'learning_rate': [0.1,0.05,0.01], 'min_data_in_leaf': [5,10,20,50],
 'max_depth': [1,3,5,10,20]}
 # Other parameters to be tuned
parameters = { 'objective': 'regression_l2', 'verbose': 0, 'num_leaves': 2**17 }
 “””
 grid_search_tune_parameters() randomly chooses tuning parameters from the defined grid using cross-validation
 “””
 opt_parameters = gb.grid_search_tune_parameters(
   param_grid=parameter_grid,
   params=parameters,
   num_try_random=20,
"""
grid search will be run 20 times, each time using a different combination of tuning parameters
"""
   nfold=4,  #value of ‘k’ for k-fold cross-validation
   gp_model=gp_mod,
   use_gp_model_for_validation=True,
   train_set=train_set,
   verbose_eval=1,
   num_boost_round=1000,  #number of boosting iterations
   early_stopping_rounds=5,  #number of early stopping iterations
   seed=1,
   metrics='l2')
"""
Best parameters’s combination will be returned in each of the 20 rounds. Display them.
"""
 print("Best number of iterations: " + str(opt_parameters['best_iter']))
 print("Best score: " + str(opt_parameters['best_score']))
 print("Best parameters: " + str(opt_parameters['best_params']))

Output: