In machine learning, building an accurate model requires choosing a good set of hyperparameters. Techniques such as grid search and random search are widely used for this purpose, but the tuning process they involve is usually slow. Bayesian statistics can also be used for hyperparameter tuning, and it can make the process considerably faster, especially in the case of neural networks. In this article, we are going to discuss the process of hyperparameter tuning in neural networks using Bayesian optimization. The major points that we will discuss here are listed below.
Table of Contents
- Tuning with Bayesian Statistics
- Mathematical Functions
- Bayesian Inference
- Gaussian Process
- Benefits of Tuning using Bayesian Statistics
- Implementation in Python
Tuning with Bayesian Statistics
In any modelling procedure we define functions; in neural networks in particular, we define layers and stack them on top of one another. These functions, or layers in the case of neural networks, use parameters, and by setting the parameter values we make them perform a task on the data or on the preceding or succeeding layer. Now assume an objective function that uses a set of hyperparameters and produces a score such as loss, accuracy or MSE to indicate its performance level; we want to optimize this score (maximize accuracy, or minimize loss and MSE). We can treat this function as a black box because we don't know its inner structure; we only know the parameters it takes.
In hyperparameter tuning we search for the set of parameters that optimizes the performance metric of this function, and for this purpose we can use Bayesian statistics. Bayesian optimization chooses which parameters to evaluate next with the help of a Gaussian process, a model that captures what we believe about the structure of the objective function. The algorithm keeps selecting promising points from the parameter space, evaluating the objective there and updating the Gaussian process with the results, until a stopping condition for convergence is reached.
Let's start with the mathematics behind the Bayesian approach. In the next sections, we discuss Bayesian inference and the Gaussian process, which will give a better understanding of the rest of the article.
Mathematical Functions
As we have discussed before, let's say that we have an objective function f(⋅) defined over a space of parameters X, and our requirement is to find an optimal set of parameters using Bayesian statistics.
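In symbols, the goal of the search can be written as

x* = arg max f(x), over x ∈ X

where f(⋅) is the objective function and X is the space of hyperparameter settings.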
Since f(⋅) is a black-box function, it is expensive to evaluate. In such a case we need a model that can approximate the function cheaply, and for this we can use the Gaussian process as a surrogate model for f(⋅).
Bayesian Inference
Before turning to Bayesian inference, recall that maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution. Let's consider that we have a data sample X and we need to identify its distribution. Using MLE we assume that X follows a certain distribution g with parameter θ, as follows:
X ∼ g(⋅∣θ)
Then the likelihood of the data sample X = (x1, x2, …, xn) will be

L(θ ∣ X) = g(x1 ∣ θ) × g(x2 ∣ θ) × … × g(xn ∣ θ)
And the maximum likelihood estimate of the parameter will be

θ̂ = arg maxθ L(θ ∣ X)
Now we have already made an assumption about the distribution of the data, and we want that distribution to describe the data properly. Bayesian inference combines the likelihood with a prior over the parameters to give the posterior distribution:

p(θ ∣ X) ∝ L(θ ∣ X) q(θ)
Here q is the prior and p(θ∣X) is the posterior; the prior reflects our assumption that θ ∼ q.
We have very little information about the parameters θ beforehand. In Bayesian inference, we want to find the distribution of the parameters based on the sample data we have, so the posterior is a distribution over the parameters given the data X. Maximizing the posterior gives a point estimate of the parameters, while the full posterior also tells us how uncertain that estimate is; in this sense the posterior carries all the information we need about the unknown quantities.
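To make this concrete, here is a tiny self-contained sketch (separate from the tuning pipeline) that computes a posterior over a single Bernoulli parameter θ on a grid of candidate values; the coin-flip data and the uniform prior are made up purely for illustration.

import numpy as np

# Made-up coin-flip data: 7 heads out of 10 tosses
data = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])

# Grid of candidate values for the unknown parameter theta
theta = np.linspace(0.01, 0.99, 99)

# Prior q(theta): uniform over the grid
prior = np.ones_like(theta) / len(theta)

# Likelihood L(theta | X) for Bernoulli observations
heads, n = data.sum(), len(data)
likelihood = theta**heads * (1 - theta)**(n - heads)

# Posterior p(theta | X) is proportional to likelihood times prior
posterior = likelihood * prior
posterior /= posterior.sum()

print("MAP estimate of theta:", theta[np.argmax(posterior)])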
Gaussian Process
We can consider the Gaussian process as a modelling tool in Bayesian statistics that can be used as a surrogate model to describe the function we want to optimize, f(⋅).
The procedure of the Gaussian process in Bayesian inference starts with initializing some points, say x1, x2, …, xn. It then evaluates the function at these points and collects the values in a vector:

f = (f(x1), f(x2), …, f(xn))
This vector is treated as if it were drawn randomly from a prior probability distribution. The next step of the process is to choose that prior to be a multivariate normal distribution, which is specified by two quantities:
- Mean Vector
- Covariance Matrix.
The mean vector is obtained by evaluating a mean function at each point, and the covariance matrix is constructed by evaluating a covariance (kernel) function at every pair of points; this construction can be done with many different kernels. So the prior built on the vector is

(f(x1), f(x2), …, f(xn)) ∼ N(m, K)
Here m is the mean vector (often simply taken to be zero) and K is the n × n kernel matrix with Kij = k(xi, xj).
From the above, we can see that the Gaussian process defines a prior over functions, which can then be updated into a posterior distribution according to Bayesian inference as new evaluations arrive.
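As a small illustration of this construction, the sketch below builds the kernel matrix K with an RBF (squared-exponential) kernel, one common choice among many, and draws a few random function vectors from the resulting multivariate normal prior; the input points and length scale are arbitrary.

import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel: k(x, x') = exp(-(x - x')^2 / (2 * l^2))
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

# Arbitrary one-dimensional input points x1, ..., xn
x = np.linspace(0, 5, 20)

# Kernel matrix K with Kij = k(xi, xj); a small jitter keeps it numerically positive definite
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))

# Draw three sample function vectors from the zero-mean multivariate normal prior
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(samples.shape)  # (3, 20): three functions evaluated at the 20 points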
Now an acquisition (evaluation) function can be used for selecting the next sample point with the help of the updated posterior. Normally the expected improvement function is used as the evaluation function:

EI(x) = E[max(f(x) − f*, 0)]

where f* is the best value of the objective observed so far.
Since the function is modelled by a Gaussian process, this expectation can be computed in closed form as follows:

EI(x) = (μ(x) − f*) Φ(Z) + σ(x) φ(Z)
Where,

Z = (μ(x) − f*) / σ(x)

with μ(x) and σ(x) the posterior mean and standard deviation of the Gaussian process at x, and Φ and φ the standard normal CDF and PDF.
So by the above, we can say that a higher value of EI(x) means that, on average, the Gaussian process expects a larger improvement over the best value found so far at that point, making it a good candidate to evaluate next.
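The formula above is straightforward to translate into code. The sketch below computes expected improvement with SciPy's standard normal distribution; the posterior means, standard deviations and best observed value are made-up numbers standing in for the output of a fitted Gaussian process.

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_f):
    # EI(x) = (mu(x) - f*) * Phi(Z) + sigma(x) * phi(Z), with Z = (mu(x) - f*) / sigma(x)
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)
    improvement = mu - best_f
    with np.errstate(divide="ignore", invalid="ignore"):
        z = improvement / sigma
        ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    # Points with zero predictive uncertainty cannot improve on the best value
    return np.where(sigma > 0, ei, 0.0)

# Made-up posterior means and standard deviations at three candidate points
mu = [0.80, 0.85, 0.90]
sigma = [0.05, 0.10, 0.02]
best_f = 0.88  # best objective value observed so far
print(expected_improvement(mu, sigma, best_f))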
Benefits of Tuning Using Bayesian Statistics
We have now seen the process and the techniques on which Bayesian hyperparameter tuning is based. From this, we can see that the approach has various benefits. Some of them are listed below:
- The process starts from a few randomized candidate points and then chooses further points in an informed way, so tuning does not take as much time as an exhaustive search.
- The Gaussian process working underneath makes the search stronger in terms of performance.
- Optimizing the model the Bayesian way creates a balance between exploration and exploitation, so that sampling points leads to an optimal value being found after exploring the parameter space.
- It only needs the range of each parameter as input, which is a convenient property of the process and helps speed up the procedure.
Implementation in Python
In this part of the article, we are going to build a sequential neural network using Keras and perform hyperparameter tuning on it using Bayesian statistics. For this purpose, we use a package named bayesian-optimization, which can be installed using the following command.
!pip install bayesian-optimization
After installing it, we are ready to use the package. To learn more about the structure of the package, the reader can follow this link.
In the process, we will build a model with hyperparameter tuning using Bayesian statistics on the MNIST data set, and the model will perform image classification on this data. Since some preprocessing of the data is required and it is not the focus here, we are not reproducing those codes; the whole preprocessing code can be found here.
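Since the later code refers to train and test datasets and an input_shape variable, a minimal preprocessing sketch is given below for completeness; it assumes batched tf.data pipelines built from tf.keras.datasets.mnist, and the exact steps in the linked notebook may differ slightly.

import tensorflow as tf

NUM_CLASSES = 10
input_shape = (28, 28, 1)

# Load MNIST, add a channel dimension and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0

# One-hot encode the labels for categorical cross-entropy
y_train = tf.keras.utils.to_categorical(y_train, NUM_CLASSES)
y_test = tf.keras.utils.to_categorical(y_test, NUM_CLASSES)

# Batched tf.data pipelines named train and test, as used later on
train = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(10000).batch(128).repeat()
test = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(128)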
Let’s start with importing the libraries.
from functools import partial

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Dropout, BatchNormalization, MaxPooling2D, Flatten, Activation
from bayes_opt import BayesianOptimization
Next, we can build a function that returns a model for a given set of parameters.
NUM_CLASSES = 10

def get_model(input_shape, dropout2_rate=0.5, dense_1_neurons=128):
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3),
                     activation='relu',
                     input_shape=input_shape,
                     name="conv2d_1"))
    model.add(Conv2D(64, (3, 3), activation='relu', name="conv2d_2"))
    model.add(MaxPooling2D(pool_size=(2, 2), name="maxpool2d_1"))
    model.add(Dropout(0.25, name="dropout_1"))
    model.add(Flatten(name="flatten"))
    model.add(Dense(dense_1_neurons, activation='relu', name="dense_1"))
    model.add(Dropout(dropout2_rate, name="dropout_2"))
    model.add(Dense(NUM_CLASSES, activation='softmax', name="dense_2"))
    return model
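As a quick sanity check, we can build the model once with the MNIST input shape of (28, 28, 1) and print its summary:

model = get_model((28, 28, 1))
model.summary()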
Now we can make another function that takes the following arguments:
- Verbose
- Input_shape
- Dropout rate
- Number of dense-layer neurons (as a multiple of 128)
- Learning rate
We fix the verbose level and the input shape, while the dropout rate of the second dropout layer and the learning rate are the hyperparameters that the function below will allow us to tune.
def fit_model(input_shape, verbose, dropout2_rate, lr, dense_1_neurons_x128=1.0):
    # dense_1_neurons_x128 defaults to 1.0 (i.e. 128 neurons); add it to the
    # search space (pbounds) below if it should be tuned as well
    dense_1_neurons = max(int(dense_1_neurons_x128 * 128), 128)
    model = get_model(input_shape, dropout2_rate, dense_1_neurons)

    opt = tf.keras.optimizers.RMSprop(learning_rate=lr)
    model.compile(loss=tf.keras.losses.categorical_crossentropy,
                  optimizer=opt,
                  metrics=['accuracy'])

    # train and test are the preprocessed MNIST datasets; batching is handled
    # by the data pipeline, so batch_size is not passed to fit here
    model.fit(x=train, epochs=1, steps_per_epoch=468, verbose=verbose)
    score = model.evaluate(test, steps=10, verbose=0)
    print('loss:', score[0])
    print('accuracy:', score[1])
    return score[1]
Now we can use the function for training and checking the accuracy without hyperparameter tuning.
verbose = 1
fit_model_partial = partial(fit_model, input_shape, verbose)
fit_model_partial(dropout2_rate=0.5, lr=0.001)
Output:

Now we can see that the function is working properly. We can use the BayesianOptimization class for hyperparameter tuning and then fit the model on the tuned parameters.
from bayes_opt import BayesianOptimization

pbounds = {
    'dropout2_rate': (0.1, 0.5),
    'lr': (1e-4, 1e-2),
}

optimizer = BayesianOptimization(
    f=fit_model_partial,
    pbounds=pbounds,
    verbose=2,
    random_state=1,
)
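To actually run the search, we call the maximize method of the optimizer; the numbers of random initial points and of optimization iterations used below are only example values and can be changed as needed.

optimizer.maximize(init_points=10, n_iter=10)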
Output:

The output above shows the results of some iterations of the hyperparameter tuning. The highlighted portion of the results represents some of the best accuracies found during the tuning procedure.
We can inspect every iteration and retrieve the maximum accuracy together with its tuned parameters through the res and max attributes of the optimizer object.
for x, res in enumerate(optimizer.res):
    print("Iteration {}: \n\t{}".format(x, res))

print(optimizer.max)
Output:

Here we can see the best set of parameters found using Bayesian statistics for hyperparameter tuning of the neural network.
Performing the steps given above, I found the process faster than other hyperparameter tuning processes such as grid search and random search. We can also see that very little code is required: hyperparameter tuning can be implemented in just a few lines using the BayesianOptimization package.
Final words
Here in the article, we have seen how Bayesian statistics helps in the process of hyperparameter tuning, along with an overview of Bayesian inference and the Gaussian process. We have also seen how the whole process can be implemented in Python using the bayesian-optimization package, and the benefits of hyperparameter tuning using Bayesian statistics.