
Hands-on Guide to Gradient Centralization

Nikita Shiledarbaxi

Model optimization plays a vital role in improving the performance of a Deep Neural Network (DNN). Techniques such as Batch Normalization and Weight Standardization perform Z-score standardization on the activations or weights of the network. This article describes a novel optimization method called ‘Gradient Centralization (GC)’, which operates directly on gradients instead. It was introduced in April 2020 by Hongwei Yong, Jianqiang Huang, Xiansheng Hua and Lei Zhang – researchers at The Hong Kong Polytechnic University and the DAMO Academy.

Overview of Gradient Centralization 

GC can be viewed as applying gradient descent with a constrained loss function; the constraint on the loss function arises from a new constraint introduced on the weight vectors. GC improves the generalization performance of DNNs by regularizing both the weight space and the output feature space. Besides, it improves the Lipschitzness of the loss function and its gradient, thereby stabilizing the training process of the network and making it more efficient as well.

How does Gradient Centralization differ from Batch Normalization and Weight Standardization?

Batch Normalization (BN) optimizes the DNN using first- and second-order statistics to carry out Z-score standardization on activations. The Weight Standardization (WS) technique applies the same Z-score standardization, but to weight vectors. Both BN and WS can improve the Lipschitz property of the loss function. What if such normalization were carried out directly on the gradients instead of on the weights or activations?

If Z-score standardization is performed on the gradients, in the way BN and WS normalize activations and weights respectively, the stability of model training does not improve. GC therefore employs a different technique: instead of working with both mean and variance as BN and WS do, it only centralizes the gradients to have zero mean – hence the name Gradient “Centralization”.
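To make the distinction concrete, here is a minimal NumPy sketch (an illustration added for this guide, not code from the GC paper) contrasting the two operations on a toy gradient vector:

 import numpy as np

 g = np.random.randn(8)  # a toy gradient vector

 # Z-score standardization, the operation BN/WS apply to activations/weights:
 z_scored = (g - g.mean()) / (g.std() + 1e-8)

 # Gradient Centralization only subtracts the mean; the scale is untouched:
 centralized = g - g.mean()

 print(centralized.mean())  # ~0.0 - zero mean, variance unchanged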

How does Gradient Centralization work?

Let us have a look at the basic notations first.

Weight matrix for FC (Fully Connected) layers:

$W_{fc} \in \mathbb{R}^{C_{in} \times C_{out}}$ … (i)

Weight tensor for a convolutional layer is denoted as:

$W_{conv} \in \mathbb{R}^{C_{in} \times C_{out} \times k_1 \times k_2}$ … (ii)
In the above equations (i) and (ii),

 $C_{in}$: number of input channels

 $C_{out}$: number of output channels

 $k_1$ and $k_2$: sizes of the kernels in convolutional layers

In general, the weight matrix can be denoted as $W \in \mathbb{R}^{M \times N}$, whose $i$th column is the weight vector $w_i$.

$\nabla_W L$ and $\nabla_{w_i} L$ represent the gradient of the loss function $L$ w.r.t. the weight matrix $W$ and the weight vector $w_i$ ($i = 1, 2, \ldots, N$, i.e. the $i$th column vector of the weight matrix), respectively.

The GC operator $\Phi_{GC}$ for the $i$th weight vector $w_i$ is given as:

$\Phi_{GC}(\nabla_{w_i} L) = \nabla_{w_i} L - \mu_{\nabla_{w_i} L}$

Here, $\mu_{\nabla_{w_i} L} = \frac{1}{M} \sum_{j=1}^{M} \nabla_{w_{i,j}} L$

For an FC layer, $M$ is the number of neurons in the previous layer (the length of each weight vector) and $N$ is the number of neurons in the next layer (the number of weight vectors).

The single, simple step GC takes for optimization is to centralize each gradient vector to have zero mean. To do so, it computes the column mean of each column vector of the weight matrix’s gradient and then subtracts that mean from the corresponding column vector. The process can be diagrammatically represented as follows:

Image source: GC research paper
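To make the operation concrete, here is a small NumPy sketch (an illustrative example, not the paper’s code) that applies GC to the gradient of a toy 3×2 weight matrix:

 import numpy as np

 # Gradient of the loss w.r.t. a weight matrix of shape (M, N); each of the
 # N columns is the gradient vector for one weight vector w_i
 grad = np.array([[1., 4.],
                  [2., 5.],
                  [3., 6.]])

 # GC: subtract each column's mean from that column
 gc_grad = grad - grad.mean(axis=0, keepdims=True)

 print(gc_grad)               # [[-1. -1.] [ 0.  0.] [ 1.  1.]]
 print(gc_grad.mean(axis=0))  # [0. 0.] - every column now has zero mean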


Key features of Gradient Centralization

  • It can perform weight space regularization and output feature space regularization. This reduces overfitting on the training data and improves the model’s generalization performance.
  • It can smooth the optimization landscape, similar to Batch Normalization (BN) and Weight Standardization (WS). This makes the weights’ gradients more predictive and stable, allowing faster model training.
  • It also avoids gradient explosion, thereby stabilizing the model training process.
  • It can be easily embedded into existing gradient-based DNN optimization algorithms such as Adam and SGDM, as the sketch below illustrates.
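For intuition, the following is a minimal TensorFlow sketch of how GC could be slotted into a plain SGD update. The names centralize and sgd_step_with_gc are assumed for illustration; this is not the gctf package’s implementation:

 import tensorflow as tf

 def centralize(grad):
     # Centralize over all axes except the last (the output-channel axis of
     # Keras Dense and Conv kernels); 1-D bias gradients are left unchanged
     if len(grad.shape) > 1:
         axes = list(range(len(grad.shape) - 1))
         grad = grad - tf.reduce_mean(grad, axis=axes, keepdims=True)
     return grad

 def sgd_step_with_gc(weights, grads, lr=0.01):
     # One plain SGD update, with GC applied to every gradient first
     for w, g in zip(weights, grads):
         w.assign_sub(lr * centralize(g))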

Practical implementation 

Here’s a demonstration of GC using gradient-centralization-tf, a Python package designed to implement GC with TensorFlow. We have used the Horses or Humans dataset, which contains 500 rendered images of horses and 527 rendered images of humans in different poses and locations. Each image is 300×300 pixels in 24-bit color. We implemented the code in Google Colab using Python 3.7.10.

Step-wise explanation of the code is as follows:

  1. Install the gradient-centralization-tf package using the pip command

!pip install gradient-centralization-tf

  2. Import required libraries
 import tensorflow as tf
 from time import time  #for execution-time computation
 import os  #for interacting with the operating system
 import zipfile  #for extracting the dataset’s .zip files
 import gctf  #the gradient-centralization-tf package
 #for image augmentation
 from tensorflow.keras.preprocessing.image import ImageDataGenerator
 from tensorflow.keras.optimizers import RMSprop
 #to print results of the comparison in tabular form
 from tabulate import tabulate
  3. Download the data from GCS (Google Cloud Storage)
 #Get training data
 !wget --no-check-certificate \
     https://storage.googleapis.com/laurencemoroney-blog.appspot.com/horse-or-human.zip \
     -O /tmp/horse-or-human.zip
 #Get validation data
 !wget --no-check-certificate \
     https://storage.googleapis.com/laurencemoroney-blog.appspot.com/validation-horse-or-human.zip \
     -O /tmp/validation-horse-or-human.zip
  4. Read and extract the dataset’s downloaded .zip files
 #Path of the training data file
 file = '/tmp/horse-or-human.zip'
 #Read the zip file
 reference = zipfile.ZipFile(file, 'r')
 #Extract the data
 reference.extractall('/tmp/horse-or-human')
 #Repeat the process for extracting validation data
 file = '/tmp/validation-horse-or-human.zip'
 reference = zipfile.ZipFile(file, 'r')
 reference.extractall('/tmp/validation-horse-or-human')
 #Close the archive file using ZipFile.close()
 reference.close()
  5. Create separate directories of horse and human images for training and validation
 # Directory with training horse pictures
 horse_train = os.path.join('/tmp/horse-or-human/horses')
 # Directory with training human pictures
 human_train = os.path.join('/tmp/horse-or-human/humans')
 # Directory with validation horse pictures
 horse_validation = os.path.join('/tmp/validation-horse-or-human/horses')
 # Directory with validation human pictures
 human_validation = os.path.join('/tmp/validation-horse-or-human/humans')
  6. Increase the amount of training and validation data by image augmentation
 #Modify training images
 trainDatagen = ImageDataGenerator(
       rescale=1./255,  #rescaling factor
       rotation_range=40,  #rotate image by up to 40 degrees
       width_shift_range=0.2,  #fraction of total image width
       height_shift_range=0.2,  #fraction of total image height
       shear_range=0.2,  #shear intensity
       zoom_range=0.2,  #zooming range will be [1-0.2,1+0.2] = [0.8,1.2]
       horizontal_flip=True,  #flip the image horizontally
       fill_mode='nearest')  #way to fill the points outside the input’s boundaries
 #Modify validation set images (rescaling only)
 validDatagen = ImageDataGenerator(rescale=1/255)
 # Flow training images in batches of 128 using the trainDatagen generator
 trainGen = trainDatagen.flow_from_directory(
         '/tmp/horse-or-human/',  # source directory for training images
         target_size=(300, 300),
         batch_size=128,
         class_mode='binary')
 #binary labels are required because we will use binary_crossentropy loss, as
 #this is a binary classification task (classify images as horse or human)
 # Similarly, flow validation images in batches of 32 using validDatagen
 validationGen = validDatagen.flow_from_directory(
         '/tmp/validation-horse-or-human/',  # source directory for validation images
         target_size=(300, 300),
         batch_size=32,
         class_mode='binary')


 Found 1027 images belonging to 2 classes.
 Found 256 images belonging to 2 classes. 

The above output shows that our data has a total of 1027 training images and 256 validation images. Each image belongs to either of the two classes – horse or human.

  7. Build the DNN model
 myModel = tf.keras.models.Sequential([
     # 1st convolution
     #convolutional layer - 16 filters, kernel size 3*3
     tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(300, 300, 3)),
     tf.keras.layers.MaxPooling2D(2, 2),  #pooling layer
     # 2nd convolution
     #convolutional layer - 32 filters, kernel size 3*3
     tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
     tf.keras.layers.Dropout(0.5),  #dropout regularization
     tf.keras.layers.MaxPooling2D(2,2),  #pooling layer
     # 3rd convolution
     #convolutional layer - 64 filters, kernel size 3*3
     tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
     tf.keras.layers.Dropout(0.5),  #dropout regularization
     tf.keras.layers.MaxPooling2D(2,2),  #pooling layer
     # Flatten the results to feed into a DNN
     tf.keras.layers.Flatten(),
     tf.keras.layers.Dropout(0.5),  #dropout regularization
     # Hidden layer with 512 neurons
     tf.keras.layers.Dense(512, activation='relu'),
     #Output layer with a single neuron. It will output 0 (horse) or 1 (human)
     tf.keras.layers.Dense(1, activation='sigmoid')
 ])
  8. Create a class for computing training time so that we can compare the model optimized with GC against the one without it
 class TimeTaken(tf.keras.callbacks.Callback):
     def on_train_begin(self, logs={}):
         #initialize the list of per-epoch times when training begins
         self.times = []
     def on_epoch_begin(self, batch, logs={}):
         #record the time when an epoch begins
         self.epoch_time_start = time()
     def on_epoch_end(self, batch, logs={}):
         #record the time when an epoch ends; subtracting the epoch’s start
         #time from the current time gives the time taken by the epoch
         self.times.append(time() - self.epoch_time_start)
  9. Train the model without using GC
 time1 = TimeTaken()
 #Compile the model
 myModel.compile(loss='binary_crossentropy',  #loss function
               optimizer=RMSprop(lr=1e-4),  #'lr' is the learning rate
               metrics=['accuracy'])  #metric to monitor
 #Fit the model on the training data
 hist1 = myModel.fit(trainGen,
       steps_per_epoch=8,  #number of steps for each epoch
       epochs=10,  #number of epochs
       validation_data = validationGen,
       validation_steps=8,  #number of validation steps
       callbacks = [time1])

Sample output:

  10. Train the model with GC used for optimization
 time2 = TimeTaken()
 #Compile the model, this time with the gctf optimizer
 myModel.compile(loss='binary_crossentropy',
               optimizer=gctf.optimizers.rmsprop(learning_rate = 1e-4),
               metrics=['accuracy'])
 #Fit the model on the training data
 hist2 = myModel.fit(trainGen,
       steps_per_epoch=8,
       epochs=10,
       validation_data = validationGen,
       validation_steps=8,
       callbacks = [time2])

Sample output:

  11. Compare the results of execution with and without GC
 comparisonData = [["Model w/o gctf:", sum(time1.times), hist1.history['accuracy'][-1], hist1.history['loss'][-1]],
  ["Model with gctf", sum(time2.times), hist2.history['accuracy'][-1], hist2.history['loss'][-1]]]
 #Tabulate the comparisonData using the tabulate() method
 print(tabulate(comparisonData, headers=["Type", "Execution time", "Accuracy", "Loss"]))

Sample output:

Note: The outputs may vary for each iteration and depending upon the execution environment you use.

  • Conclusion: The tabular output shows that using GC for model optimization reduces the model’s training time and loss, besides improving its accuracy.
  • The Google Colab notebook for the above implementation can be found here.


Refer to the GC research paper and the gradient-centralization-tf package documentation for an in-depth understanding of the Gradient Centralization technique and its Python implementation.
