Hands-on Guide to Gradient Centralization



Nikita Shiledarbaxi

Model optimization plays a vital role in improving the performance of a Deep Neural Network (DNN). Techniques such as Batch Normalization and Weight Standardization perform Z-score standardization on the activations or weights of the network. This article describes a novel optimization method called ‘Gradient Centralization (GC)’, which works directly on gradients instead. It was introduced in April 2020 by Hongwei Yong, Jianqiang Huang, Xiansheng Hua and Lei Zhang, researchers at The Hong Kong Polytechnic University and the DAMO Academy.

Overview of Gradient Centralization 

GC can be viewed as gradient descent with a constrained loss function: centralizing the gradients implicitly imposes a constraint on the weight vectors. It improves the generalization performance of DNNs by regularizing the output feature space as well as the weight space. In addition, it improves the Lipschitzness of the loss function and of its gradient, which stabilizes the training process of the network and makes it more efficient.

How does Gradient Centralization differ from Batch Normalization and Weight Standardization?

Batch Normalization (BN) uses first- and second-order statistics (mean and variance) to carry out Z-score standardization on activations. Weight Standardization (WS) applies the same Z-score standardization, but to weight vectors. Both BN and WS can improve the Lipschitz property of the loss function. What if such normalization were carried out directly on the gradients instead of on weights or activations?

If Z-score standardization is applied to the gradients, as BN and WS do for activations and weights respectively, the stability of model training does not improve. So GC takes a different approach. Instead of using both the mean and the variance, as BN and WS do, it only centralizes the gradients to have zero mean, hence the name Gradient “Centralization”.
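To make the distinction concrete, here is a minimal sketch in plain Python. The gradient values and the helper names `z_score` and `centralize` are made up for illustration: Z-score standardization changes both the mean and the scale of a vector, whereas centralization only shifts it to zero mean.

```python
from statistics import mean, pstdev

def z_score(values):
    """Shift to zero mean AND rescale to unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def centralize(values):
    """Shift to zero mean only; the spread of the values is untouched."""
    mu = mean(values)
    return [v - mu for v in values]

grad = [0.5, -1.0, 2.0, 0.1]  # hypothetical gradient of one weight vector

standardized = z_score(grad)
centralized = centralize(grad)

# Centralization keeps the original standard deviation ...
print(pstdev(grad), pstdev(centralized))
# ... while Z-score standardization forces it to 1.
print(pstdev(standardized))
```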

How does Gradient Centralization work?

Let us have a look at the basic notations first.

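Weight matrix for FC (Fully Connected) layers:

$$W_{fc} \in \mathbb{R}^{C_{in} \times C_{out}} \quad \text{…(i)}$$

Weight tensor for a convolutional layer is denoted as:

$$W_{conv} \in \mathbb{R}^{C_{in} \times C_{out} \times (k_1 k_2)} \quad \text{…(ii)}$$

In equations (i) and (ii) above,

 $C_{in}$: number of input channels

 $C_{out}$: number of output channels

 $k_1$ and $k_2$: size of the kernels in convolutional layers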

In general, the weight matrix can be denoted as W.

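$\nabla_W L$ and $\nabla_{w_i} L$ represent the gradients of the loss function $L$ w.r.t. the weight matrix $W$ and the weight vector $w_i$ ($i = 1, 2, \ldots, N$), i.e. the $i$-th column vector of the weight matrix, respectively.

The GC operator $\Phi_{GC}$ for the $i$-th weight vector $w_i$ is given as:

$$\Phi_{GC}(\nabla_{w_i} L) = \nabla_{w_i} L - \mu_{\nabla_{w_i} L}$$

Here, $\mu_{\nabla_{w_i} L} = \frac{1}{M} \sum_{j=1}^{M} \nabla_{w_{i,j}} L$

$M$ is the number of neurons in the previous layer and $N$ is the number of neurons in the next layer.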

In a single step, GC centralizes each gradient vector to have zero mean. To do so, it computes the mean of every column vector (or slice, for a convolutional layer) of the gradient of the weight matrix and then subtracts that mean from the corresponding column vector. The process can be diagrammatically represented as follows:

[Figure: illustration of the GC operation. Image source: GC research paper]
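The column-wise centering described above can be sketched in a few lines of NumPy. This is an illustration, not the gctf implementation. It assumes the Keras kernel layout, in which the output axis comes last (shape `(M, N)` for a Dense kernel and `(k1, k2, C_in, C_out)` for a Conv2D kernel), and it assumes that 1-D gradients such as biases are left untouched. The function name `centralize_gradient` and the random gradients are hypothetical.

```python
import numpy as np

def centralize_gradient(grad):
    """Subtract from each weight vector's gradient its own mean.

    Assumption: the last axis indexes the N output units, so the mean is
    taken over all other axes (the M entries of each weight vector).
    """
    grad = np.asarray(grad, dtype=float)
    if grad.ndim < 2:          # assumption: skip biases and other 1-D gradients
        return grad
    reduce_axes = tuple(range(grad.ndim - 1))
    return grad - grad.mean(axis=reduce_axes, keepdims=True)

rng = np.random.default_rng(0)
dense_grad = rng.normal(size=(5, 3))        # M = 5 inputs, N = 3 outputs
conv_grad = rng.normal(size=(3, 3, 4, 8))   # k1 = k2 = 3, C_in = 4, C_out = 8
bias_grad = rng.normal(size=(8,))

dense_gc = centralize_gradient(dense_grad)
conv_gc = centralize_gradient(conv_grad)

# Every column (Dense) or output-channel slice (Conv2D) now has zero mean.
print(np.allclose(dense_gc.mean(axis=0), 0.0))
print(np.allclose(conv_gc.mean(axis=(0, 1, 2)), 0.0))
```

For the convolutional kernel, averaging over the first three axes is the same as unfolding the kernel into a matrix with M = k1 · k2 · C_in rows and centering each of its C_out columns.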


Key features of Gradient Centralization

  • It regularizes both the weight space and the output feature space. This reduces overfitting of the model on training data and improves its generalization performance.
  • It smooths the optimization landscape, much as Batch Normalization (BN) and Weight Standardization (WS) do. This makes the weight gradients more predictive and stable, which speeds up model training.
  • It also helps avoid gradient explosion, thereby stabilizing the model training process.
  • It can be easily embedded into existing gradient-based DNN optimization algorithms such as Adam and SGDM.
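The last point, and the weight-space constraint mentioned in the overview, can be illustrated with a small sketch in plain Python. It uses one common formulation of SGD with momentum. The learning rate, momentum coefficient, random stand-in gradients and the function names are all assumptions made for this example. The only GC-specific change is the single `centralize` call. Because a centralized gradient sums to zero, the sum of the weight vector never changes during training.

```python
import random

def centralize(grad):
    """GC operator for a single weight vector: subtract the mean."""
    mu = sum(grad) / len(grad)
    return [g - mu for g in grad]

def sgdm_gc_step(w, grad, momentum_buf, lr=0.1, beta=0.9):
    """One SGD-with-momentum update applied to the centralized gradient."""
    g = centralize(grad)  # the only line that GC adds to plain SGDM
    momentum_buf = [beta * m + gi for m, gi in zip(momentum_buf, g)]
    w = [wi - lr * mi for wi, mi in zip(w, momentum_buf)]
    return w, momentum_buf

random.seed(0)
w = [0.3, -0.2, 0.5, 0.1]          # hypothetical weight vector
initial_sum = sum(w)
buf = [0.0] * len(w)

for _ in range(100):
    fake_grad = [random.uniform(-1, 1) for _ in w]  # stand-in for a real gradient
    w, buf = sgdm_gc_step(w, fake_grad, buf)

# The weights have moved, but their sum is still the initial one
# (up to floating-point rounding).
print(abs(sum(w) - initial_sum) < 1e-9)
```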

Practical implementation 

Here’s a demonstration of GC using gradient-centralization-tf, a Python package that implements GC for TensorFlow. We use the Horses or Humans dataset, which contains 500 rendered images of horses and 527 rendered images of humans in different poses and locations. Each image is 300×300 pixels in 24-bit color. The code was run in Google Colab with Python 3.7.10.

Step-wise explanation of the code is as follows:

  1. Install the gradient-centralization-tf package using pip command

!pip install gradient-centralization-tf

  1. Import required libraries 
 import tensorflow as tf
 from time import time  #for execution time computation
 import os  #for interacting with the operating system
 import zipfile  #for extracting the dataset's .zip files
 import gctf  #Gradient Centralization for TensorFlow
 #for image augmentation
 from tensorflow.keras.preprocessing.image import ImageDataGenerator
 from tensorflow.keras.optimizers import RMSprop
 #to print results of comparison in tabular form
 from tabulate import tabulate
  1. Download the data from GCS (Google Cloud Storage)
 #Get training data
 !wget --no-check-certificate \
  human.zip \
     -O /tmp/horse-or-human.zip
 #Get validation data
 !wget --no-check-certificate \
     https://storage.googleapis.com/laurencemoroney-blog.appspot.com/validation-horse-or-human.zip \
     -O /tmp/validation-horse-or-human.zip
  1. Read and extract the dataset’s downloaded .zip files
 #Path of the training data file
 file = '/tmp/horse-or-human.zip'
 #Open the zip file, extract the data and close the archive automatically
 with zipfile.ZipFile(file, 'r') as reference:
     reference.extractall('/tmp/horse-or-human')
 #Repeat the process for extracting validation data
 file = '/tmp/validation-horse-or-human.zip'
 with zipfile.ZipFile(file, 'r') as reference:
     reference.extractall('/tmp/validation-horse-or-human')
  1. Define the paths of the horse and human image directories used for training and validation
 # Directory with training horse pictures
 horse_train = os.path.join('/tmp/horse-or-human', 'horses')
 # Directory with training human pictures
 human_train = os.path.join('/tmp/horse-or-human', 'humans')
 # Directory with validation horse pictures
 horse_validation = os.path.join('/tmp/validation-horse-or-human', 'horses')
 # Directory with validation human pictures
 human_validation = os.path.join('/tmp/validation-horse-or-human', 'humans')
  1. Augment the training images and rescale the validation images
 #Randomly transform the training images
 trainDatagen = ImageDataGenerator(
       rescale=1./255,  #rescaling factor
       rotation_range=40,  #random rotation of up to 40 degrees
       width_shift_range=0.2,  #fraction of total image width
       height_shift_range=0.2,  #fraction of total image height
       shear_range=0.2,  #shear intensity
       zoom_range=0.2,  #zooming range will be [1-0.2,1+0.2] = [0.8,1.2]
       horizontal_flip=True,  #randomly flip the image horizontally
       fill_mode='nearest')  #way to fill the points outside the input's boundaries
 #Only rescale the validation images (no augmentation)
 validDatagen = ImageDataGenerator(rescale=1./255)
 # Flow training images in batches of 128 using trainDatagen generator
 trainGen = trainDatagen.flow_from_directory(
         '/tmp/horse-or-human/',  # source directory for training images
         target_size=(300, 300),
         batch_size=128,
         #binary labels are required because this is a binary classification
         #task (horse or human) trained with the binary_crossentropy loss
         class_mode='binary')
 # Similarly, flow validation images in batches of 32 using validDatagen
 validationGen = validDatagen.flow_from_directory(
         '/tmp/validation-horse-or-human/',
         target_size=(300, 300),
         batch_size=32,
         class_mode='binary')

Output:

 Found 1027 images belonging to 2 classes.
 Found 256 images belonging to 2 classes. 

The above output shows that our data has a total of 1,027 training images and 256 validation images. Each image belongs to one of two classes: horse or human.
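The batch sizes chosen above determine how many steps are needed to pass over each dataset once. A quick check in plain Python (the variable names are illustrative):

```python
import math

train_images, train_batch = 1027, 128
valid_images, valid_batch = 256, 32

# Number of batches needed to see every image once
full_train_steps = math.ceil(train_images / train_batch)
full_valid_steps = math.ceil(valid_images / valid_batch)

print(full_train_steps, full_valid_steps)  # 9 8
```

The training calls later in this guide use steps_per_epoch=8, so each epoch draws 8 × 128 = 1,024 training images, slightly fewer than the 1,027 available, while validation_steps=8 with batches of 32 covers all 256 validation images.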

  1. Build the DNN model
 myModel = tf.keras.models.Sequential([
     # 1st convolution
     #convolutional layer - 16 filters used and kernel size is 3*3
     tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(300, 300, 3)),
     tf.keras.layers.MaxPooling2D(2, 2),  #pooling layer
     # 2nd convolution
     #convolutional layer - 32 filters used and kernel size is 3*3
     tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
     tf.keras.layers.Dropout(0.5),  #dropout regularization
     tf.keras.layers.MaxPooling2D(2,2),  #pooling layer
     # 3rd convolution
     #convolutional layer - 64 filters used and kernel size is 3*3
     tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
     tf.keras.layers.Dropout(0.5),  #dropout regularization
     tf.keras.layers.MaxPooling2D(2,2),  #pooling layer
     # Flatten the results to feed into a DNN
     tf.keras.layers.Flatten(),
     tf.keras.layers.Dropout(0.5),  #dropout regularization
     # Hidden layer with 512 neurons
     tf.keras.layers.Dense(512, activation='relu'),
     #Output layer with a single sigmoid neuron: outputs near 0 mean horse, outputs near 1 mean human
     tf.keras.layers.Dense(1, activation='sigmoid')
 ])
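To relate this model to the notation used earlier, the sketch below works out, in plain Python, the length M of each weight vector and the number N of weight vectors for every trainable layer. It assumes the Keras defaults used above ('valid' padding and stride 1 for Conv2D, non-overlapping 2×2 max pooling). The helper names and layer labels are made up for this example.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling."""
    return size // pool

size, channels = 300, 3   # 300x300 RGB input
layers = []               # (label, M, N) per trainable layer

for filters in (16, 32, 64):
    # Unfolded conv kernel: M = k1 * k2 * C_in, N = C_out
    layers.append((f"conv{filters}", 3 * 3 * channels, filters))
    size = pool_out(conv_out(size))
    channels = filters

flat = size * size * channels            # length of the flattened feature vector
layers.append(("dense512", flat, 512))   # Dense kernel: M inputs, N outputs
layers.append(("dense1", 512, 1))

for label, m, n in layers:
    print(f"{label}: M={m}, N={n}")
```

The spatial size shrinks from 300 to 35 across the three convolution and pooling stages, so the first Dense layer has weight vectors of length 35 × 35 × 64 = 78,400. GC subtracts one mean per weight vector, i.e. N means per layer.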
  1. Create a callback class that records the training time, so that we can compare the model optimized with GC against the one trained without it
 class TimeTaken(tf.keras.callbacks.Callback):
     def on_train_begin(self, logs=None):  #resets the list of epoch times when training begins
         self.times = []
     def on_epoch_begin(self, epoch, logs=None):  #records time when an epoch begins
         self.epoch_time_start = time()
     def on_epoch_end(self, epoch, logs=None):  #records time when an epoch ends
         #Subtracting the epoch's starting time from the current time gives
         #the time taken by the epoch
         self.times.append(time() - self.epoch_time_start)
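The order in which these hooks fire can be seen in a framework-free sketch. The `EpochTimer` class and the loop below are hypothetical stand-ins for the Keras callback and for `fit()`. The sketch uses `time.perf_counter()`, a monotonic clock that is a reasonable alternative to `time()` for measuring intervals.

```python
from time import perf_counter, sleep

class EpochTimer:
    """Same bookkeeping as the callback above, without the Keras base class."""
    def on_train_begin(self):
        self.times = []
    def on_epoch_begin(self):
        self._start = perf_counter()
    def on_epoch_end(self):
        self.times.append(perf_counter() - self._start)

timer = EpochTimer()
timer.on_train_begin()          # called once, before the first epoch
for epoch in range(3):
    timer.on_epoch_begin()      # called at the start of every epoch
    sleep(0.01)                 # stand-in for one epoch of training
    timer.on_epoch_end()        # called at the end of every epoch

print(len(timer.times), sum(timer.times))  # number of epochs, total seconds
```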
  1. Train the model without using GC
 time1 = TimeTaken()
 #Compile the model
 myModel.compile(loss='binary_crossentropy',  #loss function
               optimizer=RMSprop(learning_rate=1e-4),  #RMSprop with a learning rate of 1e-4
               metrics=['accuracy'])
 #Fit the model on the training data
 hist1 = myModel.fit(
       trainGen,
       steps_per_epoch=8,  #number of steps for each epoch
       epochs=10,  #number of epochs
       verbose=1,
       validation_data=validationGen,
       validation_steps=8,  #number of validation steps
       callbacks=[time1])

[Sample output: screenshot of the training log without GC]

  1. Train the model with GC used for optimization
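 time2 = TimeTaken()
 #Re-create the model with freshly initialized weights so that this run does
 #not continue from the weights trained in the previous step
 myModel = tf.keras.models.clone_model(myModel)
 #Compile the model
 myModel.compile(loss='binary_crossentropy',
               optimizer=gctf.optimizers.rmsprop(learning_rate=1e-4),
               metrics=['accuracy'])
 #Fit the model on training data
 hist2 = myModel.fit(
       trainGen,
       steps_per_epoch=8,
       epochs=10,
       verbose=1,
       validation_data=validationGen,
       validation_steps=8,
       callbacks=[time2])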

[Sample output: screenshot of the training log with GC]

  1. Compare the results of execution with and without GC
 comparisonData = [
     ["Model w/o gctf", sum(time1.times), hist1.history['accuracy'][-1], hist1.history['loss'][-1]],
     ["Model with gctf", sum(time2.times), hist2.history['accuracy'][-1], hist2.history['loss'][-1]],
 ]
 #Tabulate the comparison data using the tabulate() function
 print(tabulate(comparisonData, headers=["Type", "Execution time", "Accuracy", "Loss"]))

[Sample output: screenshot of the comparison table]

Note: The outputs may vary from run to run and with the execution environment you use.

  • Conclusion: In this run, the tabular output shows that using GC for model optimization reduced the training time and the loss and improved the model’s accuracy.
  • The Google Colab notebook of the above implementation can be found here.

