# Hands-on Guide to Gradient Centralization

Model optimization plays a vital role in improving the performance of a Deep Neural Network (DNN). Techniques such as Batch Normalization and Weight Standardization perform Z-score standardization on the activations or weights of the network. This article describes a novel optimization method called ‘Gradient Centralization (GC)’, which works directly on gradients instead. It was introduced in April 2020 by Hongwei Yong, Jianqiang Huang, Xiansheng Hua and Lei Zhang, researchers at The Hong Kong Polytechnic University and the DAMO Academy.

## Overview of Gradient Centralization

GC can be viewed as projected gradient descent with a constrained loss function: centralizing the gradients imposes a constraint on each weight vector, which in turn constrains the loss function. It improves the generalization performance of DNNs by regularizing both the weight space and the output feature space. It also improves the Lipschitzness of the loss function and its gradient, which makes the training process more stable and more efficient.

## How does Gradient Centralization differ from Batch Normalization and Weight Standardization?

Batch Normalization (BN) uses first- and second-order statistics (mean and variance) to perform Z-score standardization on activations. Weight Standardization (WS) applies the same Z-score standardization to weight vectors. Both BN and WS can improve the Lipschitz property of the loss function. What if such normalization were carried out directly on the gradients instead of on weights or activations?

If Z-score standardization is applied to the gradients, as BN and WS apply it to activations and weights respectively, the stability of model training does not improve. So GC takes a different approach. Instead of using both the mean and the variance, as BN and WS do, it only centralizes the gradients to have zero mean, hence the name Gradient “Centralization”.
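The difference between the two operations can be seen on a single vector. The following is a minimal, dependency-free sketch; the gradient values and the helper names `z_score` and `centralize` are made up for illustration and are not taken from the GC paper or any library.

```python
from statistics import mean, pstdev


def z_score(values):
    """Z-score standardization: subtract the mean, then divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def centralize(values):
    """Centralization: subtract the mean only, leaving the scale untouched."""
    mu = mean(values)
    return [v - mu for v in values]


grad = [0.5, -1.0, 2.0, 0.5]  # hypothetical gradient vector

standardized = z_score(grad)
centered = centralize(grad)

print(centered)  # [0.0, -1.5, 1.5, 0.0]
```

Both results have zero mean, but only the standardized vector is rescaled to unit variance; the centralized vector keeps the spread of the original gradient.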

## How does Gradient Centralization work?

Let us have a look at the basic notations first.

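Weight matrix for FC (Fully Connected) layers:

$$W_{fc} \in \mathbb{R}^{C_{in} \times C_{out}} \quad \ldots \text{(i)}$$

Weight tensor for a convolutional layer is denoted as:

$$W_{conv} \in \mathbb{R}^{(C_{in} k_1 k_2) \times C_{out}} \quad \ldots \text{(ii)}$$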

In equations (i) and (ii) above,

$C_{in}$: number of input channels

$C_{out}$: number of output channels

$k_1$ and $k_2$: size of the kernels in convolutional layers

In general, the weight matrix can be denoted as $W$.

$\nabla_W L$ and $\nabla_{w_i} L$ represent the gradient of the loss function $L$ w.r.t. the weight matrix $W$ and the weight vector $w_i$ ($i = 1, 2, \ldots, N$), i.e. the $i$-th column vector of the weight matrix, respectively.

The GC operator $\Phi_{GC}$ for the $i$-th weight vector $w_i$ is given as:

$$\Phi_{GC}(\nabla_{w_i} L) = \nabla_{w_i} L - \mu_{\nabla_{w_i} L}$$

Here,

$$\mu_{\nabla_{w_i} L} = \frac{1}{M} \sum_{j=1}^{M} \nabla_{w_{i,j}} L$$

$M$ is the number of neurons in the previous layer and $N$ is the number of neurons in the next layer.
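In matrix form, the same operation can be sketched as a projection. The symbols $P$, $I$ (the $M \times M$ identity matrix) and $\mathbf{1}$ (the $M$-dimensional all-ones column vector) are introduced here only for illustration, with $\nabla_W L$ denoting the gradient of the loss $L$ w.r.t. the weight matrix $W$ and $\Phi_{GC}$ the GC operator:

$$\Phi_{GC}(\nabla_W L) = P \, \nabla_W L, \qquad P = I - \frac{1}{M} \mathbf{1} \mathbf{1}^T$$

Because $\mathbf{1}^T \mathbf{1} = M$, it follows that $\mathbf{1}^T P = \mathbf{1}^T - \frac{1}{M} (\mathbf{1}^T \mathbf{1}) \mathbf{1}^T = 0$, so every column of the centralized gradient sums to zero.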

GC adds a single simple step to the optimization: it centralizes each gradient vector to have zero mean. To do so, it computes the mean of each column vector (or slice, for a convolutional weight tensor) of the gradient matrix and then subtracts that mean from every element of the column. The process can be diagrammatically represented as follows:

Image source: GC research paper
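The column-wise step described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the code of the paper or of any package; the function name `centralize_gradient` and the gradient values are hypothetical, and the gradient is assumed to be given as a list of M rows with N entries each.

```python
def centralize_gradient(grad):
    """Subtract each column's mean from that column of an M x N gradient matrix.

    `grad` is a list of M rows, each a list of N floats; column i plays the
    role of the gradient of the i-th weight vector.
    """
    m = len(grad)
    n = len(grad[0])
    # Mean of every column, taken over the M rows
    col_means = [sum(grad[j][i] for j in range(m)) / m for i in range(n)]
    # Remove the column mean from every entry of that column
    return [[grad[j][i] - col_means[i] for i in range(n)] for j in range(m)]


# Hypothetical 3 x 2 gradient (M = 3 inputs, N = 2 output neurons)
grad = [[1.0, 4.0],
        [2.0, 0.0],
        [6.0, 2.0]]

centered = centralize_gradient(grad)
for row in centered:
    print(row)
# [-2.0, 2.0]
# [-1.0, -2.0]
# [3.0, 0.0]
```

The column means are 3.0 and 2.0, and after subtracting them each column sums to zero.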

## Highlighting features of Gradient Centralization

• It regularizes both the weight space and the output feature space. This reduces overfitting of the model on training data and improves its generalization performance.
• It can smooth the optimization landscape, similar to Batch Normalization (BN) and Weight Standardization (WS). It thus makes the gradients of the weights more predictive and stable, which allows faster model training.
• It also helps avoid gradient explosion, thereby stabilizing the model training process.
• It can be easily embedded into current gradient-based DNN optimization algorithms such as Adam and SGDM.
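To illustrate how little is needed to embed GC into an optimizer, here is a hedged sketch of one SGD-with-momentum step on a single weight vector. The update rule used is one common form of momentum, chosen for illustration; the function names, learning rate, momentum value and gradients are all hypothetical and not taken from the paper or from any library.

```python
def centralize(vec):
    """Remove the mean from a gradient vector (the GC step)."""
    mu = sum(vec) / len(vec)
    return [v - mu for v in vec]


def sgdm_gc_step(weights, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum step where the gradient is centralized
    before it enters the momentum buffer."""
    g = centralize(grad)
    velocity = [momentum * v + gi for v, gi in zip(velocity, g)]
    weights = [w - lr * v for w, v in zip(weights, velocity)]
    return weights, velocity


weights = [0.2, -0.4, 0.7]
velocity = [0.0, 0.0, 0.0]
initial_sum = sum(weights)

# Three made-up gradients for three consecutive steps
for grad in ([1.0, 2.0, 3.0], [0.5, -0.5, 3.0], [-2.0, 1.0, 1.0]):
    weights, velocity = sgdm_gc_step(weights, grad, velocity)

# The sum of the weight vector is unchanged (up to rounding)
print(abs(sum(weights) - initial_sum) < 1e-9)
```

Because every centralized gradient sums to zero, the momentum buffer and every update also sum to zero, so the sum of the weight vector stays at its initial value. This is one way to see the weight-space constraint that GC imposes.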

## Practical implementation

Here’s a demonstration of GC using gradient-centralization-tf, a Python package that implements GC for TensorFlow. We use the Horses or Humans dataset, which has 500 rendered images of horses and 527 rendered images of humans in different poses and locations. Each image is 300×300 pixels in 24-bit color. The code was run in Google Colab with Python 3.7.10.

Step-wise explanation of the code is as follows:

1. Install the gradient-centralization-tf package using pip command

`!pip install gradient-centralization-tf`

1. Import required libraries
```
import tensorflow as tf
from time import time  # for execution time computation
import os  # for interacting with the operating system
import zipfile  # for extracting the dataset's .zip files
import gctf
# for image augmentation
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import RMSprop
# to print results of comparison in tabular form
from tabulate import tabulate
```
1. Download the training and validation data
```
# Get training data
!wget --no-check-certificate \
human.zip \
-O /tmp/horse-or-human.zip
# Get validation data
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/validation-horse-or-human.zip \
    -O /tmp/validation-horse-or-human.zip
```
1. Extract the downloaded archives
```
# Extract the training data; the with-block closes the archive automatically
with zipfile.ZipFile('/tmp/horse-or-human.zip', 'r') as reference:
    reference.extractall('/tmp/horse-or-human')
# Repeat the process for the validation data
with zipfile.ZipFile('/tmp/validation-horse-or-human.zip', 'r') as reference:
    reference.extractall('/tmp/validation-horse-or-human')
```
1. Define the paths of the horse and human image directories used for training and validation
```
# Directory with training horse pictures
horse_train = os.path.join('/tmp/horse-or-human/horses')
# Directory with training human pictures
human_train = os.path.join('/tmp/horse-or-human/humans')
# Directory with validation horse pictures
horse_validation = os.path.join('/tmp/validation-horse-or-human/horses')
# Directory with validation human pictures
human_validation = os.path.join('/tmp/validation-horse-or-human/humans')
```
1. Augment the training images and create the training and validation data generators
```
# Augment the training images
trainDatagen = ImageDataGenerator(
    rescale=1./255,  # rescaling factor
    rotation_range=40,  # random rotations of up to 40 degrees
    width_shift_range=0.2,  # fraction of total image width
    height_shift_range=0.2,  # fraction of total image height
    shear_range=0.2,  # shear intensity
    zoom_range=0.2,  # zooming range will be [1-0.2, 1+0.2] = [0.8, 1.2]
    horizontal_flip=True,  # randomly flip images horizontally
    fill_mode='nearest')  # way to fill the points outside the input's boundaries

# Only rescale the validation images
validDatagen = ImageDataGenerator(rescale=1./255)

# Flow training images in batches of 128 using the trainDatagen generator
trainGen = trainDatagen.flow_from_directory(
    '/tmp/horse-or-human/',  # source directory for training images
    target_size=(300, 300),
    batch_size=128,
    # binary labels are required because we will use binary_crossentropy loss:
    # this is a binary classification task (classify images as horse or human)
    class_mode='binary')

# Similarly, flow validation images in batches of 32 using validDatagen
validationGen = validDatagen.flow_from_directory(
    '/tmp/validation-horse-or-human/',
    target_size=(300, 300),
    batch_size=32,
    class_mode='binary')
```

Output:

```
Found 1027 images belonging to 2 classes.
Found 256 images belonging to 2 classes.
```

The above output shows that our data has a total of 1027 training images and 256 validation images. Each image belongs to one of two classes: horse or human.
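The image counts above, together with the batch sizes of 128 and 32, determine how many batches make up one full pass over each set. The following is a small arithmetic sketch; the variable names are chosen here for illustration.

```python
import math

train_images, train_batch = 1027, 128  # counts reported by flow_from_directory
valid_images, valid_batch = 256, 32

# Number of batches needed to see every image once
train_steps = math.ceil(train_images / train_batch)
valid_steps = math.ceil(valid_images / valid_batch)

print(train_steps, valid_steps)  # 9 8
```

A training step count of 8, as used in the training steps of this walkthrough, therefore covers at most 8 × 128 = 1024 of the 1027 training images per epoch, while 8 validation steps cover the whole validation set.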

1. Build the DNN model
```
myModel = tf.keras.models.Sequential([
    # 1st convolution
    # convolutional layer - 16 filters used and kernel size is 3*3
    tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(300, 300, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),  # pooling layer
    # 2nd convolution
    # convolutional layer - 32 filters used and kernel size is 3*3
    tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
    tf.keras.layers.Dropout(0.5),  # dropout regularization
    tf.keras.layers.MaxPooling2D(2,2),  # pooling layer
    # 3rd convolution
    # convolutional layer - 64 filters used and kernel size is 3*3
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.Dropout(0.5),  # dropout regularization
    tf.keras.layers.MaxPooling2D(2,2),  # pooling layer
    # Flatten the results to feed into a DNN
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),  # dropout regularization
    # Hidden layer with 512 neurons
    tf.keras.layers.Dense(512, activation='relu'),
    # Output layer with a single sigmoid neuron. Its output lies between
    # 0 (horse) and 1 (human)
    tf.keras.layers.Dense(1, activation='sigmoid')
])
```
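To get a feel for the size of this network, the feature-map sizes can be traced by hand. The sketch below assumes the Keras defaults for these layers (stride-1 convolutions with 'valid' padding, and 2×2 max pooling that drops incomplete windows); the helper names are made up for illustration.

```python
def conv_valid(size, kernel=3):
    """Output size of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def max_pool(size, pool=2):
    """Output size of non-overlapping max pooling (incomplete windows are dropped)."""
    return size // pool


size = 300
for _ in range(3):  # three Conv2D + MaxPooling2D stages
    size = max_pool(conv_valid(size))

flattened = size * size * 64  # 64 filters in the last convolution
dense_params = flattened * 512 + 512  # weights + biases of the 512-unit layer

print(size, flattened, dense_params)  # 35 78400 40141312
```

The spatial size shrinks as 300 → 298 → 149 → 147 → 73 → 71 → 35, so almost all trainable parameters of the model sit in the first dense layer.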
1. Create a callback class that records the training time, so that we can compare the model trained with GC against the one trained without it
```
class TimeTaken(tf.keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        # start with an empty list of per-epoch durations when training begins
        self.times = []

    def on_epoch_begin(self, epoch, logs=None):
        # record the time at which the epoch begins
        self.epoch_time_start = time()

    def on_epoch_end(self, epoch, logs=None):
        # subtracting the epoch's starting time from the current time
        # gives the time taken by the epoch
        self.times.append(time() - self.epoch_time_start)
```
1. Train the model without using GC
```
time1 = TimeTaken()
# Compile the model
myModel.compile(loss='binary_crossentropy',  # loss function
                optimizer=RMSprop(learning_rate=1e-4),  # learning rate of 1e-4
                metrics=['accuracy'])
# Fit the model on the training data
hist1 = myModel.fit(
    trainGen,
    steps_per_epoch=8,  # number of steps for each epoch
    epochs=10,  # number of epochs
    verbose=1,
    validation_data=validationGen,
    validation_steps=8,  # number of validation steps
    callbacks=[time1])
```

Sample output:

1. Train the model with GC used for optimization
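```
time2 = TimeTaken()
# Recompiling alone would keep the weights learned in the previous step,
# so clone the architecture to start again from freshly initialized weights
gcModel = tf.keras.models.clone_model(myModel)
# Compile the model with the GC-enabled RMSprop optimizer
gcModel.compile(loss='binary_crossentropy',
                optimizer=gctf.optimizers.rmsprop(learning_rate=1e-4),
                metrics=['accuracy'])
# Fit the model on the training data
hist2 = gcModel.fit(
    trainGen,
    steps_per_epoch=8,
    epochs=10,
    verbose=1,
    validation_data=validationGen,
    validation_steps=8,
    callbacks=[time2])
```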

Sample output:

1. Compare the results of execution with and without GC
```
comparisonData = [
    ["Model w/o gctf", sum(time1.times),
     hist1.history['accuracy'][-1], hist1.history['loss'][-1]],
    ["Model with gctf", sum(time2.times),
     hist2.history['accuracy'][-1], hist2.history['loss'][-1]]]
# Tabulate the information in comparisonData using the tabulate() function
print(tabulate(comparisonData,
               headers=["Type", "Execution time", "Accuracy", "Loss"]))
```

Sample output:

Note: The outputs may vary for each iteration and depending upon the execution environment you use.

• Conclusion: In our run, the tabular output showed that using GC for model optimization reduced the training time and the loss while improving the model’s accuracy.
• The Google Colab notebook of the above implementation can be found here.

## References

Refer to the following sources to have an in-depth understanding of the Gradient Centralization technique and the related package for its Python implementation:


Copyright Analytics India Magazine Pvt Ltd
