
Can We Use Batch Normalisation For NLP


Batch normalisation (batch norm) is one of the most widely used deep learning techniques for stabilising and accelerating the training of deep neural networks. It reduces the number of parameter updates required to reach a low training error, and this reduction in training effort has enabled many subsequent advances in the machine learning community.

Batch norm has matched state-of-the-art accuracies with 14 times fewer training steps. However, its use is largely restricted to computer vision applications: batch norm's forays into NLP have been met with many challenges and degraded performance. As a result, layer normalisation (LN) has handled most NLP tasks instead.

To address the shortcomings of batch norm in NLP, researchers at the University of California, Berkeley, propose Power Normalisation (PN).

In this scheme, the authors:

  • relax the zero-mean normalisation in BN, 
  • incorporate a running quadratic mean instead of per-batch statistics to stabilise fluctuations, and
  • use an approximate backpropagation to incorporate the running statistics in the forward pass. 
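For reference, here is a minimal PyTorch-style sketch (assuming activations of shape (batch, features), with the learnable scale and shift omitted) of the zero-mean, unit-variance transform that standard BN applies per batch, which is the part PN relaxes:

```python
import torch

def batch_norm_forward(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Standard BN normalisation over the batch dimension:
    subtract the per-feature batch mean and divide by the batch std."""
    mu_b = x.mean(dim=0)                    # batch mean, per feature
    var_b = x.var(dim=0, unbiased=False)    # batch variance, per feature
    return (x - mu_b) / torch.sqrt(var_b + eps)
```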

Overview Of Power Normalisation

The large computation and storage overhead of BN at each time step in recurrent neural networks (RNNs) has been a major roadblock to deploying batch norm for NLP.

Consequently, LN has become the de facto standard for most recent transformer models.

To analyse batch norm's behaviour, the UC Berkeley team studied the batch statistics in a standard setting: ResNet20 on CIFAR-10 and TransformerBN on IWSLT14 (with a standard batch size of 128 and 4K tokens, respectively).

In their first experiment, they probed the fluctuations between the batch statistics µ_B/σ_B (mean/standard deviation) and the corresponding BN running statistics µ/σ throughout training.

The authors observe that TransformerBN shows significantly larger distances between the batch statistics and the running statistics than ResNet20 on CIFAR-10, which exhibits close to zero fluctuations.
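As a rough way to reproduce this kind of probe (a simplified sketch, not necessarily the exact distance measure used in the paper), one can track how far each batch's statistics drift from BN's running statistics during training:

```python
import torch

def stat_fluctuation(x, running_mu, running_var, momentum=0.1):
    """Measure the gap between per-batch statistics and the BN running
    statistics, then apply the usual running-average update."""
    mu_b = x.mean(dim=0)
    var_b = x.var(dim=0, unbiased=False)
    mu_gap = (mu_b - running_mu).abs().mean()                     # mean fluctuation
    sigma_gap = (var_b.sqrt() - running_var.sqrt()).abs().mean()  # std fluctuation
    running_mu = (1 - momentum) * running_mu + momentum * mu_b
    running_var = (1 - momentum) * running_var + momentum * var_b
    return mu_gap, sigma_gap, running_mu, running_var
```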

Power Normalisation (PN), claim the authors, effectively resolves the performance degradation of BN.

In PN, the authors enforce a unit quadratic mean for the activations instead of unit variance. The intuition is that enforcing zero mean and unit variance in BN is detrimental because of the large variations in the mean.
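A minimal sketch of this relaxed scheme (the PN-V variant), assuming activations of shape (batch, features) and learnable parameters gamma and beta: the batch's quadratic mean replaces the variance, and no mean is subtracted.

```python
import torch

def pn_v_forward(x, gamma, beta, eps=1e-5):
    """PN-V style normalisation (simplified): enforce a unit quadratic mean
    per feature instead of zero mean / unit variance."""
    psi_b = (x ** 2).mean(dim=0)          # per-feature quadratic mean of the batch
    x_hat = x / torch.sqrt(psi_b + eps)   # scale only, no mean subtraction
    return gamma * x_hat + beta
```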


Their experiments show that, unlike the mean/variance, the quadratic mean is significantly more stable for transformers.

Even though TransformerPN-V outperforms TransformerBN, it still cannot match the performance of LN. To close this gap, the authors recommend using running statistics for the quadratic mean instead of per-batch statistics.
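A hedged sketch of that change, building on the previous snippet: the per-batch quadratic mean is swapped for a running (exponentially averaged) quadratic mean in the forward pass. The paper pairs this with an approximate backward pass through the running statistic, which this simplified version does not reproduce.

```python
import torch

def pn_forward(x, gamma, beta, running_psi, momentum=0.1, eps=1e-5):
    """PN-style forward pass (simplified): normalise by a running quadratic
    mean rather than per-batch statistics. The approximate backpropagation
    described in the paper is omitted here."""
    psi_b = (x ** 2).mean(dim=0)
    running_psi = (1 - momentum) * running_psi + momentum * psi_b
    x_hat = x / torch.sqrt(running_psi + eps)
    return gamma * x_hat + beta, running_psi
```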

These results were obtained from experiments on a variety of sequence modelling tasks, such as neural machine translation (MT) and language modelling (LM).

For machine translation, the BLEU score was used, as it is a widely accepted metric in the NLP community.
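For readers who want to compute BLEU themselves, here is a small example using the sacrebleu package (the hypothesis and reference sentences below are made up purely for illustration):

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))  # corpus-level BLEU on a 0-100 scale
```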

Read more about Power Normalisation here.

Ever since news broke of Google patenting batch norm, many alternatives have surfaced. While the work above tweaks batch norm for NLP tasks, there are other options for the curious.

Here are a few:

  • Fixup Initialisation: Fixed-update initialisation (Fixup) aims to solve the exploding and vanishing gradient problem at the beginning of training by properly rescaling a standard initialisation.
  • Weight Normalisation: Weight normalisation accelerates the convergence of stochastic gradient descent optimisation by re-parameterising the weight vectors in a neural network (see the sketch after this list). 
  • General Hamming Network (GHN): Researchers at Nokia Technologies illustrated that the celebrated batch normalisation (BN) technique in effect adapts the “normalised” bias so that it approximates the rightful bias induced by the generalised Hamming distance.
  • Group Normalisation (GN): GN divides the channels into groups and computes the mean and variance within each group for normalisation. Its computation is independent of batch size, and its accuracy is stable across a wide range of batch sizes (see the sketch after this list).
  • Switchable Normalisation (SN): SN learns to select different normalisers for different normalisation layers of a deep neural network.
  • Attentive Normalisation (AN): AN is a lightweight integration of feature normalisation and channel-wise feature attention.
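As a quick illustration of how two of these alternatives look in practice, here is a short sketch using PyTorch's built-in implementations (the layer sizes are arbitrary examples):

```python
import torch
import torch.nn as nn

# Group Normalisation: statistics are computed within channel groups,
# so the result does not depend on the batch size.
group_norm = nn.GroupNorm(num_groups=8, num_channels=64)
x = torch.randn(4, 64, 32, 32)            # (batch, channels, height, width)
y = group_norm(x)

# Weight Normalisation: re-parameterise the layer's weight as a direction
# scaled by a learned magnitude, which can speed up SGD convergence.
linear = nn.utils.weight_norm(nn.Linear(512, 512))
z = linear(torch.randn(4, 512))
```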


Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.