Last updated March 26, 2020

Can We Use Batch Normalisation For NLP?


Published on March 20, 2020

by Ram Sagar

Batch normalisation (batch norm) is one of the most widely used techniques for stabilising and accelerating the training of deep neural networks. It reduces the number of parameter updates required to achieve low training error, and this reduction in training effort has spurred many further improvements within the machine learning community.

In the original batch norm paper, applied to a state-of-the-art image classification model, the technique matched the baseline accuracy with 14 times fewer training steps. However, the use of batch norm is usually restricted to computer vision applications. Batch norm’s foray into NLP has been met with many challenges and degraded performance. As a result, layer normalisation (LN) has become the usual choice for NLP tasks.

To address the shortcomings of batch norm in NLP, researchers at the University of California, Berkeley, propose Power Normalisation (PN).

With this novel normalisation scheme, they:

relax zero-mean normalisation in BN,
incorporate a running quadratic mean instead of per-batch statistics to stabilise fluctuations, and
use an approximate backpropagation for incorporating the running statistics in the forward pass.

Overview Of Power Normalisation

The large computation and storage overhead of BN at each time-step in recurrent neural networks (RNNs) has been the major roadblock to deploying batch norm for NLP.

And, rightfully, LN has become the de facto standard for many of the latest transformer models.

To analyse batch norm’s performance, the team from the University of California, Berkeley, studied the batch statistics using the standard settings of ResNet20 on CIFAR-10 and TransformerBN on IWSLT14 (using a standard batch size of 128 and 4K tokens, respectively).

In their first experiment, they probed the fluctuations between the batch statistics (µ_B and σ_B) and the corresponding BN running statistics (µ and σ, the mean and standard deviation) throughout training.

The authors observe that TransformerBN shows significantly larger distances between the batch statistics and the running statistics than ResNet20 on CIFAR-10, which exhibits close to zero fluctuations.

Power Normalisation (PN), claim the authors, effectively resolves the performance degradation of BN.

In the case of PN, the authors enforce unit quadratic mean instead of unit variance for the activations. The intuition here is that enforcing zero-mean and unit variance in BN is detrimental due to the large variations in the mean.

Their experiments show that, unlike the mean and variance, the quadratic mean is significantly more stable for transformers.
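The idea of enforcing a unit quadratic mean without subtracting the mean can be illustrated with a minimal sketch. This is only an illustration in plain Python, not the authors' implementation: the function name quadratic_mean_normalise and the eps constant are assumptions, and the learnable scale and shift that normalisation layers usually apply afterwards are left out.

```python
import math

def quadratic_mean_normalise(batch, eps=1e-5):
    """Scale each feature so that its quadratic mean over the batch is about 1.

    batch: list of rows, each row a list of feature values.
    Unlike batch norm, the batch mean is NOT subtracted.
    eps is a hypothetical small constant for numerical stability.
    """
    n = len(batch)
    d = len(batch[0])
    # psi_sq[j] is the mean of x^2 over the batch for feature j
    psi_sq = [sum(row[j] ** 2 for row in batch) / n for j in range(d)]
    return [[row[j] / math.sqrt(psi_sq[j] + eps) for j in range(d)]
            for row in batch]

# Two features on very different scales
batch = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
normed = quadratic_mean_normalise(batch)

# Mean of squares per feature after normalisation (close to 1 for both)
check = [sum(row[j] ** 2 for row in normed) / len(normed) for j in range(2)]
```

Note that the normalised features keep a non-zero mean; only their overall scale is fixed.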

Even though TransformerPN-V, the variant that normalises with the per-batch quadratic mean, outperforms TransformerBN, it still cannot match the performance of LN. To solve this, the authors recommend using running statistics for the quadratic mean instead of per-batch statistics.
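A rough sketch of normalising with a running quadratic mean is shown below. It assumes a simple moving-average update; the class name RunningQuadraticMean, the momentum alpha and the constant eps are hypothetical choices for illustration, and the approximate backward pass that the authors pair with this forward pass is not shown.

```python
import math

class RunningQuadraticMean:
    """Forward pass only: normalise with a running mean of squares.

    alpha (momentum) and eps are assumed values, not taken from the paper.
    """

    def __init__(self, num_features, alpha=0.9, eps=1e-5):
        self.alpha = alpha
        self.eps = eps
        self.psi_sq = [1.0] * num_features  # running mean of x^2 per feature

    def __call__(self, batch):
        # Normalise with the statistic accumulated from earlier batches,
        # rather than with the current batch's own statistic.
        scale = [math.sqrt(p + self.eps) for p in self.psi_sq]
        out = [[x / s for x, s in zip(row, scale)] for row in batch]

        # Then fold the current batch into the running statistic.
        n = len(batch)
        for j in range(len(self.psi_sq)):
            batch_sq = sum(row[j] ** 2 for row in batch) / n
            self.psi_sq[j] = (self.alpha * self.psi_sq[j]
                              + (1 - self.alpha) * batch_sq)
        return out

pn = RunningQuadraticMean(num_features=1)
first = pn([[2.0], [2.0]])       # running statistic still at its initial value
for _ in range(200):
    last = pn([[2.0], [2.0]])    # running statistic has converged towards 4
```

Because the scale comes from the running statistic, a single unusual batch shifts the normalisation only slightly, which is the stabilising effect the running quadratic mean is meant to provide.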

These results were obtained from experiments on a variety of sequence modelling tasks, such as neural machine translation (MT) and language modelling (LM).

For machine translation, the BLEU score was used as it is widely accepted in the NLP community.

Know more about power normalisation here.

Ever since news broke of Google patenting batch norm, a lot of alternatives have surfaced. Though in the above case batch norm was tweaked for use in NLP tasks, there are other options for the curious.

Here are a few:

Fixup Initialisation: Fixed-update initialisation (Fixup) was aimed at solving the exploding and vanishing gradient problem at the beginning of training via properly rescaling a standard initialisation.

Using Weight Normalisation: Weight normalisation accelerates the convergence of stochastic gradient descent optimisation by re-parameterising weight vectors in neural networks.

Generalised Hamming Network (GHN): Researchers at Nokia Technologies illustrated in their work that the celebrated batch normalisation (BN) technique actually adapts the “normalised” bias such that it approximates the rightful bias induced by the generalised Hamming distance.

Group Normalisation (GN): GN divides the channels into groups and computes within each group the mean and variance for normalisation. GN’s computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes.

Switchable Normalisation (SN): SN learns to select different normalisers for different normalisation layers of a deep neural network.

Attentive Normalisation (AN): AN is a novel and lightweight integration of feature normalisation and feature channel-wise attention.
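Of the alternatives listed above, Group Normalisation is the easiest to illustrate in a few lines. The following is a minimal sketch for a single sample in plain Python, without the learnable scale and shift; the function name group_norm, the input layout and the eps constant are assumptions made for this example.

```python
import math

def group_norm(channels, num_groups, eps=1e-5):
    """Normalise one sample by groups of channels.

    channels: list of C channels for ONE sample, each a flat list of values.
    The statistics come from this sample alone, so the result does not
    depend on the batch size.
    """
    c = len(channels)
    if c % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    size = c // num_groups
    out = []
    for start in range(0, c, size):
        group = channels[start:start + size]
        values = [x for ch in group for x in ch]
        mean = sum(values) / len(values)
        var = sum((x - mean) ** 2 for x in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(x - mean) * inv_std for x in ch] for ch in group])
    return out

# Four channels split into two groups with very different scales
sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
normed = group_norm(sample, num_groups=2)
```

After normalisation both groups have roughly zero mean and unit variance, even though the second group's raw values are ten times larger than the first's.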
