Batch normalisation (batch norm) is one of the most widely used techniques for stabilising and accelerating the training of deep neural networks. It reduces the number of parameter updates needed to reach a low training error, and this reduction in training cost has inspired many follow-up improvements within the machine learning community.

When it was introduced, batch norm matched state-of-the-art image classification accuracy with 14 times fewer training steps. However, its use is largely restricted to computer vision applications. Batch norm’s foray into NLP has met with many challenges and degraded performance, so layer normalisation (LN) has become the usual choice for many NLP tasks.

To address the shortcomings of batch norm in NLP, researchers at the University of California, Berkeley, have proposed Power Normalisation (PN).

They use this novel normalisation scheme and

Overview Of Power Normalisation

The large computation and storage overhead of batch norm (BN) at each time-step in recurrent neural networks (RNNs) has been a major roadblock to deploying it for NLP.

Consequently, LN has become the de facto standard in many recent transformer models.

To analyse batch norm’s performance, the Berkeley team studied the batch statistics in two standard settings: ResNet20 on CIFAR-10 with a batch size of 128, and TransformerBN on IWSLT14 with 4K tokens per batch.

In their first experiment, they probed the fluctuations between the batch statistics (µ_B and σ_B) and the corresponding BN running statistics (µ and σ, the running mean and standard deviation) throughout training.

The authors observed that TransformerBN shows significantly larger distances between the batch statistics and the running statistics than ResNet20 on CIFAR-10, which exhibits close to zero fluctuations.
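To make the idea of "distance between batch and running statistics" concrete, here is a minimal sketch in plain Python. It is not the authors' measurement code: the absolute-difference metric, the `momentum` value and the synthetic data are all assumptions chosen for illustration. It compares a stream of batches whose mean is stable with one whose mean jumps from batch to batch.

```python
import math
import random

def batch_stats(xs):
    """Mean and (population) standard deviation of one mini-batch."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, math.sqrt(var)

def track_fluctuation(batches, momentum=0.9):
    """Return |mu_B - mu| for every step, where mu is an exponential running mean.

    The metric and the momentum value are illustrative assumptions.
    """
    running_mean = None
    gaps = []
    for xs in batches:
        mu_b, _ = batch_stats(xs)
        if running_mean is None:
            running_mean = mu_b  # initialise from the first batch
        gaps.append(abs(mu_b - running_mean))
        running_mean = momentum * running_mean + (1 - momentum) * mu_b
    return gaps

rng = random.Random(0)

def make_batch(mean, size=128):
    """Hypothetical activations for one feature channel."""
    return [rng.gauss(mean, 1.0) for _ in range(size)]

# One stream with a stable mean, one whose mean changes on every batch.
stable = [make_batch(0.0) for _ in range(50)]
drifting = [make_batch(rng.uniform(-3.0, 3.0)) for _ in range(50)]

stable_gap = sum(track_fluctuation(stable)) / len(stable)
drifting_gap = sum(track_fluctuation(drifting)) / len(drifting)
print(f"average gap, stable stream:   {stable_gap:.3f}")
print(f"average gap, drifting stream: {drifting_gap:.3f}")
```

The drifting stream plays the role of the transformer setting here: its batch mean strays far from the running mean, so statistics accumulated during training are a poor stand-in for what any single batch looks like.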

Power Normalisation (PN), claim the authors, effectively resolves the performance degradation of BN.

In PN, the authors enforce a unit quadratic mean on the activations instead of zero mean and unit variance. The intuition is that enforcing zero mean and unit variance, as BN does, is detrimental when the batch mean varies widely.
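The difference between the two normalisations can be sketched for a single feature channel as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, and the `gamma`, `beta` and `eps` parameters are assumed to mirror the usual affine and numerical-stability terms of batch norm.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Subtract the batch mean, then divide by the batch standard deviation."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

def power_norm_v(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Divide by the batch quadratic mean (root mean square); no mean subtraction."""
    psi_sq = sum(x * x for x in xs) / len(xs)
    return [gamma * x / math.sqrt(psi_sq + eps) + beta for x in xs]

def quadratic_mean(xs):
    """Square root of the mean of the squared values."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))

activations = [2.0, 4.0, 4.0, 6.0]
bn_out = batch_norm(activations)
pn_out = power_norm_v(activations)

print("BN mean:", sum(bn_out) / len(bn_out))
print("PN mean:", sum(pn_out) / len(pn_out))
print("PN quadratic mean:", quadratic_mean(pn_out))
```

With batch norm the output mean is forced to zero, whereas the quadratic-mean version only rescales the activations, so the output never depends on an estimate of the batch mean.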


Their experiments show that, unlike the mean and variance, the quadratic mean is significantly more stable for transformers.

Even though TransformerPN-V, the variant that relies on per-batch statistics, outperforms TransformerBN, it still cannot match the performance of LN. To solve this, the authors recommend using running statistics for the quadratic mean instead of per-batch statistics.
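As a rough illustration of normalising with a running quadratic mean rather than the current batch's, consider the sketch below. The class name, the `momentum` value and the initial estimate are assumptions, and only the forward computation is shown; it does not attempt to reproduce how the authors handle gradients during training.

```python
import math

class RunningPowerNorm:
    """Forward-only sketch: scale activations by a running quadratic mean."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.momentum = momentum      # assumed smoothing factor
        self.eps = eps
        self.running_psi_sq = 1.0     # running estimate of the mean of x^2

    def __call__(self, xs):
        # Normalise with the running statistic, not the current batch's.
        scale = math.sqrt(self.running_psi_sq + self.eps)
        out = [x / scale for x in xs]
        # Then fold the current batch into the running estimate.
        batch_psi_sq = sum(x * x for x in xs) / len(xs)
        self.running_psi_sq = (self.momentum * self.running_psi_sq
                               + (1 - self.momentum) * batch_psi_sq)
        return out

norm = RunningPowerNorm()
batch = [2.0, -2.0, 2.0, -2.0]    # hypothetical activations, quadratic mean of 2
for _ in range(200):
    out = norm(batch)

print("running estimate of mean(x^2):", norm.running_psi_sq)
print("normalised batch:", out)
```

Because the scale comes from an average over many batches, a single unusual batch changes the normalisation only slightly, which is the stabilising effect that running statistics are meant to provide.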

These results were obtained from experiments on a variety of sequence modelling tasks, such as neural machine translation (MT) and language modelling (LM).

For machine translation, the BLEU score was used as the evaluation metric, as it is widely accepted in the NLP community.

Know more about power normalisation here.

Ever since news broke that Google had patented batch norm, many alternatives have surfaced. Though in the above case batch norm was tweaked for use in NLP tasks, there are other options for the curious.

Here are a few: