What Are The Alternatives To Batch Normalization In Deep Learning?


In the original BatchNorm paper, Google researchers Sergey Ioffe and Christian Szegedy introduced a method to address a phenomenon called internal covariate shift.

This shift occurs because the distribution of each layer’s inputs changes during training as the parameters of the previous layers change. It slows down training by requiring lower learning rates and careful parameter initialisation, which makes the models harder to train.
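For reference, the batch normalization transform from the original paper normalizes each activation over the mini-batch and then applies a learnable channel-wise affine transformation:

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta
```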



The introduction of batch-normalized networks helped achieve state-of-the-art accuracies with 14 times fewer training steps. This reduction in training cost led to the emergence of many improvements within the machine learning community.

For instance, Coconet is a fairly straightforward CNN with batch normalization. This gives the Collaborative Convolutional Network (CoCoNet) more power to encode the fine-grained nature of the data with limited samples in an end-to-end fashion. The applications of CoCoNet were demonstrated recently when it was used to churn out Bach-like melodies with a few clicks.

Batch normalization (batch norm) is often used to stabilise and accelerate training in deep neural networks, and in many cases it indeed decreases the number of parameter updates required to achieve low training error.

Despite these benefits, batch norm has its drawbacks, and with the surfacing of news such as Google’s claim of ownership over batch normalisation, the focus has now shifted to finding better alternatives. Here are a few that have shown promising results:

Fixup Initialisation

Fixed-update initialization (Fixup) aims to solve the exploding and vanishing gradient problem at the beginning of training by properly rescaling a standard initialisation.

This method rescales the standard initialisation of residual branches by adjusting for the network architecture. Fixup enables stable training of very deep residual networks at maximal learning rate without normalization.

When applied to the image classification benchmarks CIFAR-10 (with Wide-ResNet) and ImageNet (with ResNet), Fixup with proper regularisation is found to match well-tuned baselines trained with normalization.
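As a rough illustration, the sketch below applies Fixup-style rescaling to the residual branches of a toy network. The scaling rule (zero-initialise the last layer of each residual branch, scale the other branch layers by L^(-1/(2m-2)), and replace norm layers with scalar biases and a multiplier) follows the paper, but the `ToyResidualBranch` module and its structure are assumptions made for this example.

```python
import torch
import torch.nn as nn

class ToyResidualBranch(nn.Module):
    """A hypothetical two-convolution residual block (m = 2 weight layers per branch)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        # Fixup replaces norm layers with scalar biases and a scalar multiplier.
        self.bias1 = nn.Parameter(torch.zeros(1))
        self.bias2 = nn.Parameter(torch.zeros(1))
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x):
        out = torch.relu(self.conv1(x + self.bias1))
        out = self.scale * self.conv2(out + self.bias2)
        return torch.relu(x + out)

def fixup_init(blocks, num_blocks_L, m=2):
    """Rescale the standard (He) initialisation of residual branches, per the Fixup recipe."""
    scale = num_blocks_L ** (-1.0 / (2 * m - 2))
    for block in blocks:
        nn.init.kaiming_normal_(block.conv1.weight)
        block.conv1.weight.data.mul_(scale)   # rescale all but the last layer of the branch
        nn.init.zeros_(block.conv2.weight)    # last layer of each residual branch -> 0

blocks = [ToyResidualBranch(16) for _ in range(4)]
fixup_init(blocks, num_blocks_L=len(blocks))
print(blocks[0](torch.randn(1, 16, 8, 8)).shape)
```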

Read the original paper here.

Generalised Hamming Network (GHN)

The researchers at Nokia Technologies illustrated in their work that the celebrated batch normalization (BN) technique actually adapts the “normalized” bias such that it approximates the rightful bias induced by the generalised hamming distance.

And, once the due bias is enforced analytically, neither the optimisation of bias terms nor the sophisticated batch normalization is needed.

The proposed generalised hamming network (GHN) demonstrated faster learning speeds, well-controlled behaviour and state-of-the-art performances on a variety of learning tasks.

The results show that GHN benefits from a fast and robust learning process that is on par with that of the batch-normalization approach, yet without resorting to the sophisticated learning of additional parameters.
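One simplified way to see the connection (a sketch of the intuition, not the paper’s full derivation): the generalised hamming distance between fuzzy values a, b in [0, 1] is a ⊕ b = a + b - 2ab. Summing it element-wise over an input x and a weight vector w and rearranging shows that an ordinary neuron computes a (negated, halved) generalised hamming distance exactly when its bias takes a particular analytic value:

```latex
g(\mathbf{x}, \mathbf{w}) = \sum_i \left( x_i + w_i - 2 x_i w_i \right)
= \sum_i x_i + \sum_i w_i - 2\,\mathbf{w}^{\top}\mathbf{x}
\;\Longrightarrow\;
-\tfrac{1}{2}\, g(\mathbf{x}, \mathbf{w}) = \mathbf{w}^{\top}\mathbf{x} + b,
\quad \text{with } b = -\tfrac{1}{2}\sum_i x_i - \tfrac{1}{2}\sum_i w_i .
```

On this reading, the input-dependent part of the bias is what BN’s batch statistics end up approximating, which is why GHN can enforce it analytically instead of learning it.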

Read the original paper here.

Group Normalization (GN)

GN divides the channels into groups and computes the mean and variance within each group for normalization. GN’s computation is independent of batch size, and its accuracy is stable across a wide range of batch sizes. On ResNet-50 trained on ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants.

Moreover, GN can be naturally transferred from pre-training to fine-tuning, and it can outperform its BN-based counterparts for object detection and segmentation.
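A minimal sketch of the group-wise statistics, assuming a 4-D NCHW tensor and PyTorch; `torch.nn.GroupNorm` provides the same computation out of the box:

```python
import torch
import torch.nn as nn

def group_norm(x, num_groups=32, eps=1e-5):
    """Normalize an NCHW tensor within channel groups (affine parameters omitted for brevity)."""
    n, c, h, w = x.shape
    x = x.view(n, num_groups, c // num_groups, h, w)
    mean = x.mean(dim=(2, 3, 4), keepdim=True)               # statistics per (sample, group)
    var = x.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
    x = (x - mean) / torch.sqrt(var + eps)
    return x.view(n, c, h, w)

x = torch.randn(2, 64, 8, 8)
out = group_norm(x)                              # independent of the batch size
ref = nn.GroupNorm(32, 64, affine=False)(x)      # built-in equivalent
print(torch.allclose(out, ref, atol=1e-5))
```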

Read the original paper here.

Switchable Normalization (SN)

Switchable Normalization (SN) learns to select different normalizers for different normalization layers of a deep neural network. 

Unlike group normalization, which treats the number of groups as a hyper-parameter that must be searched, SN does not have a sensitive hyper-parameter.

SN combines three types of statistics, estimated channel-wise, layer-wise and minibatch-wise by using Instance Norm, Layer Norm and Batch Norm respectively, and it switches among them by learning their importance weights.
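The sketch below illustrates the idea for a convolutional feature map: compute instance-wise, layer-wise and batch-wise means and variances, then blend them with learned softmax weights. The module name and its simplifications (training-mode statistics only, no running averages for inference) are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableNorm2d(nn.Module):
    """Simplified Switchable Normalization: blend IN/LN/BN statistics with learned weights."""
    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        # Importance logits over {instance, layer, batch} statistics.
        self.mean_logits = nn.Parameter(torch.zeros(3))
        self.var_logits = nn.Parameter(torch.zeros(3))

    def forward(self, x):
        mu_in = x.mean((2, 3), keepdim=True)                  # per sample, per channel
        var_in = x.var((2, 3), keepdim=True, unbiased=False)
        mu_ln = x.mean((1, 2, 3), keepdim=True)               # per sample, across channels
        var_ln = x.var((1, 2, 3), keepdim=True, unbiased=False)
        mu_bn = x.mean((0, 2, 3), keepdim=True)               # across the mini-batch
        var_bn = x.var((0, 2, 3), keepdim=True, unbiased=False)

        wm = F.softmax(self.mean_logits, dim=0)
        wv = F.softmax(self.var_logits, dim=0)
        mu = wm[0] * mu_in + wm[1] * mu_ln + wm[2] * mu_bn
        var = wv[0] * var_in + wv[1] * var_ln + wv[2] * var_bn
        return self.gamma * (x - mu) / torch.sqrt(var + self.eps) + self.beta

x = torch.randn(4, 16, 8, 8)
print(SwitchableNorm2d(16)(x).shape)
```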

Read the original paper here.

Attentive Normalization (AN)

BN and its variants take into account different ways of computing the mean and variance within a mini-batch for feature normalization, followed by a learnable channel-wise affine transformation.

Attentive Normalization (AN) is a novel and lightweight integration of feature normalization and feature channel-wise attention. AN learns a small number of scale and offset parameters per channel (i.e., different affine transformations), and their weighted sums (i.e., mixtures) are used in the final affine transformation.

AN is complementary and applicable to existing variants of BN. In experiments on the ImageNet-1K classification dataset and the MS-COCO object detection and instance segmentation dataset, AN obtains significantly better performance than vanilla BN. It also outperforms two state-of-the-art variants of BN, GN and SN.
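A rough sketch of the idea (the module below is an assumption for illustration, not the authors’ code): keep K candidate affine transformations per channel, predict instance-specific mixture weights from globally pooled features, and apply the blended scale and offset after a standard feature normalization such as BN or GN.

```python
import torch
import torch.nn as nn

class AttentiveAffine2d(nn.Module):
    """Mixture of K channel-wise affine transforms, weighted by a tiny attention head."""
    def __init__(self, num_channels, k=5):
        super().__init__()
        self.gammas = nn.Parameter(torch.ones(k, num_channels))
        self.betas = nn.Parameter(torch.zeros(k, num_channels))
        self.attn = nn.Linear(num_channels, k)    # predicts mixture weights per instance

    def forward(self, x_normalized):
        pooled = x_normalized.mean((2, 3))                 # global average pooling: N x C
        weights = torch.sigmoid(self.attn(pooled))         # N x K mixture weights
        gamma = weights @ self.gammas                      # N x C blended scales
        beta = weights @ self.betas                        # N x C blended offsets
        return gamma[:, :, None, None] * x_normalized + beta[:, :, None, None]

feat = torch.randn(4, 32, 8, 8)
normalized = nn.GroupNorm(8, 32, affine=False)(feat)       # normalization without affine
out = AttentiveAffine2d(32)(normalized)
print(out.shape)
```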

Read the original paper here.

Online Normalization

Online Normalization is a new technique for normalizing the hidden activations of a neural network. Like Batch Normalization, it normalizes the sample dimension. While Online Normalization does not use batches, it is as accurate as Batch Normalization. The idea here is to resolve a theoretical limitation of Batch Normalization by introducing an unbiased technique for computing the gradient of normalized activations. 

Online Normalization works with automatic differentiation by adding statistical normalization as a primitive. This technique can be used in cases not covered by some other normalizers, such as recurrent networks, fully connected networks, and networks with activation memory requirements prohibitive for batching. 
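As a forward-pass-only caricature (the paper’s key contribution, an unbiased gradient correction in the backward pass, is omitted here), the running statistics can be maintained with exponential decay and each incoming sample normalized against them:

```python
import torch

class OnlineNormForward:
    """Forward-only sketch: normalize each sample against exponentially decayed statistics."""
    def __init__(self, num_features, decay=0.99, eps=1e-5):
        self.mu = torch.zeros(num_features)
        self.var = torch.ones(num_features)
        self.decay = decay
        self.eps = eps

    def __call__(self, x):                       # x: one sample of shape (num_features,)
        y = (x - self.mu) / torch.sqrt(self.var + self.eps)
        # Update running statistics after normalizing the current sample.
        self.var = self.decay * self.var + (1 - self.decay) * (x - self.mu) ** 2
        self.mu = self.decay * self.mu + (1 - self.decay) * x
        return y

norm = OnlineNormForward(8)
for _ in range(3):
    print(norm(torch.randn(8)).shape)
```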

Read the original paper here.

Equi-Normalization Of Neural Networks 

Inspired by the Sinkhorn-Knopp algorithm, the researchers introduced a fast iterative method for minimising the L2 norm of the weights, or equivalently the weight decay regulariser, which provably converges to a unique solution.

The results show that interleaving this algorithm with stochastic gradient descent (SGD) during training improves the test accuracy. Especially for small batches, this approach offers an alternative to batch and group normalization on CIFAR-10 and ImageNet with a ResNet-18.
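A minimal sketch of one balancing step for two consecutive linear layers with a ReLU in between (a positively homogeneous activation, so the rescaling leaves the network function unchanged); iterating such steps across all layers is the Sinkhorn-style part:

```python
import torch
import torch.nn as nn

def balance_pair(lin1: nn.Linear, lin2: nn.Linear, eps=1e-12):
    """Rescale each hidden unit so its incoming and outgoing weight norms match,
    which reduces the total L2 norm without changing the ReLU network's output."""
    with torch.no_grad():
        row_norms = lin1.weight.norm(dim=1)      # incoming norm per hidden unit
        col_norms = lin2.weight.norm(dim=0)      # outgoing norm per hidden unit
        scale = torch.sqrt(col_norms / (row_norms + eps))
        lin1.weight.mul_(scale[:, None])
        if lin1.bias is not None:
            lin1.bias.mul_(scale)
        lin2.weight.div_(scale[None, :])

lin1, lin2 = nn.Linear(10, 32), nn.Linear(32, 5)
x = torch.randn(4, 10)
before = lin2(torch.relu(lin1(x)))
balance_pair(lin1, lin2)
after = lin2(torch.relu(lin1(x)))
print(torch.allclose(before, after, atol=1e-5))  # the network function is preserved
```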


Read the original paper here.

Using Weight Normalization

Weight normalization accelerates the convergence of stochastic gradient descent optimisation by re-parameterising weight vectors in neural networks. However, previous works had not studied initialisation strategies for weight normalization.

The researchers at Salesforce Research, along with AI pioneer Yoshua Bengio, proposed a new strategy based on a theoretical analysis using mean-field approximation.

They ran over 2,500 experiments and evaluated their proposal on image datasets, showing that the proposed initialisation outperforms existing initialisation methods in terms of generalisation performance, robustness to hyper-parameter values and variance between seeds, especially when networks get deeper, in which case existing methods fail to even start training.

The results show that using this initialisation in conjunction with learning-rate warmup reduces the gap between the performance of weight-normalized and batch-normalized networks.
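For context, the weight-norm reparameterization itself decouples each weight vector’s direction from its magnitude, w = g · v / ||v||; in PyTorch it is available as a wrapper (the layer sizes below are arbitrary):

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

# Wrap a layer so its weight is reparameterized as w = g * v / ||v||.
layer = weight_norm(nn.Linear(128, 64))
print(sorted(name for name, _ in layer.named_parameters()))
# ['bias', 'weight_g', 'weight_v'] -- magnitude g and direction v are learned separately.

x = torch.randn(8, 128)
print(layer(x).shape)
```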

Read the original paper here.

Stochastic Normalizations as Bayesian Learning

On investigation, the researchers explored the reasons why Batch Normalization (BN) improves the generalisation performance of deep networks, one of which is the randomness of batch statistics. This randomness appears in the parameters rather than in the activations and admits an interpretation as practical Bayesian learning.

Building on this, they showed that generalisation performance can be improved significantly by Bayesian learning of the same form, obtaining test performance comparable to BN and, at the same time, better validation losses suitable for subsequent output uncertainty estimation through an approximate Bayesian posterior.

Read the original paper here.

Self-Normalizing Neural Networks (SNNs)

While batch normalization requires explicit normalization, neuron activations of SNNs automatically converge towards zero mean and unit variance. The activation function of SNNs is the “scaled exponential linear unit” (SELU), which induces self-normalizing properties.

The convergence property of SNNs makes it possible to (1) train deep networks with many layers, (2) employ strong regularisation, and (3) make learning highly robust. Furthermore, for activations not close to unit variance, the authors prove an upper and lower bound on the variance; thus, vanishing and exploding gradients are impossible.
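The SELU activation is selu(x) = λ·x for x > 0 and λ·α·(eˣ - 1) otherwise, with fixed constants α ≈ 1.6733 and λ ≈ 1.0507 derived in the paper; PyTorch ships it directly. The sketch below is a minimal self-normalizing MLP under the LeCun-normal initialisation assumed by the derivation:

```python
import torch
import torch.nn as nn

def snn_block(in_features, out_features):
    layer = nn.Linear(in_features, out_features)
    # LeCun-normal initialisation (std = 1/sqrt(fan_in)) is assumed by the SELU derivation.
    nn.init.normal_(layer.weight, std=in_features ** -0.5)
    nn.init.zeros_(layer.bias)
    return nn.Sequential(layer, nn.SELU())

net = nn.Sequential(*[snn_block(256, 256) for _ in range(8)])
x = torch.randn(1024, 256)
out = net(x)
print(out.mean().item(), out.std().item())   # stays near zero mean and unit variance
```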

Read the original paper here.

Further reading sources:

Check this discussion on Batch-Norm.

