
A neural network is built around simple linear equations like Y = WX + B, which contain parameters called weights, W. These weights are multiplied with the input X and thus play a crucial role in how the model predicts.

Most of the computations in deep neural networks are multiplications between float-valued weights and float-valued activations during the forward inference.
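To make that cost concrete, here is a minimal pure-Python sketch of a single layer computing Y = WX + B while counting scalar operations. The function name `dense_forward` and the toy values are illustrative assumptions, not something taken from the AdderNet paper.

```python
def dense_forward(W, X, B):
    """Compute Y = WX + B for a weight matrix W (a list of rows).

    Returns the output vector together with the number of scalar
    multiplications and additions that were needed.
    """
    mults = adds = 0
    Y = []
    for row, bias in zip(W, B):
        acc = bias
        for w, x in zip(row, X):
            acc += w * x  # one multiplication and one addition per weight
            mults += 1
            adds += 1
        Y.append(acc)
    return Y, mults, adds


# Toy layer: 2 outputs, 3 inputs (values chosen only for illustration).
W = [[0.5, -1.0, 2.0],
     [1.5, 0.0, -0.5]]
X = [2.0, 4.0, 6.0]
B = [1.0, -1.0]

Y, mults, adds = dense_forward(W, X, B)
print(Y, mults, adds)  # [10.0, -1.0] 6 6
```

Every weight costs one multiplication, so the count grows with the number of weights in the layer.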

Prediction scores can go downhill if a weight is updated wrongly, and as the network gets deeper, i.e. as more layers and columns of connected nodes are added, the error is magnified and the results miss the target.

To make models lighter while keeping their performance intact, many solutions have been developed, and one such solution is neural compression.

Neural compression actually refers to a combination of the following techniques:

- Parameter Pruning And Sharing
- Low-Rank Factorisation
- Transferred/Compact Convolutional Filters
- Knowledge Distillation

However, these methods make training models faster but do not eliminate the underlying multiplication operations.

### Can We Avoid Multiplication Altogether?

Convolutions are the gold standard of machine vision models, the default operation for extracting features from visual data. There has hardly been any attempt to replace convolution with a more efficient similarity measure, ideally one that involves only additions.

Instead of developing software and hardware solutions to cater for faster multiplications between layers, can we train models without multiplication?

To answer this question, researchers from Huawei's Noah's Ark Lab and Peking University, in collaboration with the University of Sydney, have come up with AdderNets, or adder networks, which trade the massive multiplications in deep neural networks, especially convolutional neural networks (CNNs), for much cheaper additions to reduce computation costs.

The notion here is that adding two numbers is easy compared to multiplying two numbers.

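AdderNets measure similarity with the L1 norm. In linear algebra, a norm is a function that assigns a non-negative length to a vector, and the L1 norm is the sum of the absolute values of the vector's components.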

For example, take the vector X = [3, 4].

The L1 norm is calculated as:

$$\|X\|_1 = |3| + |4| = 7$$
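In code, the same calculation for X = [3, 4] can be sketched as follows. The helper names are illustrative; the L2 norm is included only for contrast, and `l1_distance` shows how the L1 norm turns into a distance between two vectors, which is the quantity an adder filter relies on.

```python
import math


def l1_norm(v):
    """Sum of absolute values of the components."""
    return sum(abs(x) for x in v)


def l2_norm(v):
    """Euclidean length, shown here only for comparison."""
    return math.sqrt(sum(x * x for x in v))


def l1_distance(a, b):
    """L1 norm of the difference between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


X = [3, 4]
print(l1_norm(X))              # 7
print(l2_norm(X))              # 5.0
print(l1_distance(X, [1, 6]))  # 4
```

Note that the L1 calculations need only subtraction, absolute value and addition, whereas the L2 norm needs a multiplication per component.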

The underlying working of AdderNets, according to Hanting Chen et al., is given as follows:

**Input**: An initialised adder network N with its training set X and the corresponding labels Y, along with the global learning rate γ and the hyper-parameter η.

- Repeat:
  - Select a batch {(x, y)} randomly from X and Y
  - Run AdderNet N on the mini-batch: x → N(x)
  - Calculate ∂Y/∂F and ∂Y/∂X for adder filters
  - Use the chain rule to generate the gradient of parameters in N
  - Calculate the adaptive learning rate for each adder layer
  - Update the parameters of the adder network using stochastic gradient descent
- Until convergence

**Output**: A well-trained adder network N with almost no multiplications.
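The procedure above leaves the gradient and learning-rate formulas implicit. The sketch below fills them in for a single adder filter under assumptions that come from my reading of the AdderNet paper rather than from the steps listed here: the filter output is the negative L1 distance Y = -Σ|X - F|, the filter gradient is taken as X - F, the input gradient is F - X clipped to [-1, 1], and the per-layer learning rate is α = η√k / ‖gradient‖₂, where k is the number of filter elements. All function names and the toy loss are hypothetical.

```python
import math


def adder_forward(x, f):
    """Assumed adder response: negative L1 distance between patch and filter."""
    return -sum(abs(xi - fi) for xi, fi in zip(x, f))


def hard_tanh(v):
    """Clip a value to the range [-1, 1]."""
    return max(-1.0, min(1.0, v))


def adder_backward(x, f, grad_y):
    """Gradients w.r.t. filter and input, given grad_y = dLoss/dY."""
    grad_f = [grad_y * (xi - fi) for xi, fi in zip(x, f)]           # assumed dY/dF = X - F
    grad_x = [grad_y * hard_tanh(fi - xi) for xi, fi in zip(x, f)]  # assumed dY/dX = clip(F - X)
    return grad_f, grad_x


def adaptive_learning_rate(grad_f, eta):
    """Assumed per-layer rate: eta * sqrt(k) / ||grad_f||_2."""
    k = len(grad_f)
    norm = math.sqrt(sum(g * g for g in grad_f))
    return eta * math.sqrt(k) / norm if norm > 0 else 0.0


def update_filter(f, grad_f, gamma, eta):
    """One SGD step with global rate gamma and the adaptive layer rate."""
    alpha = adaptive_learning_rate(grad_f, eta)
    return [fi - gamma * alpha * gi for fi, gi in zip(f, grad_f)]


# Toy example: the loss is -Y, so minimising it pulls the filter towards x.
x = [1.0, 2.0, 3.0]
f = [0.0, 0.0, 0.0]
grad_f, grad_x = adder_backward(x, f, grad_y=-1.0)
f_new = update_filter(f, grad_f, gamma=0.1, eta=0.1)

print(adder_forward(x, f), adder_forward(x, f_new))
```

With this form of adaptive rate, the length of each update is γ·η·√k regardless of how small the raw gradient is, which is one way to keep layers with tiny gradients from stalling.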

To validate the effectiveness of AdderNets, the following setup is used:

**Benchmark datasets:** MNIST, CIFAR and ImageNet.

**Hardware:** NVIDIA Tesla V100 GPU

**Framework:** PyTorch.

The results from the MNIST experiment show that the convolutional neural network achieves a 99.4% accuracy with ∼435K multiplications and ∼435K additions. By replacing the multiplications in convolution with additions, the proposed AdderNet achieves a 99.4% accuracy, which is the same as that of CNNs, with ∼870K additions and almost no multiplication.
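The totals line up because of simple arithmetic: a convolution output built from n weights needs roughly n multiplications and n additions, while an adder output needs n subtractions plus n accumulations, i.e. about 2n additions. The sketch below counts operations for one layer under those assumptions (bias ignored, subtraction counted as addition, absolute value not counted). The layer shape is hypothetical and is not the network used in the MNIST experiment.

```python
def conv_layer_ops(out_h, out_w, c_out, c_in, k):
    """Approximate operation counts for a k x k convolution layer."""
    per_output = c_in * k * k          # weights contributing to one output value
    outputs = out_h * out_w * c_out    # number of output values
    return {"mults": outputs * per_output, "adds": outputs * per_output}


def adder_layer_ops(out_h, out_w, c_out, c_in, k):
    """Approximate operation counts for the same layer with adder filters."""
    per_output = c_in * k * k
    outputs = out_h * out_w * c_out
    # One subtraction and one accumulation per weight, no multiplications.
    return {"mults": 0, "adds": 2 * outputs * per_output}


# Hypothetical layer: 24 x 24 output, 6 filters, 1 input channel, 5 x 5 kernel.
shape = dict(out_h=24, out_w=24, c_out=6, c_in=1, k=5)
conv = conv_layer_ops(**shape)
adder = adder_layer_ops(**shape)

print(conv)   # {'mults': 86400, 'adds': 86400}
print(adder)  # {'mults': 0, 'adds': 172800}
```

The total number of operations stays the same; what changes is that every multiplication is swapped for an addition.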

The biggest difference between CNNs and AdderNets is that the convolutional neural network calculates the cross-correlation between filters and inputs. If filters and inputs are approximately normalised, the convolution operation then becomes equivalent to cosine distance between two vectors.

AdderNets, on the other hand, utilise the L1-norm to distinguish different classes. Thus, the features tend to be clustered towards different class centres.
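The two similarity measures can be compared side by side on a tiny image. This is a minimal sketch, assuming the adder response is the negative L1 distance between a patch and the filter; the helper names and the 3 x 3 image are made up for illustration.

```python
def sliding_response(image, filt, similarity):
    """Slide filt over image (no padding, stride 1) and score every patch."""
    fh, fw = len(filt), len(filt[0])
    weights = [filt[a][b] for a in range(fh) for b in range(fw)]
    out = []
    for i in range(len(image) - fh + 1):
        row = []
        for j in range(len(image[0]) - fw + 1):
            patch = [image[i + a][j + b] for a in range(fh) for b in range(fw)]
            row.append(similarity(patch, weights))
        out.append(row)
    return out


def cross_correlation(patch, weights):
    """What a CNN filter computes: a sum of products."""
    return sum(p * w for p, w in zip(patch, weights))


def negative_l1(patch, weights):
    """Assumed adder filter response: minus the L1 distance."""
    return -sum(abs(p - w) for p, w in zip(patch, weights))


image = [[1, 2, 0],
         [0, 1, 3],
         [4, 0, 1]]
filt = [[1, 3],
        [0, 1]]

print(sliding_response(image, filt, cross_correlation))  # [[8, 5], [3, 11]]
print(sliding_response(image, filt, negative_l1))        # [[-1, -7], [-8, 0]]
```

Both measures peak at the bottom-right position, where the patch is identical to the filter. The adder response reaches its maximum of 0 there and becomes more negative as the patch moves away from the filter, which is the distance-to-a-centre behaviour described above.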

### What Are The Implications?

Features of CNNs in different classes are divided by their angles. In contrast, features of AdderNets tend to be clustered towards different class centres, since AdderNets use the L1-norm to distinguish different classes.

However, AdderNets still have a long way to go.

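For example, let X be the input feature, F the filter and Y the output. The difference between CNNs and AdderNets shows up in how the variance of the output is approximated (constant factors that depend on the filter size and the number of input channels are omitted here):

**CNN**

$$\mathrm{Var}[Y_{\text{CNN}}] \propto \mathrm{Var}[X] \cdot \mathrm{Var}[F]$$

**AdderNet**

$$\mathrm{Var}[Y_{\text{AdderNet}}] \propto \mathrm{Var}[X] + \mathrm{Var}[F]$$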

Usually, Var[F], the variance of the filter, is a small value (~0.003). So multiplying by Var[F] in the case of CNNs results in smaller variances, which in turn leads to a smooth flow of information through the network.

In AdderNets, by contrast, the variances are added, so the output variance is larger. This means the gradient with respect to X is smaller, which slows down the network's updates.
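A quick simulation makes the gap visible. This is a rough sketch that assumes zero-mean, independent Gaussian inputs and filter weights, Var[X] = 1, Var[F] = 0.003 and 27 terms per output (for instance a 3 x 3 filter over 3 channels). All of these are illustrative choices. Under those assumptions one would expect roughly n·Var[X]·Var[F] for the CNN output and roughly n·(1 - 2/π)·(Var[X] + Var[F]) for the adder output, the second following from the variance of the absolute value of a Gaussian.

```python
import random
import statistics


def simulate_output_variance(n_terms, var_x, var_f, trials=2000, seed=0):
    """Estimate the output variance of one CNN filter and one adder filter."""
    rng = random.Random(seed)
    std_x, std_f = var_x ** 0.5, var_f ** 0.5
    cnn_outputs, adder_outputs = [], []
    for _ in range(trials):
        xs = [rng.gauss(0.0, std_x) for _ in range(n_terms)]
        fs = [rng.gauss(0.0, std_f) for _ in range(n_terms)]
        cnn_outputs.append(sum(x * f for x, f in zip(xs, fs)))
        adder_outputs.append(-sum(abs(x - f) for x, f in zip(xs, fs)))
    return statistics.pvariance(cnn_outputs), statistics.pvariance(adder_outputs)


cnn_var, adder_var = simulate_output_variance(n_terms=27, var_x=1.0, var_f=0.003)
print(f"CNN output variance:   {cnn_var:.3f}")
print(f"Adder output variance: {adder_var:.3f}")
```

With a filter variance this small, the adder output variance comes out around two orders of magnitude larger than the CNN output variance.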

AdderNets were proposed to make machine learning a lightweight task, yet here we are already trading away training time. To avoid the effects of large variance, the authors recommend using an adaptive learning rate for different layers in AdderNets.

Machine learning is computationally intensive, and there is always a tradeoff between accuracy and inference time (speed).

The high power consumption of high-end GPU cards has hindered state-of-the-art machine learning models from being deployed on smartphones and wearables.

Though companies like Apple, with their A13 Bionic chips, are revolutionising deep learning for mobiles, techniques that have been overlooked still need to be investigated properly. Something as scary as imagining convolutions without multiplications can result in models like AdderNets.

I have a master's degree in Robotics and I write about machine learning advancements. email:ram.sagar@analyticsindiamag.com