Can We Do Deep Learning Without Multiplications?

A neural network is built around simple linear equations like Y = WX + B, which contain weights W. These weights are multiplied with the input X and thus play a crucial role in how the model predicts.

Most of the computations in deep neural networks are multiplications between float-valued weights and float-valued activations during the forward inference.
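To make the cost concrete, here is a minimal sketch in plain Python that evaluates Y = WX + B for one small layer and counts the arithmetic involved. The function name `dense_forward` and the toy values are illustrative assumptions, not taken from any particular framework.

```python
def dense_forward(W, X, B):
    """Compute Y = WX + B for one input vector and count the arithmetic used."""
    mults = adds = 0
    Y = []
    for row, b in zip(W, B):
        acc = b
        for w, x in zip(row, X):
            acc += w * x  # one multiplication and one addition per weight
            mults += 1
            adds += 1
        Y.append(acc)
    return Y, mults, adds


# Toy layer: 3 inputs, 2 outputs (values chosen only for illustration).
W = [[0.5, -1.0, 2.0],
     [1.5, 0.0, -0.5]]
X = [2.0, 4.0, 1.0]
B = [0.5, -1.0]

Y, mults, adds = dense_forward(W, X, B)
print(Y, mults, adds)
```

Every weight costs one multiplication, so the multiplication count grows with the number of parameters in the layer.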

Prediction scores can degrade if a weight is updated wrongly, and as the network gets deeper (that is, as more layers of connected nodes are added), the error is magnified and the results miss the target.

To make models lighter while keeping their accuracy intact, many solutions have been developed, and one such solution is neural compression.

Neural compression refers to a combination of the following techniques:

  • Parameter Pruning And Sharing
  • Low-Rank Factorisation
  • Transferred/Compact Convolutional Filters
  • Knowledge Distillation

These methods make models lighter and faster, but they do not eliminate the underlying multiplication operations.

Can We Avoid Multiplication Altogether?

Convolutions are the gold standard of machine vision models, the default operation for extracting features from visual data. There has hardly been any attempt to replace convolution with a more efficient similarity measure, ideally one that involves only additions.

Instead of developing software and hardware solutions to cater for faster multiplications between layers, can we train models without multiplication?

To answer this question, researchers from Huawei labs and Peking University, in collaboration with the University of Sydney, have come up with AdderNets, or adder networks, which trade the massive multiplications in deep neural networks, especially convolutional neural networks (CNNs), for much cheaper additions to reduce computation costs.

The notion here is that adding two numbers is computationally cheaper than multiplying them.

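AdderNets measure the similarity between a filter and an input feature with the L1 distance instead of cross-correlation. A norm, in the context of linear algebra, is a function that assigns a non-negative length to a vector. The L1 norm is the sum of the absolute values of the vector's components.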

For the vector, say X = [3, 4]

The L1 norm is calculated as:

‖X‖₁ = |3| + |4| = 7
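As a rough illustration, the sketch below computes the L1 norm of the example vector and then compares two ways of scoring how well a filter matches an input patch: the cross-correlation used by convolution, and a response based on L1 distance. Writing the adder response as the negative L1 distance is my reading of the AdderNet formulation and should be treated as an assumption; the function names and toy values are hypothetical.

```python
def l1_norm(v):
    """Sum of the absolute values of the components."""
    return sum(abs(c) for c in v)


def cross_correlation(patch, filt):
    """Convolution-style response: needs one multiplication per element."""
    return sum(p * f for p, f in zip(patch, filt))


def adder_response(patch, filt):
    """Assumed adder-style response: negative L1 distance, additions only."""
    return -sum(abs(p - f) for p, f in zip(patch, filt))


print(l1_norm([3, 4]))

patch = [1.0, 2.0, 3.0, 4.0]   # flattened input patch (toy values)
filt = [1.0, 0.0, 3.0, 2.0]    # flattened filter (toy values)
print(cross_correlation(patch, filt))
print(adder_response(patch, filt))
```

With this convention the adder response is largest (zero) when the patch and the filter are identical, and becomes more negative as they move apart.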

The underlying working of AdderNets, according to Hanting Chen et al., is given as follows:

Input: An initialised adder network N with its training set X and the corresponding labels Y, along with the global learning rate γ and the hyper-parameter η. 

  1. Repeat 
  2. Select a batch {(x, y)} randomly from X and Y
  3. Run AdderNet ‘N’ on the mini-batch: x → N (x)
  4. Calculate ∂Y/∂F and ∂Y/∂X for adder filters
  5. Use the chain rule to generate the gradient of parameters in N
  6. Calculate the adaptive learning rate for each adder layer
  7. Update the parameters of the adder networks using stochastic gradient descent.
  8. Repeat until convergence 

Output: A well-trained adder network N with almost no multiplications.
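The steps above can be sketched for a single adder filter in plain Python. This is a toy under stated assumptions, not the authors' implementation: the output is taken as the negative L1 distance, ∂Y/∂F is taken as the full-precision difference X − F, ∂Y/∂X as F − X clipped to [-1, 1], and the adaptive learning rate as η·√k / ‖gradient‖₂ with k the number of filter elements. These forms are my reading of the paper. The squared-error loss, the single training sample (instead of a random mini-batch) and all function names are illustrative choices.

```python
import math


def adder_forward(x, f):
    """Adder filter response: negative L1 distance between input and filter."""
    return -sum(abs(xi - fi) for xi, fi in zip(x, f))


def hard_tanh(v):
    """Clip a value to the range [-1, 1]."""
    return max(-1.0, min(1.0, v))


def adder_gradients(x, f):
    """Return (dY/dF, dY/dX) in the assumed forms described above."""
    dy_df = [xi - fi for xi, fi in zip(x, f)]             # full-precision difference
    dy_dx = [hard_tanh(fi - xi) for xi, fi in zip(x, f)]  # clipped difference
    return dy_df, dy_dx


def adaptive_lr(grad, eta):
    """Assumed per-layer rate: eta * sqrt(k) / ||grad||_2, where k = len(grad)."""
    norm = math.sqrt(sum(g * g for g in grad))
    return eta * math.sqrt(len(grad)) / norm if norm > 0 else 0.0


def train_step(x, target, f, gamma, eta):
    """One SGD step on the loss 0.5 * (y - target)**2; returns (new_f, loss)."""
    y = adder_forward(x, f)
    dl_dy = y - target
    dy_df, _ = adder_gradients(x, f)
    grad_f = [dl_dy * g for g in dy_df]   # chain rule
    alpha = adaptive_lr(grad_f, eta)
    new_f = [fi - gamma * alpha * g for fi, g in zip(f, grad_f)]
    return new_f, 0.5 * dl_dy ** 2


x, target = [1.0, 2.0], -1.0   # one toy sample and its desired response
f = [0.0, 0.0]                 # initial filter
losses = []
for _ in range(10):
    f, loss = train_step(x, target, f, gamma=1.0, eta=0.1)
    losses.append(loss)
print(losses[0], losses[-1])
```

In a multi-layer network, the clipped ∂Y/∂X would be passed back to the previous layer. Here it is computed only to show its form.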

To validate the effectiveness of AdderNets, the following setup is used:

Benchmark datasets: MNIST, CIFAR and ImageNet.

Hardware: NVIDIA Tesla V100 GPU

Framework: PyTorch.

The results from the MNIST experiment show that the convolutional neural network achieves a 99.4% accuracy with ∼435K multiplications and ∼435K additions. By replacing the multiplications in convolution with additions, the proposed AdderNet achieves a 99.4% accuracy, which is the same as that of CNNs, with ∼870K additions and almost no multiplication.
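The pattern in these numbers (roughly equal multiplications and additions for the CNN, about twice as many additions for the AdderNet) can be reproduced with a simple operation count. The sketch below assumes each output element of a convolution costs one multiplication and one addition per filter weight, while an adder filter costs one subtraction and one accumulation per weight, both counted as additions. The layer shape is a hypothetical example, not the architecture used in the experiment.

```python
def conv_op_counts(c_in, c_out, kernel, out_h, out_w):
    """Approximate (multiplications, additions) for one convolution layer."""
    per_output = c_in * kernel * kernel       # weights touched per output element
    outputs = c_out * out_h * out_w
    return per_output * outputs, per_output * outputs


def adder_op_counts(c_in, c_out, kernel, out_h, out_w):
    """Approximate (multiplications, additions) for one adder layer."""
    per_output = c_in * kernel * kernel
    outputs = c_out * out_h * out_w
    return 0, 2 * per_output * outputs        # subtract, then accumulate


# Hypothetical layer: 1 input channel, 6 filters of size 5x5, 28x28 output.
shape = dict(c_in=1, c_out=6, kernel=5, out_h=28, out_w=28)
print(conv_op_counts(**shape))
print(adder_op_counts(**shape))
```

The total number of operations is unchanged. The saving comes from additions being cheaper than multiplications.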

The biggest difference between CNNs and AdderNets is that the convolutional neural network calculates the cross-correlation between filters and inputs. If filters and inputs are approximately normalised, the convolution operation then becomes equivalent to cosine distance between two vectors. 
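A quick numerical check of this equivalence, using cosine similarity as the measure: once both vectors are scaled to unit length, their plain dot product (the cross-correlation) equals the cosine of the angle between the original vectors. The vectors and helper names below are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalise(v):
    """Scale a vector to unit Euclidean length."""
    n = math.sqrt(dot(v, v))
    return [c / n for c in v]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


patch = [3.0, 4.0]   # toy input
filt = [4.0, 3.0]    # toy filter

cos = cosine_similarity(patch, filt)
corr = dot(l2_normalise(patch), l2_normalise(filt))
print(cos, corr)
```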

AdderNets, on the other hand, utilise the L1-norm to distinguish different classes. Thus, the features tend to be clustered towards different class centres.

What Are The Implications?

In other words, features of CNNs in different classes are divided by their angles, whereas features of AdderNets are separated by their L1 distance to the class centres.

However, AdderNets still have a long way to go.

For example, if X is the input feature, F is the filter and Y is the output, the difference between CNNs and AdderNets can be seen in how the variance of the output scales in each case:

  1. CNN: Var[Y] ∝ Var[X] · Var[F]
  2. AdderNet: Var[Y] ∝ Var[X] + Var[F]

Usually, Var[F], the variance of the filter, is a small value (~0.003). Multiplying by Var[F] in the case of CNNs therefore keeps the output variance small, which in turn leads to a smooth flow of information through the network.

In AdderNets, the variances are added instead, so the output variance is larger. This means the gradient w.r.t. X is smaller, which slows down the network's updates.
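A small simulation illustrates the gap. It assumes X and F are independent, zero-mean Gaussian variables with Var[X] = 1 and Var[F] = 0.003 (the Gaussian assumption and the value of Var[X] are mine, chosen only for illustration), and compares the variance of a single product term X·F with that of a single L1 term |X − F|.

```python
import random
import statistics

random.seed(0)                      # fixed seed so the run is repeatable
n = 20000
var_x, var_f = 1.0, 0.003           # assumed variances of inputs and filter weights

xs = [random.gauss(0.0, var_x ** 0.5) for _ in range(n)]
fs = [random.gauss(0.0, var_f ** 0.5) for _ in range(n)]

# Variance of one multiplication term (CNN-style) and one L1 term (adder-style).
var_mul = statistics.pvariance([x * f for x, f in zip(xs, fs)])
var_add = statistics.pvariance([abs(x - f) for x, f in zip(xs, fs)])
print(var_mul, var_add)
```

The product term stays close to Var[X]·Var[F], while the L1 term is governed by Var[X] + Var[F] and comes out roughly two orders of magnitude larger in this setting.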

AdderNets were proposed to make machine learning a lightweight task, yet here we are, already trading away training time. To counter the effects of large variance, the authors recommend using an adaptive learning rate for the different layers in AdderNets.

Machine learning is computationally intensive, and there is always a tradeoff between accuracy and inference time (speed).

The high power consumption of high-end GPU cards has kept state-of-the-art machine learning models from being deployed on smartphones and wearables.

Though companies like Apple, with its A13 Bionic chip, are revolutionising deep learning on mobile devices, techniques that have been overlooked still deserve careful investigation. Something as daunting as imagining convolutions without multiplications can result in models like AdderNets.

PS: The story was written using a keyboard.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.