8 Neural Network Compression Techniques For ML Developers

Ram Sagar

As neural networks grow deeper and wider, reducing their storage and computational cost becomes critical, especially for real-time applications such as online learning and incremental learning.

In addition, recent years have witnessed significant progress in virtual reality, augmented reality and smart wearable devices, which creates challenges in deploying deep learning systems on portable devices with limited resources (e.g. memory, CPU, energy, bandwidth).

Most compression techniques fall into one of the following four categories:

Parameter Pruning And Sharing

  • Removes redundant parameters that are not sensitive to performance
  • Robust to various settings and tasks
  • Redundancies in the model parameters are identified, and the uncritical, redundant ones are removed
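As a concrete illustration, the simplest form of parameter pruning is unstructured magnitude pruning, which zeroes out the smallest weights. A minimal NumPy sketch (real pipelines typically fine-tune the network after each pruning step):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights.

    A minimal sketch of unstructured magnitude pruning; production
    pruning alternates pruning with retraining to recover accuracy.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)
```

The pruned matrix can then be stored in a sparse format, which is where the actual storage saving comes from.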

Low-Rank Factorisation

  • Uses matrix/tensor decomposition to estimate the informative parameters of deep convolutional neural networks
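For intuition, the weight matrix of a fully connected layer can be factored with a truncated SVD, replacing one m×n matrix with two thin ones. A hedged NumPy sketch (papers in this family factor convolution tensors with more elaborate decompositions):

```python
import numpy as np

def low_rank_approx(W, rank):
    """Approximate W (m x n) as A (m x rank) @ B (rank x n),
    cutting parameters from m*n to rank*(m + n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb singular values into A
    B = Vt[:rank, :]
    return A, B

W = np.random.default_rng(0).standard_normal((64, 128))
A, B = low_rank_approx(W, 16)   # 8192 params -> 3072 params
```

At rank 16 this layer stores 16 × (64 + 128) = 3,072 parameters instead of 64 × 128 = 8,192, at the cost of some approximation error.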

Transferred/Compact Convolutional Filters

  • Special structural convolutional filters are designed to reduce the parameter space and save storage/computation
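One well-known example of a compact filter design is the depthwise separable convolution popularised by MobileNets, which splits a standard convolution into a per-channel spatial filter plus a 1×1 pointwise filter. The parameter savings are easy to compute:

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution + pointwise 1x1 convolution."""
    return c_in * k * k + c_in * c_out

std = conv_params(128, 256, 3)                  # 294,912 parameters
sep = depthwise_separable_params(128, 256, 3)   # 33,920 parameters
```

For this layer the separable variant uses roughly 8.7x fewer parameters, and the compute saving is of the same order.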

Knowledge Distillation

  • A compact student network is trained to reproduce the output of a larger teacher network
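The soft-label term of Hinton-style distillation can be sketched in a few lines of NumPy, assuming logits are available from both networks (the usual hard-label cross-entropy term is omitted here):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T gives softer targets."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between softened teacher and student outputs,
    scaled by T*T so gradients stay comparable across temperatures."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))) * T * T)
```

Training the student on this loss (typically mixed with the ordinary cross-entropy on true labels) transfers the teacher's "dark knowledge" about class similarities.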

Now let’s take a look at a few papers that introduced novel compression models:

1. Deep Neural Network Compression with Single and Multiple Level Quantization

In this paper, the authors propose two novel network quantization approaches: single-level network quantization (SLQ) for high-bit quantization and multi-level network quantization (MLQ) for extremely low-bit quantization.

Network quantization is considered at both the width level and the depth level.
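For readers new to quantization, the basic idea is mapping continuous weights onto a small set of levels. The sketch below is a generic uniform quantizer, not the SLQ/MLQ algorithms from the paper, which quantize weights incrementally and retrain between steps:

```python
import numpy as np

def uniform_quantize(w, bits=4):
    """Uniformly quantize weights to 2**bits levels over their range.

    A generic baseline quantizer for illustration only.
    """
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((w - lo) / scale)   # integer codes, cheap to store
    return q * scale + lo            # dequantized weights
```

At 4 bits each weight needs only 16 codes instead of a 32-bit float, an 8x storage reduction before any entropy coding.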

2. Efficient Neural Network Compression

In this paper, the authors propose an efficient method for obtaining the rank configuration of the whole network. Unlike previous methods, which consider each layer separately, this method considers the whole network when choosing the right rank configuration.

3. 3LC: Lightweight and Effective Traffic Compression

3LC is a lossy compression scheme for state change traffic in distributed machine learning (ML) that strikes a balance between multiple goals: traffic reduction, accuracy, computation overhead, and generality. It combines three techniques: value quantization with sparsity multiplication, base encoding, and zero-run encoding.
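Of those three techniques, zero-run encoding is the easiest to illustrate: quantized gradient updates are mostly zeros, so runs of zeros compress well. The sketch below is a simplified stand-in, since 3LC itself packs runs into spare byte codes left over after base encoding:

```python
def zero_run_encode(values):
    """Run-length encode zeros: output mixes nonzero values with
    ('Z', run_length) markers. Simplified illustration only."""
    out, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            if run:
                out.append(('Z', run))
                run = 0
            out.append(v)
    if run:
        out.append(('Z', run))
    return out

def zero_run_decode(encoded):
    """Invert zero_run_encode exactly (the run-length step is lossless)."""
    out = []
    for item in encoded:
        if isinstance(item, tuple):
            out.extend([0] * item[1])
        else:
            out.append(item)
    return out
```

The lossy part of 3LC lives in the quantization stage; once values are quantized, the run-length stage is a lossless space saver.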


4. Universal Deep Neural Network Compression

This work introduces, for the first time, universal DNN compression by universal vector quantization and universal source coding. In particular, the paper examines universal randomised lattice quantization of DNNs, which randomises DNN weights by uniform random dithering before lattice quantization and can perform near-optimally on any source without relying on knowledge of its probability distribution.
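The dithering idea can be sketched in one dimension: add a uniform dither shared between encoder and decoder, round to the lattice, then subtract the dither. This is a hedged illustration of the randomised quantization step only; the paper pairs it with universal source coding of the lattice indices:

```python
import numpy as np

def dithered_quantize(w, step=0.05, seed=0):
    """Randomized (dithered) uniform quantization.

    The dither is pseudo-random, so encoder and decoder can
    regenerate it from a shared seed. The quantization error is
    then uniform and independent of the source distribution.
    """
    rng = np.random.default_rng(seed)
    dither = rng.uniform(-step / 2, step / 2, size=w.shape)
    indices = np.round((w + dither) / step)   # these get entropy-coded
    return indices * step - dither            # decoder's reconstruction
```

A useful property to verify: the reconstruction error never exceeds half a quantization step, regardless of the weight distribution.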

5. Compression using Transform Coding and Clustering

The compression (encoding) approach consists of a transform stage followed by clustering, achieving high encoding efficiency, and is intended to meet the requirements of a future standard for deep model communication and transmission. Overall, this lightweight model-encoding pipeline, built on uniform quantization and clustering, yields strong compression performance and can be further combined with existing deep model compression approaches to produce light-weight models.
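The clustering stage amounts to weight sharing: every weight is replaced by the nearest of k shared values, so it can be stored as a small index plus a codebook. A plain 1-D k-means sketch of that stage (the paper combines it with a transform and uniform quantization):

```python
import numpy as np

def cluster_weights(w, k=8, iters=20, seed=0):
    """Cluster weights into k shared values (weight sharing).

    Returns the quantized weights and the k-entry codebook.
    A basic k-means illustration, not the paper's full pipeline.
    """
    rng = np.random.default_rng(seed)
    flat = w.ravel()
    centers = rng.choice(flat, size=k, replace=False)
    for _ in range(iters):
        # assign each weight to its nearest center, then update centers
        assign = np.argmin(np.abs(flat[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            members = flat[assign == j]
            if members.size:
                centers[j] = members.mean()
    assign = np.argmin(np.abs(flat[:, None] - centers[None, :]), axis=1)
    return centers[assign].reshape(w.shape), centers
```

With k = 8, each weight index fits in 3 bits plus a tiny codebook, versus 32 bits per float.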

6. Weightless: Lossy Weight Encoding

The encoding is based on the Bloomier filter, a probabilistic data structure that saves space at the cost of introducing random errors. The results show that this technique can compress DNN weights by up to 496x; at the same model accuracy, this is up to a 1.51x improvement over the state of the art.

7. Adaptive Estimators Show Information Compression

The authors developed more robust mutual information estimation techniques that adapt to the hidden activity of neural networks and produce more sensitive measurements of activations from all activation functions, especially unbounded ones. Using these adaptive estimators, they explored compression in networks with a range of different activation functions.
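For background, the standard baseline these estimators improve on is a fixed-bin histogram estimate of mutual information between an input and a hidden activation. A sketch of that baseline (the paper's contribution is precisely that fixed bins break down for unbounded activations, which adaptive bin placement fixes):

```python
import numpy as np

def mutual_information(x, y, bins=30):
    """Histogram (fixed-bin) estimate of I(X; Y) in nats.

    Baseline estimator only; adaptive estimators place bins
    according to the observed activation distribution instead.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))
```

A quick sanity check: a variable shares much more information with itself than with independent noise.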

8. MLPrune: Multi-Layer Pruning For Neural Network Compression

Manually setting the compression ratio of each layer to find the sweet spot between model size and accuracy is computationally expensive. So, in this paper, the authors propose a Multi-Layer Pruning method (MLPrune) that can automatically decide appropriate compression ratios for all layers.
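The flavour of "automatic per-layer ratios" can be shown with a much simpler scheme: rank all weights across layers against one global threshold, and each layer's compression ratio falls out automatically. This is a simplified stand-in, as MLPrune itself ranks weights by a Kronecker-factored approximation of the loss curvature rather than raw magnitude:

```python
import numpy as np

def global_prune(layers, sparsity=0.7):
    """Prune all layers against a single global magnitude threshold,
    so per-layer compression ratios are decided automatically.

    Illustration of the global-ranking idea only, not MLPrune's
    curvature-based criterion.
    """
    all_w = np.concatenate([np.abs(w).ravel() for w in layers])
    k = int(sparsity * all_w.size)
    threshold = np.partition(all_w, k - 1)[k - 1]
    pruned = [np.where(np.abs(w) <= threshold, 0.0, w) for w in layers]
    ratios = [1 - np.count_nonzero(p) / p.size for p in pruned]
    return pruned, ratios
```

Layers whose weights matter less (here, have smaller magnitudes) end up pruned more aggressively, with no per-layer tuning.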

The large number of weights in deep neural networks makes the models difficult to deploy in low-memory environments. The techniques discussed above not only achieve higher model compression but also reduce the compute resources required during inference. This enables model deployment on mobile phones and IoT edge devices, as well as in "inferencing as a service" environments in the cloud.


Copyright Analytics India Magazine Pvt Ltd
