As neural networks grow larger, with more layers and nodes, reducing their storage and computational cost becomes critical, especially for real-time applications such as online learning and incremental learning.
In addition, recent years have witnessed significant progress in virtual reality, augmented reality, and smart wearable devices, creating challenges in deploying deep learning systems to portable devices with limited resources (e.g. memory, CPU, energy, bandwidth).
Here are the main categories that compression techniques fall into:
Parameter Pruning And Sharing
- Removes redundant parameters that are not sensitive to the performance
- Robust to various settings
- Redundancies in the model parameters are explored, and the uncritical, redundant ones are removed
Low-Rank Factorization
- Uses matrix decomposition to estimate the informative parameters of deep convolutional neural networks
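The low-rank idea can be sketched with a truncated SVD. This is an illustrative example, not the method of any particular paper: a dense weight matrix W is replaced by two thin factors, trading a small approximation error for far fewer stored parameters.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Return factors U, V such that U @ V is the best rank-`rank`
    approximation of W in the Frobenius norm (via truncated SVD)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))          # a hypothetical layer's weights
U_r, V_r = low_rank_factorize(W, rank=16)

original_params = W.size                     # 256 * 128 = 32768
compressed_params = U_r.size + V_r.size      # 16 * (256 + 128) = 6144
```

Storing U (m×r) and V (r×n) instead of W (m×n) cuts parameters from m·n to r·(m+n), which pays off whenever the chosen rank r is well below min(m, n).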
Transferred/Compact Convolutional Filters
- Special structural convolutional filters are designed to reduce the parameter space and save storage/computation
Knowledge Distillation
- A distilled model is used to train a more compact neural network that reproduces the output of a larger network
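The distillation objective can be sketched as follows. This is a minimal, hypothetical illustration: the student is trained to match the teacher's class probabilities softened by a temperature T, and the logits shown are made-up example values.

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T gives softer probabilities."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

teacher_logits = [8.0, 2.0, 1.0]   # assumed outputs of a large teacher network
student_logits = [6.0, 3.0, 1.0]   # assumed outputs of a compact student
loss = distillation_loss(student_logits, teacher_logits)
```

In practice this term is usually combined with the ordinary cross-entropy on the true labels; only the soft-target term is shown here.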
Now let’s take a look at a few papers that introduced novel compression models:
In this paper, the authors propose two novel network quantization approaches: single-level network quantization (SLQ) for high-bit quantization and multi-level network quantization (MLQ) for extremely low-bit quantization.
Network quantization is considered at both the width and the depth level.
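The building block these approaches refine is clustering-based weight quantization. The sketch below shows only that core idea with a simple 1-D k-means; SLQ's loss-aware partitioning of the clusters and MLQ's layer-wise scheme are omitted.

```python
import numpy as np

def quantize_weights(w, bits=2, iters=20):
    """Cluster weights into 2**bits centroids (1-D k-means) and map each
    weight to its nearest centroid; only the small codebook plus per-weight
    indices need to be stored."""
    k = 2 ** bits
    centroids = np.linspace(w.min(), w.max(), k)   # simple initialization
    for _ in range(iters):
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):                   # guard against empty clusters
                centroids[j] = w[idx == j].mean()
    return centroids, idx

rng = np.random.default_rng(1)
w = rng.standard_normal(1000)                      # hypothetical layer weights
codebook, idx = quantize_weights(w, bits=2)
w_quantized = codebook[idx]                        # dequantized weights
```

With 2-bit indices plus a 4-entry codebook, storage drops from 32 bits per weight to roughly 2 bits per weight.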
In this paper, the authors propose an efficient method for obtaining the rank configuration of the whole network. Unlike previous methods, which consider each layer separately, this method treats the network as a whole when choosing the right rank configuration.
3LC is a lossy compression scheme for state-change traffic in distributed machine learning (ML) that strikes a balance between multiple goals: traffic reduction, accuracy, computation overhead, and generality. It combines three techniques: value quantization with sparsity multiplication, base encoding, and zero-run encoding.
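The zero-run idea can be illustrated with a toy run-length coder. This is a simplified sketch, not 3LC's actual wire format: after quantization most transmitted values are zero, so consecutive zeros collapse into a single (marker, length) token while nonzero values pass through unchanged.

```python
def zero_run_encode(values):
    """Replace each run of zeros with a ('Z', run_length) token."""
    encoded, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            if run:
                encoded.append(("Z", run))
                run = 0
            encoded.append(v)
    if run:
        encoded.append(("Z", run))
    return encoded

def zero_run_decode(tokens):
    """Inverse of zero_run_encode: expand run tokens back into zeros."""
    out = []
    for t in tokens:
        if isinstance(t, tuple):
            out.extend([0] * t[1])
        else:
            out.append(t)
    return out

data = [0, 0, 0, 1, 0, -1, 0, 0]       # hypothetical quantized gradient values
tokens = zero_run_encode(data)         # [('Z', 3), 1, ('Z', 1), -1, ('Z', 2)]
```

The sparser the quantized traffic, the longer the zero runs and the larger the savings.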
This work introduces, for the first time, universal DNN compression via universal vector quantization and universal source coding. In particular, the paper examines universal randomized lattice quantization of DNNs, which randomizes DNN weights by uniform random dithering before lattice quantization and can perform near-optimally on any source without relying on knowledge of its probability distribution.
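The dithering step can be sketched in one dimension. This is an illustrative simplification (a scalar uniform lattice with an assumed step size `delta`, not the paper's full scheme): a uniform random dither is added before rounding to the lattice and subtracted afterwards, which keeps the quantization error bounded and independent of the weight distribution.

```python
import numpy as np

def dithered_quantize(w, delta=0.1, seed=0):
    """Randomized uniform quantization: add a shared uniform dither,
    round to the lattice of step `delta`, then subtract the dither
    (the receiver regenerates the same dither from the shared seed)."""
    rng = np.random.default_rng(seed)
    dither = rng.uniform(-delta / 2, delta / 2, size=np.shape(w))
    q = np.round((w + dither) / delta) * delta
    return q - dither

w = np.array([0.03, -0.12, 0.27, 0.55])   # hypothetical weight values
w_hat = dithered_quantize(w)
```

Whatever the distribution of `w`, the per-weight reconstruction error never exceeds `delta / 2`, which is the property that makes the scheme "universal".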
The compression (encoding) approach consists of a transform followed by clustering, achieving high encoding efficiency, and is expected to meet the requirements of a future standard for deep-model communication and transmission. Overall, the framework provides a lightweight model-encoding pipeline in which uniform quantization and clustering yield strong compression performance, and it can be further combined with existing deep-model compression approaches to produce lightweight models.
The encoding is based on the Bloomier filter, a probabilistic data structure that saves space at the cost of introducing random errors. The results show that this technique can compress DNN weights by up to 496x; at the same model accuracy, this is up to a 1.51x improvement over the state of the art.
The authors developed more robust mutual-information estimation techniques that adapt to the hidden activity of neural networks and produce more sensitive measurements of activations for all activation functions, especially unbounded ones. Using these adaptive estimation techniques, they explored compression in networks with a range of different activation functions.
It is computationally expensive to manually set the compression ratio of each layer to find the sweet spot between model size and accuracy. So, in this paper, the authors propose a Multi-Layer Pruning method (MLPrune) that can automatically decide appropriate compression ratios for all layers.
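To see how per-layer ratios can fall out of a single global criterion, consider the hypothetical sketch below. It is a simplified stand-in, not MLPrune's actual criterion: one global magnitude threshold is applied across all layers, and each layer's pruning ratio emerges automatically rather than being set by hand.

```python
import numpy as np

def global_magnitude_prune(layers, sparsity=0.8):
    """Zero out the `sparsity` fraction of smallest-magnitude weights
    across ALL layers at once; per-layer ratios fall out automatically."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in layers])
    threshold = np.quantile(all_mags, sparsity)
    return [np.where(np.abs(w) >= threshold, w, 0.0) for w in layers]

rng = np.random.default_rng(2)
layers = [
    rng.standard_normal((10, 10)),        # hypothetical layer with large weights
    rng.standard_normal((20, 5)) * 0.1,   # hypothetical layer with small weights
]
pruned = global_magnitude_prune(layers, sparsity=0.8)
ratios = [float((w == 0).mean()) for w in pruned]   # differs per layer
```

The second layer, whose weights are uniformly smaller, ends up pruned far more aggressively than the first, with no per-layer tuning.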
The large number of weights in deep neural networks makes these models difficult to deploy in low-memory environments. The techniques discussed above not only achieve higher model compression but also reduce the compute resources required during inference. This enables model deployment on mobile phones and IoT edge devices, as well as in "inference as a service" environments in the cloud.