It is a common myth that a larger, more complex machine learning model must be better. However, a model’s size and complexity do not necessarily translate into good performance. Moreover, such models are harder to train and carry environmental costs.
Interestingly, famous ImageNet models such as AlexNet and VGG-16 have been compressed to as little as one-fiftieth of their original size without losing accuracy. Compression has also increased their inference speed and made them easier to deploy across a range of devices.
What Is Model Compression?
Model compression is a set of techniques for deploying state-of-the-art deep networks on devices with limited power and resources without compromising the model’s accuracy. Reducing a model’s size and/or latency means it has fewer and smaller parameters and needs less RAM.
Researchers have been developing model compression techniques since the late 1980s. Some of the important papers from that time include ‘Pruning versus clipping in neural networks’ (1989), ‘Skeletonization: A technique for trimming the fat from a network via relevance assessment’ (1989), and ‘A simple procedure for pruning back-propagation trained neural networks’ (1990).
Of late, model compression has been drawing interest from the research community, especially after the 2012 ImageNet competition.
“This Imagenet 2012 event was definitely what triggered the big explosion of AI today. There were definitely some very promising results in speech recognition shortly before this (again many of them sparked by Toronto), but they didn’t take off publicly as much as that ImageNet win did in 2012 and the following years,” said Matthew Zeiler, an NYU PhD and winner of the ImageNet competition in 2013.
Popular Model Compression Techniques
Pruning: This technique entails removing connections between neurons, and sometimes whole neurons, channels or filters, from a trained network. Pruning works because networks tend to be over-parameterised: multiple features convey almost the same information, so many of them are inconsequential in the grand scheme of things.
Depending on the type of network component being removed, pruning can be classified into unstructured and structured pruning. In unstructured pruning, individual weights or neurons are removed, and in structured pruning, entire channels or filters are taken out.
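The two flavours of pruning can be sketched in a few lines of plain Python. The ranking criterion used here, weight magnitude, is an assumption made for illustration (the article does not prescribe one), and the matrix `W` and both function names are hypothetical. Each row of `W` stands in for one filter or neuron.

```python
def prune_unstructured(weights, sparsity):
    """Zero out the smallest-magnitude individual weights.

    Assumption: magnitude is used as the importance score.
    Ties at the threshold may remove slightly more than requested.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of weights to remove
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def prune_structured(weights, n_remove):
    """Drop the n_remove rows (filters) with the smallest L1 norm."""
    norms = [sum(abs(w) for w in row) for row in weights]
    by_norm = sorted(range(len(weights)), key=lambda i: norms[i])
    keep = sorted(by_norm[n_remove:])  # preserve the original row order
    return [weights[i][:] for i in keep]


# Hypothetical 3x3 weight matrix; the middle row carries little signal.
W = [
    [0.9, -0.05, 0.4],
    [0.01, 0.02, -0.03],
    [-0.7, 0.6, 0.08],
]

sparse_W = prune_unstructured(W, sparsity=0.5)  # same shape, more zeros
smaller_W = prune_structured(W, n_remove=1)     # one row fewer
```

Note the practical difference: unstructured pruning keeps the matrix shape and only adds zeros, so it saves memory or compute only with sparse storage or kernels, while structured pruning yields a genuinely smaller dense matrix.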
Quantization: Unlike pruning, where the number of weights is reduced, quantization reduces the size of each weight, that is, the number of bits used to store it. It is a process of mapping values from a large set to values in a smaller set, so the output contains a smaller range of values than the input without losing much information in the process.
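As a rough illustration of that mapping, the sketch below applies uniform (affine) quantization to a handful of floating-point weights, storing each as an 8-bit integer code plus a shared scale and offset. The scheme, the function names and the sample weights are assumptions chosen for illustration; real toolchains offer several variants.

```python
def quantize(values, num_bits=8):
    """Map floats onto integer codes in [0, 2**num_bits - 1]."""
    qmax = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    # Width of one quantization step; guard against a constant input.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]


# Hypothetical weights, normally stored as 32-bit floats.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, lo = quantize(weights, num_bits=8)
restored = dequantize(codes, scale, lo)

# Each restored weight differs from the original by at most half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Going from 32-bit floats to 8-bit codes cuts storage for the weights by roughly four times, at the cost of the small rounding error measured by `max_error`.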
Selective attention: Only the objects or elements of interest are focused on, while the background and other elements are discarded. This technique requires the addition of a selective attention network upstream of the existing AI system.
Low-rank factorisation: This process uses matrix or tensor decomposition to estimate useful parameters. A weight matrix with greater dimension and rank can be replaced with smaller dimension matrices through factorisation.
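A short worked example shows where the saving comes from. The symbols and the layer size below are assumptions picked for illustration: a weight matrix W is approximated by the product of two thinner matrices U and V of rank r.

```latex
\begin{align*}
  W &\in \mathbb{R}^{m \times n}, \qquad
  W \approx U V, \qquad
  U \in \mathbb{R}^{m \times r},\;
  V \in \mathbb{R}^{r \times n},\;
  r \ll \min(m, n) \\[4pt]
  \text{parameters:}\quad & mn \;\longrightarrow\; r(m + n),
  \qquad \text{a saving whenever } r < \frac{mn}{m + n} \\[4pt]
  m = n = 1024,\; r = 64:\quad & 1024 \cdot 1024 = 1{,}048{,}576
  \;\longrightarrow\; 64 \cdot (1024 + 1024) = 131{,}072
  \quad (8\times \text{ fewer})
\end{align*}
```

The layer then computes U(Vx) instead of Wx, which also reduces the number of multiplications by the same factor. How small r can be made before accuracy suffers depends on the layer and has to be checked empirically.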
Knowledge distillation: This is an indirect way of compressing a model, in which an existing larger model, called the teacher, trains smaller models, called students. The goal is for the student model to reproduce the output distribution of the teacher model. To achieve this, a loss function that measures the gap between the two is minimised as knowledge is transferred from teacher to student.
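One common way to express "same distribution" as a loss is the Kullback-Leibler divergence between temperature-softened teacher and student outputs. The sketch below assumes that formulation; the logits, the temperature value and the function names are hypothetical, and a real training loop would minimise this quantity with respect to the student's parameters.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature softens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's soft targets to the student's outputs."""
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for a three-class problem.
teacher = [4.0, 1.0, 0.2]
close_student = [3.5, 1.2, 0.3]  # roughly agrees with the teacher
far_student = [0.2, 1.0, 4.0]    # ranks the classes the other way round

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
```

The loss is zero when the student matches the teacher exactly and grows as the two distributions drift apart, so `loss_far` comes out larger than `loss_close`. In practice this term is usually combined with the ordinary loss on the true labels.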
Model compression continues to gather momentum. In 2019, MIT researchers introduced the Lottery Ticket Hypothesis, which builds on traditional pruning. It states that a randomly initialised, dense neural network contains a subnetwork which, when trained in isolation, can match the test accuracy of the original network after training for at most the same number of iterations. Facebook AI found that the technique could be extended to reinforcement learning and natural language processing.
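The step that distinguishes this from ordinary pruning is the reset: the pruning mask is derived from the trained weights, but the surviving weights go back to their original initial values before retraining. Below is a minimal one-shot sketch of that step, assuming magnitude-based pruning; the function name and the weight values are hypothetical, and the training runs themselves are omitted.

```python
def winning_ticket(initial, trained, sparsity):
    """Build a mask from trained weights and apply it to the initial weights."""
    # Rank positions by trained magnitude, smallest first.
    ranked = sorted(range(len(trained)), key=lambda i: abs(trained[i]))
    pruned = set(ranked[: int(len(trained) * sparsity)])
    mask = [0 if i in pruned else 1 for i in range(len(trained))]
    # Surviving weights are reset to their ORIGINAL initial values.
    ticket = [w if m else 0.0 for w, m in zip(initial, mask)]
    return mask, ticket


# Hypothetical weights before and after a training run (not shown).
initial_weights = [0.3, -0.1, 0.2, -0.4]
trained_weights = [0.9, -0.02, 0.05, -0.7]

mask, ticket = winning_ticket(initial_weights, trained_weights, sparsity=0.5)
```

The resulting `ticket` would then be trained again with the mask held fixed, and its accuracy compared with that of the full network.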
MIT assistant professor Song Han introduced AutoML for Model Compression (AMC). It leverages reinforcement learning to produce a model compression policy that achieves a higher compression ratio and better preserves accuracy with less human effort. AutoML has since been widely adopted across the industry.
Further, companies such as Arm have taken a shine to TinyML, an embedded software technology for running ML models on low-power devices. As per global tech market advisory firm ABI Research, about 2.5 billion devices will be shipped with a TinyML chipset in 2030. Model compression lies at the heart of TinyML.
Some of the major breakthroughs in recent years in model compression include:
- In the paper titled ‘Deep Neural Network Compression with Single and Multiple Level Quantization’, the authors proposed two novel quantization approaches: single-level network quantization (SLQ) for high-bit quantization and multi-level network quantization (MLQ) for extremely low-bit quantization.
- 3LC is a lossy compression scheme developed by Google for state change traffic in distributed machine learning. The authors showed that the scheme strikes a balance between traffic reduction, accuracy, computation overhead, and generality.
- In 2018, researchers from Samsung introduced the first universal DNN compression scheme using universal vector quantization and source coding.
- In 2019, researchers introduced a Multi-Layer Pruning method (MLPrune) to decide compression ratios for all layers automatically.
I am a journalist with a postgraduate degree in computer network engineering. When not reading or writing, one can find me doodling away to my heart’s content.