4 Key Techniques to Compress Machine Learning Models

Reducing the size of the model using these techniques can help reduce the inference time of the model

Highly accurate machine learning models can be heavy, requiring a lot of computational power and thus increasing inference time. Compressing them into smaller models is a widely practised way to speed up inference. Depending on the technique, the parameters are made smaller or fewer, so the model uses less RAM. Compression can also simplify the model, reducing latency compared to the original and thus increasing inference speed.

There are four heavily researched techniques popular for compressing machine learning models – 

  1. Quantisation
  2. Pruning
  3. Knowledge distillation
  4. Low-rank tensor decomposition


Quantisation

One of the most widely used methods for compressing models, quantisation involves reducing the number of bits used to represent each weight. Storing the weights in a smaller representation reduces the size of the model and speeds up its processing and inference.


Simply put, the technique maps values from a larger set onto a smaller set, so the output covers a smaller range of values than the input while ideally losing as little information as possible. For example, converting weights from 32-bit to 8-bit values might lose some information, but it shrinks the machine learning model and thus increases efficiency.
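The mapping described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses a symmetric scheme with a single scale for all weights, which is one of several possible schemes, and the function names `quantise` and `dequantise` and the sample weights are made up for the example.

```python
def quantise(weights, num_bits=8):
    """Map floats to signed integers using one scale for the whole list."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantise(q, scale):
    """Recover approximate float values from the integers."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.003, 0.88, -0.51]     # illustrative values
q, scale = quantise(weights)
restored = dequantise(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)            # [42, -127, 0, 88, -51]
print(max_error)    # at most half a quantisation step (scale / 2)
```

Each weight now needs 8 bits instead of 32, plus one shared scale. The small weight 0.003 collapses to zero, which is the kind of information loss the technique tries to keep to a minimum.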

The goal of this technique is to reduce the size and precision of the network without a noticeable drop in efficacy.

You can read more about quantisation techniques for neural networks here.


Pruning

Unlike quantisation, which reduces the precision of the weights, pruning reduces the number of weights by removing connections between channels, filters, and neurons. Pruning was introduced because networks are often over-parameterised, with multiple nodes encoding the same information.


In simple words, the process is about removing nodes to decrease the number of parameters. Depending on the task, there are two classes of pruning – 

Unstructured pruning is about removing individual neurons or weights. The removed neurons and connections are replaced with zeros in the weight matrix, which increases the network’s sparsity, that is, the fraction of weights that are zero.

Structured pruning involves removing complete filters and channels. Since it removes whole blocks of weights from the matrices, it leaves smaller dense matrices and avoids the problem of sparse connectivity patterns.
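The difference between the two classes can be sketched on a small weight matrix. The selection rule used here (drop whatever has the smallest magnitude) is an assumption made for illustration, as are the function names, the threshold, and the sample weights.

```python
def prune_unstructured(matrix, threshold):
    """Zero out individual weights whose magnitude is below the threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in matrix]


def sparsity(matrix):
    """Fraction of weights that are exactly zero."""
    flat = [w for row in matrix for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)


def prune_structured(matrix, keep):
    """Keep only the `keep` rows (standing in for filters) with the largest L1 norm."""
    norms = [sum(abs(w) for w in row) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])                # preserve the original row order
    return [matrix[i] for i in kept]


weights = [                                     # illustrative 3 x 3 layer
    [0.90, -0.02, 0.40],
    [0.01, 0.03, -0.02],
    [-0.70, 0.05, 0.60],
]

sparse = prune_unstructured(weights, threshold=0.1)
smaller = prune_structured(weights, keep=2)

print(sparsity(sparse))                 # 5 of 9 weights are now zero
print(len(smaller), len(smaller[0]))    # 2 3
```

The unstructured result keeps its 3 x 3 shape and only gains zeros, so it saves memory or time only if the zeros are stored or computed sparsely. The structured result is a genuinely smaller dense matrix.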

Read more about pruning here.

Knowledge Distillation

Researchers from Cornell University observed that the training model is usually larger than the inference model, since models are trained without restrictions on computational resources. The whole purpose of training is to extract as much information and structure from the dataset as possible. Inference models, however, face latency and resource constraints because they have to be deployed to produce results, so ways to compress them are required.


The researchers proposed that all the information gathered by the large training model can be transferred to a smaller model by training it to copy or mimic the larger model, an approach later named distillation.

In this technique, the trained model is called the “teacher” and the smaller model is called the “student”. The student is trained to minimise a loss function on both the ground-truth labels and the soft labels produced by the teacher, that is, the distribution of class probabilities output by the teacher’s softmax function.
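One common way to write such a student loss is sketched below for a single example. The article does not give a formula, so the temperature, the weighting factor `alpha`, the temperature-squared scaling of the soft term, and the sample logits are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Class probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                             # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_class,
                      temperature=2.0, alpha=0.5):
    # Hard-label term: cross-entropy against the ground-truth class.
    hard = -math.log(softmax(student_logits)[true_class])
    # Soft-label term: KL divergence between the softened teacher and student outputs.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    return alpha * hard + (1 - alpha) * (temperature ** 2) * soft


teacher_logits = [4.0, 1.0, 0.5]                # illustrative values
good_student = [3.5, 1.2, 0.4]                  # agrees with the teacher
poor_student = [0.5, 3.0, 1.0]                  # disagrees with the teacher

loss_good = distillation_loss(good_student, teacher_logits, true_class=0)
loss_poor = distillation_loss(poor_student, teacher_logits, true_class=0)
print(loss_good < loss_poor)    # True
```

A student that mimics the teacher and predicts the correct class gets a lower loss, which is what training pushes it towards.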

Click here to check out a research paper about knowledge distillation.

Low-rank tensor decomposition

Over-parameterisation is one of the well-known issues in deep neural networks. A lot of repetitive, similar, and redundant results can occur between different layers during training, especially in convolutional neural networks for computer vision tasks. This technique reduces that redundancy by approximating the numerous layers with smaller, lower-rank ones, thus reducing the memory footprint of the network and resulting in highly efficient systems.


Also known as low-rank factorisation, this technique is an effective means of achieving significant reductions in size and latency by reducing the number of parameters. Its biggest advantage for compression is that it does not require specialised hardware, since it is concerned only with reducing the parameter count.
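The parameter saving can be illustrated with a toy layer. The sketch below assumes the factors are already known and that the weight matrix has exactly rank 1, so the factorised layer reproduces the original output exactly. In practice the factors are usually obtained with a method such as truncated singular value decomposition and only approximate the original weights. All names and values are made up for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def count_params(*matrices):
    """Total number of entries across the given matrices."""
    return sum(len(m) * len(m[0]) for m in matrices)


# Hypothetical factors of a 4 x 4 layer with rank 1, so that W = A B.
A = [[1.0], [2.0], [0.5], [-1.0]]               # 4 x 1
B = [[0.5, -1.0, 2.0, 0.0]]                     # 1 x 4
W = matmul(A, B)                                # 4 x 4

x = [1.0, 2.0, 3.0, 4.0]
full_output = matvec(W, x)                      # one large multiplication
factored_output = matvec(A, matvec(B, x))       # two small multiplications

print(count_params(W), count_params(A, B))      # 16 8
print(full_output == factored_output)           # True
```

For an m x n layer approximated with rank r, the parameter count drops from m * n to r * (m + n), which is a saving whenever r is small relative to m and n.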

Click here to read more about low-rank factorisation.

Mohit Pandey
Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.
