4 Key Techniques to Compress Machine Learning Models

Reducing the size of a model using these techniques can help reduce its inference time

Highly accurate machine learning models can be heavy, requiring a lot of computational power and thus slowing down inference. Speeding up inference by compressing these models into smaller ones is a widely practised technique. By making the parameters smaller or fewer, depending on the technique, the models can be made to use less RAM. Compression can also simplify the model, reducing latency compared to the original and thus increasing inference speed.

There are four heavily researched techniques popular for compressing machine learning models – 

  1. Quantisation
  2. Pruning
  3. Knowledge distillation
  4. Low-rank tensor decomposition

Quantisation

One of the most widely used methods for compressing models, quantisation involves decreasing the precision of the weights to improve efficiency. Representing the model weights with fewer bits reduces the size of the model while increasing the speed of its processing and inference.

Simply put, the technique involves mapping values from a larger set into a smaller set, so the output consists of a smaller range of values than the input, ideally losing as little information as possible. For example, reducing weights from 32-bit floating point to 8-bit integers might result in some loss of information, but it achieves the goal of reducing the size of the machine learning model, thus increasing efficiency.

The goal of this technique is to reduce the size and precision of the network without a noticeable loss in efficacy.
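As an illustration, here is a minimal sketch of post-training dynamic quantisation using PyTorch; the small feed-forward architecture is purely a hypothetical stand-in for any trained model.

import torch
import torch.nn as nn

# A small example network; this stands in for any trained model.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Post-training dynamic quantisation: the weights of the Linear layers are
# converted from 32-bit floats to 8-bit integers, shrinking the model and
# speeding up CPU inference.
quantised_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantised_model)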

You can read more about quantisation techniques for neural networks here.

Pruning

Unlike quantisation, which reduces the precision of the weights, pruning involves reducing the number of weights by removing connections between channels, filters, and neurons. Pruning was introduced because networks are often over-parameterised, resulting in multiple nodes encoding the same information.

In simple words, the process is about removing nodes to decrease the number of parameters. Depending on the task, there are two classifications of pruning –

Unstructured pruning removes individual weights or neurons. The process replaces the pruned entries in the weight matrices with zeros, increasing the network's sparsity, which is the ratio of zero to non-zero weights.

Structured pruning removes complete filters and channels. Since whole blocks of weights are removed from the matrices at once, it avoids the sparse connectivity patterns that unstructured pruning produces. Both variants are sketched below.
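The following sketch applies both variants to a single hypothetical linear layer using PyTorch's pruning utilities; the layer size and pruning amounts are arbitrary examples.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A hypothetical layer standing in for part of a trained network.
layer = nn.Linear(256, 128)

# Unstructured pruning: zero out the 30% of individual weights with the
# smallest magnitude, increasing the sparsity of the weight matrix.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Structured pruning: additionally remove 25% of entire rows (output
# neurons) at once, based on their L2 norm, so whole blocks of weights
# disappear together.
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)

# Fold the pruning masks into the weights permanently.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean()
print(f"Weight sparsity: {sparsity:.2%}")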

Read more about pruning here.

Knowledge Distillation

Researchers from Cornell University observed that the model used for training is usually larger than the model needed for inference, since training is typically done without restrictions on computational resources. The whole purpose of a trained model is to extract as much information and structure from the dataset as possible. Inference models, however, face latency and resource constraints once deployed, so ways to compress them are a requirement.

The researchers proposed that the information gathered by the large training model can be transferred to a smaller model by training it to copy, or mimic, the larger model's behaviour, an approach that was later named distillation.

In this technique, the trained model is called the “teacher” and the smaller model is called the “student”. The student is trained to minimise a loss function that combines the ground-truth labels with the teacher’s soft targets, i.e. the class-probability distribution produced by the teacher’s softmax.
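A minimal sketch of such a distillation loss, assuming PyTorch and hypothetical temperature and weighting values, could look like this:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    # Hard loss: standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft loss: KL divergence between the softened teacher and student
    # class-probability distributions (softmax with a temperature).
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # The student minimises a weighted sum of both terms.
    return alpha * hard_loss + (1 - alpha) * soft_loss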

Click here to check out a research paper about knowledge distillation.

Low-rank tensor decomposition

Over-parameterisation is one of the well-known issues in deep neural networks. A lot of repetitive, similar, and redundant information can occur across different layers during training, especially in convolutional neural networks for computer vision tasks. This technique reduces that redundancy by approximating the layers’ weight tensors with lower-rank factors, thus reducing the memory footprint of the network and resulting in highly efficient systems.

Also known as low-rank factorisation, this technique has proven to be an effective means of achieving significant reductions in size and latency by compressing the parameters. The biggest advantage of using it for compression is that it does not require specialised hardware, since it is concerned only with reducing the parameter count.
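As a rough illustration, the sketch below factorises a single fully connected layer with a truncated SVD; the layer size, rank, and helper name are hypothetical choices, not a prescribed method.

import torch
import torch.nn as nn

def factorise_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    # Approximate one Linear layer with two smaller ones via a truncated
    # SVD of its weight matrix: W ≈ U_r @ V_r.
    W = layer.weight.data                 # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)

    U_r = U[:, :rank] * S[:rank]          # (out_features, rank)
    V_r = Vh[:rank, :]                    # (rank, in_features)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = V_r
    second.weight.data = U_r
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

# A 1024x1024 layer (about 1M weights) becomes two layers with rank 64
# (about 131k weights), at the cost of some approximation error.
compressed = factorise_linear(nn.Linear(1024, 1024), rank=64)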

Click here to read more about low-rank factorisation.

Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.