
What is Apple’s Quant for Neural Networks Quantization

Quantization is the process of mapping high precision values (a large set of possible values) to low precision values (a smaller set of possible values). Quantization can be done on both the weights and activations of a model.

Large neural networks are difficult to use in production environments as they are memory intensive and slow during inference. Many successful deep learning models, such as Transformers, are now followed by lite versions that dramatically speed up inference by trading off some accuracy. In this article, let’s explore least squares quantization, an algorithm that speeds up large neural networks by quantizing them while keeping the accuracy gap from the non-quantized model small.

Hadi Pouransari, Zhucheng Tu, and Oncel Tuzel, researchers at Apple, introduced this approach in the paper “Least Squares Binary Quantization of Neural Networks”, published on 23rd March 2020.

Inference Time Reduction

We all agree that smaller models are better for practical purposes in terms of memory usage and inference time. But the performance of smaller models is often lower than that of large models. We need to speed up larger models without modifying them to an extent that hurts performance. The following are the main ways to improve the run time of models.

Hardware Optimization

Better hardware can be used for model inference, which can boost performance significantly. GPUs and TPUs (Tensor Processing Units) are examples of this kind of optimization.

Compiler Optimization

Compiler optimization means fusing operations or implementing them in a hardware-aware manner to reduce computational cost. Optimized kernels for dense and sparse matrix-vector multiplication are an example of such optimization, as the sketch below illustrates.
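
As a rough illustration (SciPy is not part of the article's setup; this is only to show the idea), a sparse matrix-vector kernel skips the zero entries that a naive dense kernel would still multiply:

 import numpy as np
 from scipy.sparse import random as sparse_random

 # A 95%-sparse weight matrix stored in CSR format, and a dense input vector.
 W = sparse_random(1000, 1000, density=0.05, format="csr", random_state=0)
 x = np.random.rand(1000)

 y_sparse = W @ x            # sparse kernel: only the non-zero entries are touched
 y_dense = W.toarray() @ x   # dense kernel: every entry is multiplied, zeros included

 # Both kernels compute the same result; the sparse one does far less work.
 assert np.allclose(y_sparse, y_dense)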

Model Optimization

Modifying the model architecture or compressing the model can also improve run time; such optimizations come under model optimization. Least squares quantization is an example of model optimization.

Quantization

We know that arithmetic on low precision numbers is faster. Hence, the precision of a model’s weights and activations is reduced to speed up the model. This can be done post-training, or during training so that the model can adjust to the lower precision. Precision can also be varied across the network to build a mixed precision model, i.e. each layer can use a different precision.

Quantization is the process of mapping high precision values (a large set of possible values) to low precision values (a smaller set of possible values). Quantization can be done on both the weights and activations of a model, and it reduces the complexity of the operations on these values. 1-bit quantization (each scalar in a tensor is represented by a single bit) can reduce convolutional layers to XNOR operations followed by counting the number of 1s (popcount).
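
To see why, here is a minimal NumPy sketch (not from the paper or its code) showing that a dot product between two ±1 vectors equals 2 × popcount(XNOR) − n once the ±1 values are encoded as bits:

 import numpy as np

 # Two ±1 vectors, as produced by 1-bit quantization.
 a = np.array([1, -1, -1, 1, 1, -1])
 b = np.array([1, 1, -1, -1, 1, 1])

 # Ordinary dot product on the ±1 values.
 dot = int(np.dot(a, b))

 # Encode +1 as bit 1 and -1 as bit 0, then use XNOR + popcount instead.
 bits_a = a > 0
 bits_b = b > 0
 agree = ~(bits_a ^ bits_b)               # XNOR: True where the signs agree
 popcount = int(np.count_nonzero(agree))  # number of agreeing positions

 # The multiply-accumulate reduces to bit operations and a count.
 assert dot == 2 * popcount - a.size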

There are several ways to do quantization. Scaled quantization is the method proposed in the paper. It decomposes the variable matrix into two parts: a scale v, and a matrix S of the same dimensions as the original whose entries are only {-1, +1}.

Thus we reduce the space of possible values for this variable. The following sections will make scaled quantization easier to understand.

Least Squares Quantization

Intuitively, quantization leads to a reduction in accuracy because the space of values the variables can take is reduced. But this accuracy drop can be kept small if the quantization achieves a better approximation of the original variables. To enforce this, we need to optimize another objective function apart from the model’s objective function.

This new function is the quantization error, and we want to minimize it. The Frobenius norm is one way to measure this error: if x is the original variable and xq is its quantized version, the quantization error is || x − xq ||².

Let’s see an example of 1-bit scaled quantization.

  • x is the input variable to quantization
  • xq is the output of quantization.
  • v is the scale of quantization.
  • s(x) is a function that gives -1 or +1.

Example: the vector [-5, -4, -3, 3, 4, 5], when quantized, becomes 4 * [-1, -1, -1, 1, 1, 1].

The least squares optimization of this quantization is

v*, s* = argmin over v and s of || x − v·s ||², with the entries of s restricted to {-1, +1}.

Solving this equation results in v being the mean of the absolute values of x (in the example above, (5 + 4 + 3 + 3 + 4 + 5) / 6 = 4) and s(x) being the sign function.
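
A minimal NumPy sketch of this 1-bit least squares quantizer (a simplified illustration, not the repo's implementation), reproducing the example above:

 import numpy as np

 def ls1_quantize(x):
     # 1-bit least-squares quantization: x is approximated by v * sign(x).
     v = np.mean(np.abs(x))   # optimal scale = mean of the absolute values
     s = np.sign(x)           # ±1 tensor of the same shape as x
     return v, s

 x = np.array([-5., -4., -3., 3., 4., 5.])
 v, s = ls1_quantize(x)
 xq = v * s
 print(v, s)                   # 4.0 and [-1. -1. -1.  1.  1.  1.]
 print(np.sum((x - xq) ** 2))  # quantization error || x - xq ||^2 = 4.0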

This can be generalized to k-bit quantization. The difference is that we then have k pairs of scales and sign functions, and the quantized value is the sum v1·s1(x) + v2·s2(x) + … + vk·sk(x).

For k-bit quantization, we can find the v’s and s’s with a greedy algorithm: quantize x with 1 bit, compute the residual r = x − v1·s1, quantize the residual with 1 bit, and so on.

This is analogous to gradient boosting; at each step we approximate the current value, compute the residual, and then try to approximate the residual again. This process is repeated k times.
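
A short NumPy sketch of this greedy residual scheme (a simplified illustration under the definitions above, not the repo's implementation; the paper also derives jointly optimized variants that this greedy version only approximates):

 import numpy as np

 def greedy_kbit_quantize(x, k):
     # Greedy k-bit quantization: repeatedly 1-bit quantize the residual.
     residual = x.astype(float)
     scales, signs = [], []
     for _ in range(k):
         v = np.mean(np.abs(residual))           # least-squares scale for this bit
         s = np.where(residual >= 0, 1.0, -1.0)  # strict ±1 sign tensor
         scales.append(v)
         signs.append(s)
         residual = residual - v * s             # what is still left to approximate
     # Reconstruction: the sum of v_i * s_i over all k bits.
     xq = sum(v * s for v, s in zip(scales, signs))
     return xq

 x = np.array([-5., -4., -3., 3., 4., 5.])
 for k in (1, 2, 3):
     xq = greedy_kbit_quantize(x, k)
     print(k, np.sum((x - xq) ** 2))   # the error shrinks as k grows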

Quantization Example

Enough theory about quantization. Let’s see how it affects a model on the CIFAR-100 dataset.

CIFAR-100 contains colour images that need to be classified into one of 100 classes.

Setup 

 # Clone Apple's ml-quant repository and move into it (Colab/Jupyter cell).
 !git clone https://github.com/apple/ml-quant.git
 %cd /content/ml-quant
 # Install the dependencies plus flit, which the repo uses for packaging.
 !pip install -U pip wheel
 !pip install -r requirements.txt
 !pip install flit
 # Install the quant package itself in symlink (editable) mode.
 import os
 os.environ['FLIT_ROOT_INSTALL'] = '1'
 !flit install -s

Let’s see how a full precision ResNet model performs on this dataset. We can train the ResNet using the following command:

python examples/cifar100/cifar100.py --config examples/cifar100/cifar100_fp.yaml --experiment-name cifar100-fp

Now let us see how a quantized model compares to this model. We will use knowledge distillation to teach the quantized model, with the full precision model acting as the teacher.

To make the quantized model refer to the full precision model, we need to edit the config file and set the teacher path to the checkpoint trained above, as sketched below.
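
One way to do this programmatically is sketched here. Note that both the key name (teacher_path) and the checkpoint location are placeholders for illustration; check the actual field names and artifact paths in examples/cifar100/cifar100_ls1_weight_ls2_activation_kd.yaml before running this.

 import yaml

 config_path = "examples/cifar100/cifar100_ls1_weight_ls2_activation_kd.yaml"

 with open(config_path) as f:
     config = yaml.safe_load(f)

 # Placeholder key and path: use the field name and checkpoint location
 # that your config and the full precision training run actually produced.
 config["teacher_path"] = "<path-to-cifar100-fp-checkpoint>"

 with open(config_path, "w") as f:
     yaml.safe_dump(config, f)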

The command to train the quantized model:

!python examples/cifar100/cifar100.py --config examples/cifar100/cifar100_ls1_weight_ls2_activation_kd.yaml --experiment-name cifar100-ls2

Results

[Training curves: loss, accuracy of the top predicted class (top-1), and accuracy of the top 5 predicted classes (top-5), for the full precision and quantized models.]

From these results, we can see that the quantized model achieves nearly the same results as the original full precision ResNet model on the CIFAR-100 dataset.

Conclusion

Several real-world use cases of machine learning require products with a latency of no more than a few seconds. Trading off a little accuracy for this speed is possible with techniques like quantization. We should explore such techniques to broaden our machine learning toolkit.

References

  • Hadi Pouransari, Zhucheng Tu, Oncel Tuzel. “Least Squares Binary Quantization of Neural Networks”, 2020.
  • Apple ml-quant repository: https://github.com/apple/ml-quant
