The TensorFlow Model Optimization team at Google recently released the Quantization Aware Training (QAT) API as part of the TensorFlow Model Optimization Toolkit. According to the team, the API enables training and deploying machine learning models that are smaller and faster while retaining close to their original accuracy.
Quantization is the technique of transforming a machine learning model into an equivalent representation that uses parameters and computations at a lower precision. This improves the execution performance and efficiency of the model. It also allows the model to run on specialized neural accelerators, such as the Edge TPU in Coral, which often support only a restricted set of data types.
Quantization Aware Training (QAT) API
Quantization Aware Training emulates inference-time quantization during training, creating a model that downstream tools then use to produce the actual quantized models. These quantized models use lower precision (for example, 8-bit integers instead of 32-bit floats), which provides benefits when the model is deployed. The technique is used in production in speech, vision, text, and translation use cases.
The team trained QAT models with the default TensorFlow Lite configuration and compared their accuracy with the floating-point baseline and with post-training quantized models. The QAT-trained models showed accuracy comparable to the floating-point baseline.
Why Use QAT API
As mentioned earlier, quantization transforms a machine learning model into an equivalent representation that uses parameters and computations at a lower precision. However, going from higher to lower precision is lossy and tends to introduce noise.
This is because quantization squeezes a small range of floating-point values into a fixed number of information buckets. After quantization, the parameters or weights of a model can take only a small set of values, and the minute differences between them are lost. This, in turn, leads to information loss and introduces computational errors.
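To make the bucket idea concrete, here is a minimal sketch in plain Python. The weight range of [-3.0, 3.0], the 8-bit width, and the function names are assumptions chosen for illustration; they are not taken from the QAT API itself.

```python
def quantize(values, lo, hi, num_bits=8):
    """Map floats in [lo, hi] to integer bucket indices 0 .. 2**num_bits - 1."""
    levels = 2 ** num_bits - 1           # 255 steps between 256 buckets
    scale = (hi - lo) / levels           # width of one bucket
    buckets = [round((min(max(v, lo), hi) - lo) / scale) for v in values]
    return buckets, scale


def dequantize(buckets, lo, scale):
    """Map bucket indices back to the float value each bucket represents."""
    return [lo + b * scale for b in buckets]


# Hypothetical weights: the first two differ only in the fourth decimal place.
weights = [0.1000, 0.1004, -2.5, 2.9999]
buckets, scale = quantize(weights, lo=-3.0, hi=3.0)
restored = dequantize(buckets, lo=-3.0, scale=scale)

for w, b, r in zip(weights, buckets, restored):
    print(f"{w:+.4f} -> bucket {b:3d} -> {r:+.4f} (error {abs(w - r):.4f})")
```

The first two weights land in the same bucket, so the difference between them disappears after quantization, and every restored value is off by up to half a bucket width.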
Quantization Aware Training addresses this loss by simulating low-precision inference-time computation in the forward pass of the training process. As a result, the model learns parameters that are more robust to quantization.
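The idea of simulating low precision in the forward pass can be sketched without any framework. The snippet below is an illustration under stated assumptions, not the toolkit's implementation: `fake_quant`, `dense_forward`, the [-1.0, 1.0] weight range, and the sample values are all hypothetical.

```python
def fake_quant(x, lo, hi, num_bits=8):
    """Quantize then immediately dequantize.

    The result is still a float, but it can only take 2**num_bits distinct
    values, which mimics what the weight will look like at inference time.
    """
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels
    clipped = min(max(x, lo), hi)
    return lo + round((clipped - lo) / scale) * scale


def dense_forward(inputs, weights, bias, lo=-1.0, hi=1.0):
    """Forward pass of a single neuron using simulated low-precision weights."""
    q_weights = [fake_quant(w, lo, hi) for w in weights]
    return sum(i * w for i, w in zip(inputs, q_weights)) + bias


inputs = [0.5, -1.0, 2.0]
weights = [0.1234, -0.5678, 0.9]

float_out = sum(i * w for i, w in zip(inputs, weights))
qat_out = dense_forward(inputs, weights, bias=0.0)
quantization_error = abs(float_out - qat_out)
print(f"float: {float_out:.5f}  simulated int8: {qat_out:.5f}  "
      f"error: {quantization_error:.5f}")
```

Because the loss is computed from the output of the quantized forward pass, the optimizer is pushed toward weights whose rounded versions still perform well. Frameworks typically treat the rounding step as an identity function in the backward pass so that gradients can flow through it.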
Features of Quantization Aware Training
The goal of this API is to reduce model size, latency, and power consumption with negligible accuracy loss. Quantization Aware Training (QAT) is used in production for speech, vision, text, and translation use cases. According to the team, the tool can also be useful for researchers and hardware designers who want to experiment with various quantization strategies and simulate how quantization affects accuracy for different hardware backends.
The QAT API is flexible and capable of handling complicated use cases. For instance, this API allows a user to control quantization precisely within a layer, create custom quantization algorithms, and handle any custom layers that have been written.
The QAT API provides a simple and highly flexible way to quantize any TensorFlow Keras model, which makes it easy to train with “quantization awareness” for an entire model or only parts of it, then export it for deployment with TensorFlow Lite.
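As a rough sketch of quantizing only part of a model, the function below annotates a single layer and leaves the rest in floating point. It uses `quantize_annotate_layer` and `quantize_apply` from `tfmot.quantization.keras`. The toy architecture and the function name are hypothetical, and the code assumes a TensorFlow 2.x install where `tf.keras` is Keras 2, matching the compatibility notes in this article.

```python
def build_partially_quantized_model():
    # Imported inside the function so the snippet can be loaded even when
    # TensorFlow is not installed.
    import tensorflow as tf
    import tensorflow_model_optimization as tfmot

    annotate = tfmot.quantization.keras.quantize_annotate_layer

    # Hypothetical toy classifier: only the first Dense layer is marked
    # for quantization-aware training.
    annotated_model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        annotate(tf.keras.layers.Dense(32, activation="relu")),
        tf.keras.layers.Dense(10),  # stays in floating point
    ])

    # quantize_apply rewrites only the annotated layers.
    return tfmot.quantization.keras.quantize_apply(annotated_model)
```

The returned model can be compiled and trained like any other Keras model. Layers that were not annotated are left unchanged.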
Steps To Quantize the Entire Keras Model
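The following sketch walks through the usual sequence: build a float model, wrap it with `tfmot.quantization.keras.quantize_model`, recompile, fine-tune, and convert with the TensorFlow Lite converter. The toy architecture, the function names, and the training data mentioned in the comments are hypothetical, and the code assumes a TensorFlow 2.x install where `tf.keras` is Keras 2.

```python
def build_quantization_aware_model():
    # Imported inside the function so the snippet can be loaded even when
    # TensorFlow is not installed.
    import tensorflow as tf
    import tensorflow_model_optimization as tfmot

    # Step 1: define (or load) an ordinary float Keras model.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(10),
    ])

    # Step 2: make the whole model quantization aware.
    q_aware_model = tfmot.quantization.keras.quantize_model(model)

    # Step 3: the wrapped model must be compiled again before training.
    q_aware_model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

    # Step 4 (not run here): fine-tune on your own data, for example
    # q_aware_model.fit(train_images, train_labels, epochs=1)
    return q_aware_model


def convert_to_tflite(q_aware_model):
    """Step 5: produce a quantized TensorFlow Lite model as bytes."""
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    return converter.convert()
```

The bytes returned by `convert_to_tflite` can be written to a `.tflite` file for deployment. Note that the model returned by `build_quantization_aware_model` still stores its weights as floats. The actual low-precision model is created at conversion time.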
API Compatibility
Users can apply quantization with the following APIs:
- Model building: tf.keras with only Sequential and Functional models
- TensorFlow versions: TF 2.x for tf-nightly
- TensorFlow execution mode: eager execution
Wrapping Up
By default, the QAT API is configured to work with the quantized execution support available in TensorFlow Lite. The TensorFlow team also plans to extend the QAT API in several areas: model building, to clarify that subclassed models have limited to no support; distributed training; model coverage, to include RNN/LSTMs and general Concat support; and hardware acceleration, to ensure the TFLite converter can produce full-integer models.