Earlier this week, Hugging Face launched a new open-source library called Optimum, an optimisation toolkit for transformers at scale. The toolkit aims to let users train and run models with maximum efficiency on specific hardware.
The source code for Optimum is available on GitHub.
From Tesla's Autopilot to Gmail completing your sentences, Facebook translating your posts on the fly, and Bing answering your natural language queries, transformer models make billions of predictions every day.
Transformers have brought a step-change improvement in the accuracy of machine learning models. They have conquered NLP and are now expanding to other areas, including speech and vision. However, despite these advances, taking the models into production and making them run fast at scale is still a challenge for any machine learning engineering team.
With Optimum, Hugging Face looks to build a definitive toolkit for transformers production performance and help scale transformers.
Here’s how it works
To get optimal performance when training and serving models, the model acceleration methods need to be compatible with the targeted hardware, because the hardware platform has a huge impact on performance. To take advantage of advanced model acceleration methods such as sparsity and quantisation, optimised kernels need to be compatible with the operators on silicon. They also need to be specific to the neural network graph derived from the model architecture.
Working through this three-dimensional compatibility matrix of acceleration method, hardware and model architecture, and learning how to use model acceleration libraries, is cumbersome. Hugging Face’s open-source library Optimum looks to make this work easy by providing performance optimisation tools targeting efficient AI hardware, developed in collaboration with its hardware partners.
With its Transformers library, Hugging Face claimed to have made it easy for researchers and engineers to use SOTA models, eliminating the complexity of architectures, frameworks, and pipelines. Similarly, with Optimum, the team looks to make it easy for engineers to leverage all the available hardware features at their disposal, removing the complexity of model acceleration on hardware platforms.
How Optimum is different
In its blog post, Hugging Face showed the use of Optimum and how to quantise a model for Intel Xeon CPU. While pre-trained language models like BERT have achieved SOTA results on a wide range of natural language processing tasks, transformers such as ViT and Speech2Text have achieved SOTA results on computer vision and speech tasks.
Putting transformer-based models into production is tricky and expensive because they need a lot of computing power. One of the most popular techniques to address this problem is quantisation. However, it requires a lot of work. Here’s why:
- The model needs to be edited: Some ops need to be replaced by their quantised counterparts, new ops need to be inserted, and others need to be adapted to the fact that weights and activations will be quantised.
- Once edited, there are many parameters to tune to find the best quantisation settings. Some of the questions include:
  - Which kind of observers should be used for range calibration?
  - Which quantisation scheme needs to be used?
  - Does the targeted device support int8, or should it stay in uint8?
- Balance the trade-off between quantisation and an acceptable accuracy loss.
- Export the quantised model for the target device.
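To make the range-calibration and data-type questions above concrete, here is a minimal sketch in plain Python of a min/max observer followed by affine quantisation. It is an illustration only, not code from Optimum, LPOT, PyTorch or TensorFlow; the names `QuantParams`, `calibrate_minmax`, `quantize` and `dequantize` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class QuantParams:
    """Affine quantisation parameters (hypothetical helper for this sketch)."""
    scale: float
    zero_point: int
    qmin: int
    qmax: int


def calibrate_minmax(values: Sequence[float], signed: bool = True) -> QuantParams:
    """Min/max observer: derive scale and zero point from observed values."""
    qmin, qmax = (-128, 127) if signed else (0, 255)  # int8 or uint8
    # Include 0.0 in the range so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return QuantParams(scale, zero_point, qmin, qmax)


def quantize(values: Sequence[float], p: QuantParams) -> List[int]:
    """Map floats to integers, clamping to the target integer range."""
    return [max(p.qmin, min(p.qmax, round(v / p.scale) + p.zero_point)) for v in values]


def dequantize(q: Sequence[int], p: QuantParams) -> List[float]:
    """Map integers back to approximate float values."""
    return [(x - p.zero_point) * p.scale for x in q]


if __name__ == "__main__":
    weights = [-0.8, -0.1, 0.0, 0.4, 1.2]  # toy values, not real model weights
    for signed in (True, False):
        params = calibrate_minmax(weights, signed=signed)
        print(params, quantize(weights, params))
```

Switching `signed` between `True` and `False` moves the integer range between int8 and uint8 without changing the scale; only the zero point shifts. Real toolkits offer further observer types and schemes beyond this simple min/max variant.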
While TensorFlow and PyTorch have made great progress in making quantisation easier, the Hugging Face team said the complexity of transformer-based models makes it hard to use the provided tools out of the box and get something working without a great deal of effort.
Hugging Face pointed to Intel's Low Precision Optimization Tool (LPOT) as a way to handle quantisation and more. LPOT is an open-source library designed to help users deploy low-precision inference solutions, particularly for deep learning models, while meeting product objectives such as inference performance and memory usage.
LPOT supports post-training quantisation, quantisation-aware training and dynamic quantisation. To specify the quantisation approach, objective, and performance criteria, the user provides a YAML configuration file containing the tuning parameters.
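As a rough illustration of how an approach, an objective and an accuracy criterion could drive quantisation tuning, the sketch below tries candidate settings in order and keeps the first one whose accuracy loss stays within a tolerance. Everything here is hypothetical: the keys in `TUNING_CONFIG` are not LPOT's YAML schema, `tune` is not an LPOT or Optimum function, and the accuracy figures in the usage example are made-up placeholders.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical tuning settings; the keys are illustrative, not LPOT's schema.
TUNING_CONFIG = {
    "approach": "post_training_dynamic",
    "objective": "performance",
    "max_relative_accuracy_loss": 0.01,  # tolerate at most a 1% relative drop
    "max_trials": 10,
}


def tune(
    baseline_accuracy: float,
    candidates: Iterable[Dict[str, str]],
    evaluate: Callable[[Dict[str, str]], float],
    config: Dict[str, object],
) -> Optional[Tuple[Dict[str, str], float]]:
    """Return the first candidate within the tolerated accuracy loss, or None.

    Assumes baseline_accuracy is positive.
    """
    tolerance = float(config["max_relative_accuracy_loss"])
    max_trials = int(config["max_trials"])
    for trial, candidate in enumerate(candidates):
        if trial >= max_trials:
            break
        accuracy = evaluate(candidate)
        relative_loss = (baseline_accuracy - accuracy) / baseline_accuracy
        if relative_loss <= tolerance:
            return candidate, accuracy
    return None


if __name__ == "__main__":
    # Placeholder accuracies standing in for a real evaluation run.
    placeholder_accuracy = {"quantise_all_ops": 0.80, "skip_sensitive_ops": 0.915}
    candidates = [{"name": "quantise_all_ops"}, {"name": "skip_sensitive_ops"}]
    print(tune(0.92, candidates, lambda c: placeholder_accuracy[c["name"]], TUNING_CONFIG))
```

In practice, `evaluate` would quantise the model with the candidate settings and score it on a validation set, which is where most of the tuning time goes.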
Hugging Face's blog post includes a short code snippet showing how to quantise transformers for Intel Xeon CPUs with Optimum.
The Hugging Face team said that Optimum will focus on achieving optimal production performance on dedicated hardware, where software and hardware methods are applied to increase efficiency. Further, they said they would collaborate with many hardware partners to enable, test, and maintain acceleration. In the coming months, Hugging Face said that it will be announcing its hardware partners.
“The collaboration with our hardware partners will yield hardware-specific optimised model configuration and artefacts, which we will make available to the AI community via the Hugging Face Model Hub,” said the Hugging Face team.