NVIDIA recently released the eighth generation of its popular AI software, TensorRT, which cuts inference time in half for language queries, enabling developers to build the best-performing search engines, ad recommendations and chatbots and deliver them from the cloud to the edge.
TensorRT 8 is now generally available and free of charge to members of the NVIDIA developer programme. The latest versions of plug-ins, samples, and parsers are available on the TensorRT GitHub repository.
The latest version of TensorRT brings BERT-Large inference latency down to 1.2 milliseconds with new optimisations. BERT-Large is one of the world’s most widely used transformer-based models.
Sparsity is a performance technique supported on NVIDIA Ampere GPUs that increases efficiency by reducing the number of computational operations a neural network performs. Quantization-aware training, meanwhile, lets developers run trained models at INT8 precision without losing accuracy, which significantly reduces compute and storage overhead for efficient inference on Tensor Cores.
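To make the INT8 idea concrete, here is a minimal sketch of the quantize-then-dequantize round trip that quantization-aware training typically simulates during the forward pass. It is an illustration only, not TensorRT's implementation: the symmetric per-tensor scaling scheme, the function names and the sample weights are all assumptions chosen for the example.

```python
# Illustrative sketch only: symmetric, per-tensor INT8 "fake quantization".
# The scheme, function names and sample values are assumptions for this
# example, not TensorRT APIs.

def quantize_int8(values, scale):
    """Map floats to integer codes, clamped to the symmetric range [-127, 127]."""
    codes = []
    for v in values:
        q = round(v / scale)
        codes.append(max(-127, min(127, q)))
    return codes


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def fake_quantize(values):
    """Quantize and immediately dequantize, exposing the rounding error
    that a model would see at INT8 precision."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = quantize_int8(values, scale)
    return codes, dequantize(codes, scale), scale


weights = [0.82, -1.27, 0.003, 0.41, -0.66]  # made-up sample weights
codes, approx, scale = fake_quantize(weights)
max_err = max(abs(w - a) for w, a in zip(weights, approx))

print("INT8 codes:", codes)
print("largest round-trip error:", max_err)
```

Because the model is exposed to this rounding error while it trains, its weights can adapt to it, which is the intuition behind retaining accuracy at INT8.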
The latest release of the high-performance deep learning inference SDK, TensorRT 8, includes:
- BERT-Large inference in 1.2 milliseconds with new transformer optimisations
- Accuracy equivalent to FP32 at INT8 precision using quantization-aware training
- Support for sparsity for faster inference on Ampere GPUs
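As background on the sparsity item: the pattern that Ampere Tensor Cores accelerate is 2:4 fine-grained structured sparsity, in which at most two of every four consecutive weights are non-zero. The article does not spell this out, so the sketch below is an assumption-laden illustration of one simple way to obtain that pattern (magnitude pruning); the function name and sample values are hypothetical, and this is not a TensorRT API.

```python
# Illustrative sketch only: magnitude-based 2:4 pruning of one weight row.
# prune_2_of_4 and the sample row are hypothetical, not part of TensorRT.

def prune_2_of_4(row):
    """Zero the two smallest-magnitude weights in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]  # made-up weights
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

In practice a pruned network is usually fine-tuned afterwards to recover accuracy; the sketch covers only the pruning step.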
TensorRT is an SDK for high-performance deep learning inference. It includes a deep learning inference optimiser and runtime that deliver low latency and high throughput for deep learning inference applications. TensorRT-based applications perform up to 40x faster than CPU-only platforms during inference.
In the last five years, TensorRT has been downloaded nearly 2.5 million times and used by more than 350K developers across 27.5K companies in wide-ranging industries, including healthcare, automotive, finance and retail. It can be deployed in hyperscale data centres as well as on embedded or automotive product platforms.
With eighth-generation TensorRT, companies can double or triple their model size to achieve dramatic improvements in accuracy.
TensorRT impact on businesses
According to Tractica, the global AI market is expected to reach $118 billion in revenue by 2025. Gartner reports that the percentage of enterprises employing AI grew 270% over the last four years. Nearly 95% of customer interactions will be powered by artificial intelligence, according to Servion Global Solutions.
Greg Estes, vice president of developer programmes at NVIDIA, said it is becoming imperative for enterprises to deploy state-of-the-art (SOTA) inference solutions as AI models grow more complex. Globally, demand for real-time applications that use AI is rising significantly.
“The latest version of TensorRT introduces new capabilities that enable companies to deliver conversational AI applications to their customers with a level of quality and responsiveness that was never before possible,” said Estes.
Today, many companies are embracing TensorRT for their deep learning inference applications in conversational AI. American Express, GE Healthcare, Ford, IBM and others are some of the leading adopters of TensorRT.
Jeff Boudier, product director at Hugging Face, said the company collaborated with NVIDIA to deliver the best possible performance for SOTA models on GPUs. The Hugging Face Accelerated Inference API delivers up to 100x speedup for transformer models powered by NVIDIA GPUs. “With TensorRT 8, Hugging Face achieved 1-millisecond inference latency on BERT, and we are excited to offer this performance to our customers later this year,” added Boudier.
Hugging Face is an open-source AI community platform closely working with NVIDIA to introduce groundbreaking AI services that enable text analysis, neural search and conversational applications at scale.
Erik Steen, chief engineer of Cardiovascular Ultrasound at GE Healthcare, said that clinicians spend valuable time selecting and measuring ultrasound images, so his R&D team was looking for ways to efficiently implement automated cardiac view detection on its Vivid E95 scanner.
“The cardiac view recognition algorithm selects appropriate images for analysis of cardiac wall motion. TensorRT, with its real-time inference capabilities, improves the performance of the view detection algorithm, and it also shortened our time to market during the R&D project,” added Steen.
GE Healthcare, a global medical technology pioneer, is currently using TensorRT to help accelerate computer vision applications for ultrasound. Its intelligent healthcare solutions enable clinicians and practitioners to deliver the highest quality of care.
Finally, one of China’s biggest social media platforms, WeChat, is accelerating its search capabilities using TensorRT. “The conventional limitation of NLP model complexity has been broken through by our solution with GPU + TensorRT and BERT/transformer can be fully integrated into our solution. In addition, we have achieved significant reduction (70%) in allocated computational resources using superb performance optimisation methods,” according to the WeChat Search team.