NVIDIA Releases Eighth Generation Of Its Popular Conversational AI Software TensorRT

The latest version of TensorRT brings BERT-Large inference latency down to 1.2 milliseconds.

Published on July 23, 2021

by Amit Raja Naik

NVIDIA recently released the eighth generation of its popular AI software, TensorRT, which cuts inference time in half for language queries. This enables developers to build the best-performing search engines, ad recommendations and chatbots and deliver them from the cloud to the edge.

TensorRT 8 is now generally available and free of charge to members of the NVIDIA developer programme. The latest versions of plug-ins, samples, and parsers are available on the TensorRT GitHub repository.

What’s new?

The latest version of TensorRT brings BERT-Large inference latency down to 1.2 milliseconds with new optimisations. BERT-Large is one of the world’s most widely used transformer-based models.

Further, with quantization-aware training it achieves accuracy equivalent to FP32 at INT8 precision, and it delivers significantly higher performance through support for sparsity.

Sparsity is a performance technique in NVIDIA Ampere GPUs that increases efficiency, allowing developers to speed up their neural networks by reducing computational operations. Quantization-aware training, meanwhile, enables developers to run inference with trained models in INT8 precision without losing accuracy. This significantly reduces compute and storage overhead for efficient inference on Tensor Cores.
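To make the INT8 idea concrete, here is a minimal sketch of the "fake quantization" step that quantization-aware training typically simulates during training: values are rounded onto an 8-bit grid and mapped back to floats, so the model learns to tolerate the rounding error. The function names, the symmetric per-tensor scale and the sample weights are assumptions made for illustration; this is not TensorRT's API or NVIDIA's implementation.

```python
def quantize_int8(values, scale):
    """Map floats to signed 8-bit integers using a symmetric scale."""
    quantized = []
    for v in values:
        q = round(v / scale)
        q = max(-127, min(127, q))  # clamp to the symmetric INT8 range
        quantized.append(q)
    return quantized


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in quantized]


def fake_quantize(values):
    """Quantize then dequantize, exposing the rounding error to training."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # assumed per-tensor scale
    codes = quantize_int8(values, scale)
    return codes, dequantize(codes, scale), scale


# Hypothetical weights, chosen only to illustrate the round trip.
weights = [0.62, -1.27, 0.031, 0.9, -0.45]
codes, approx, scale = fake_quantize(weights)
max_error = max(abs(a - w) for a, w in zip(approx, weights))

print("INT8 codes:", codes)
print("largest round-trip error:", max_error)
```

Each weight is stored in one byte instead of four, and the round-trip error stays within half a quantization step, which is why a model trained with this step in the loop can hold its accuracy at INT8.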

TensorRT 8, the latest release of the high-performance deep learning inference SDK, includes:

BERT Inference in 1.2 milliseconds with new transformer optimisations
Achieves accuracy equivalent to FP32 with INT8 precision using Quantization aware training
Supports Sparsity for faster inference on Ampere GPUs
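The article does not spell out the sparsity pattern. NVIDIA's Ampere documentation describes it as 2:4 fine-grained structured sparsity, in which at most two of every four consecutive weights are non-zero. The sketch below shows that pattern under a simple assumption, namely that the two smallest-magnitude weights in each group are the ones pruned. The function name and sample values are hypothetical, and real workflows normally retrain the network after pruning.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


# Hypothetical row of a weight matrix.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
sparse_row = prune_2_of_4(row)
zero_fraction = sparse_row.count(0.0) / len(sparse_row)

print(sparse_row)
print("fraction of zeros:", zero_fraction)
```

Because exactly half of the weights in every group are zero, the hardware can skip those multiplications in a predictable way, which is where the faster inference comes from.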

TensorRT is an SDK for high-performance deep learning inference. It includes a deep learning inference optimiser and runtime that delivers low latency and high throughput for deep learning inference applications. TensorRT-based applications perform up to 40x faster than CPU-only platforms during inference.
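Latency and throughput are related by simple arithmetic, sketched below using the 1.2-millisecond BERT-Large figure quoted in this article. The batch size of one and the single sequential stream are assumptions, since the article does not state the measurement conditions, and the helper name is hypothetical.

```python
def sequential_throughput(latency_ms, batch_size=1):
    """Inferences per second when requests run back to back on one stream."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return batch_size * 1000.0 / latency_ms


# 1.2 ms is the latency quoted in the article; batch size 1 is assumed.
per_second = sequential_throughput(1.2)
print(f"about {per_second:.0f} inferences per second on one stream")
```

Under those assumptions a single stream handles roughly 833 requests per second. Real deployments usually batch requests or run several streams at once, so actual throughput depends on the configuration.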

In the last five years, TensorRT has been downloaded nearly 2.5 million times and used by more than 350K developers across 27.5K companies in wide-ranging industries, including healthcare, automotive, finance and retail. In addition, it can be deployed in hyperscale data centres and on embedded or automotive product platforms.

With eighth-generation TensorRT, companies can double or triple their model size to achieve dramatic improvements in accuracy.

TensorRT impact on businesses

According to Tractica, the global AI market is expected to touch a revenue of $118 billion by 2025. As per Gartner, the percentage of enterprises employing AI grew 270% over the last four years. Nearly 95% of customer interactions will be powered by artificial intelligence, according to Servion Global Solutions.

Greg Estes, vice president of developer programmes at NVIDIA, said it becomes imperative for enterprises to deploy SOTA inference solutions as AI models are becoming more complex. Globally, the demand for real-time applications that use AI is rising significantly.

“The latest version of TensorRT introduces new capabilities that enable companies to deliver conversational AI applications to their customers with a level of quality and responsiveness that was never before possible,” said Estes.

Today, many companies are embracing TensorRT for their deep learning inference applications in conversational AI. American Express, GE Healthcare, Ford, IBM and others are some of the leading adopters of TensorRT.

Jeff Boudier, product director at Hugging Face, said they collaborated with NVIDIA to deliver the best possible performance for SOTA models on GPUs. Hugging Face Accelerated Inference API delivers up to 100x speedup for transformer models powered by NVIDIA GPUs. “With TensorRT 8, Hugging Face achieved 1-millisecond inference latency on BERT, and we are excited to offer this performance to our customers later this year,” added Boudier.

Hugging Face is an open-source AI community platform closely working with NVIDIA to introduce groundbreaking AI services that enable text analysis, neural search and conversational applications at scale.

Erik Steen, chief engineer for Cardiovascular Ultrasound at GE Healthcare, said that clinicians spend valuable time selecting and measuring ultrasound images, so the R&D team was looking for ways to efficiently implement automated cardiac view detection on its Vivid E95 scanner.

“The cardiac view recognition algorithm selects appropriate images for analysis of cardiac wall motion. TensorRT, with its real-time inference capabilities, improves the performance of the view detection algorithm, and it also shortened our time to market during the R&D project,” added Steen.

GE Healthcare, a pioneer in global medical technology, is currently using TensorRT to help accelerate computer vision applications for ultrasound. Its intelligent healthcare solutions enable clinicians and practitioners to deliver the highest quality of care.

Finally, one of China’s biggest social media platforms, WeChat, is accelerating its search capabilities using TensorRT. “The conventional limitation of NLP model complexity has been broken through by our solution with GPU + TensorRT and BERT/transformer can be fully integrated into our solution. In addition, we have achieved significant reduction (70%) in allocated computational resources using superb performance optimisation methods,” according to the WeChat Search team.

Amit Raja Naik

Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.