NVIDIA TensorRT-LLM Updates Boost Inference on H200 GPUs

These enhancements deliver a 6.7x speedup for the Llama 2 70B LLM and allow the massive Falcon-180B to run on a single GPU.

NVIDIA TensorRT-LLM has introduced optimisations for peak throughput and memory efficiency, resulting in significant gains in LLM inference performance. The latest TensorRT-LLM improvements on NVIDIA H200 GPUs deliver up to a 6.7x speedup for Llama 2 70B inference over the A100.

Notably, these enhancements also enable massive models, such as Falcon-180B, to run efficiently on a single GPU, a task that previously required at least eight NVIDIA A100 Tensor Core GPUs.

The Llama 2 70B acceleration is attributed to the optimisation of Grouped Query Attention (GQA), a variant of multi-head attention in which groups of query heads share a single key-value head, and a central feature of the Llama 2 70B architecture.
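
A minimal sketch of the idea, in PyTorch with toy shapes (Llama 2 70B itself pairs 64 query heads with 8 key-value heads): each key-value head is shared by a group of query heads, which shrinks the KV cache the GPU must hold without collapsing to a single shared head.

```python
import torch

def grouped_query_attention(q, k, v):
    # q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_q_heads // n_kv_heads          # query heads sharing each KV head
    # Broadcast each KV head across its group of query heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 16, 32)   # 8 query heads (toy size)
k = torch.randn(1, 2, 16, 32)   # 2 KV heads -> groups of 4 query heads each
v = torch.randn(1, 2, 16, 32)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 32])
```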

Evaluations of Llama 2 70B across different input and output sequence lengths demonstrate the throughput achievable on H200. As the output sequence length increases, raw throughput decreases, but the speedup over the A100 grows significantly.

Additionally, software improvements in TensorRT-LLM alone contribute to a 2.4x improvement compared to the previous version running on H200.

Falcon-180B, known for its size and accuracy, previously demanded eight NVIDIA A100 Tensor Core GPUs for execution. However, the latest TensorRT-LLM advancements, incorporating a custom INT4 AWQ (activation-aware weight quantisation) implementation, allow the model to run on a single H200 Tensor Core GPU, which offers 141 GB of HBM3e memory with nearly 5 TB/s of memory bandwidth.
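
AWQ first rescales the most salient weight channels using activation statistics, then stores the weights in 4 bits. As a rough illustration of the storage side only, here is a plain-NumPy group-wise INT4 round-trip; the activation-aware channel rescaling and NVIDIA's fused GPU kernels are deliberately omitted:

```python
import numpy as np

def int4_groupwise_quantize(w, group_size=128):
    # Symmetric group-wise INT4: one floating-point scale per group of weights
    rows, cols = w.shape
    w = w.reshape(rows, cols // group_size, group_size)
    scale = np.abs(w).max(axis=-1, keepdims=True) / 7.0  # map group max to INT4 max (7)
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(16, 256).astype(np.float32)  # toy weight matrix
q, scale = int4_groupwise_quantize(w)
w_hat = dequantize(q, scale).reshape(w.shape)
print(np.abs(w - w_hat).max())  # small round-trip error at 8x fewer bits than FP32
```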

The latest TensorRT-LLM release implements custom kernels for AWQ, performing computations in FP8 precision on NVIDIA Hopper GPUs, utilizing the latest Hopper Tensor Core technology. This enables the entire Falcon-180B model to run efficiently on a single H200 GPU with an impressive inference throughput of up to 800 tokens/second.
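
A back-of-envelope check (counting weights only, and ignoring the KV cache and activations that also consume memory) shows why 4-bit weights are what bring a 180-billion-parameter model within the H200's 141 GB:

```python
params = 180e9                       # Falcon-180B parameter count
fp16_gb = params * 2 / 1e9           # 2 bytes per weight -> 360 GB
int4_gb = params * 0.5 / 1e9         # 4 bits per weight  -> 90 GB
h200_hbm_gb = 141                    # H200 HBM3e capacity
print(f"FP16 weights: {fp16_gb:.0f} GB, fits on one H200: {fp16_gb <= h200_hbm_gb}")
print(f"INT4 weights: {int4_gb:.0f} GB, fits on one H200: {int4_gb <= h200_hbm_gb}")
```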

TensorRT-LLM's custom Multi-Head Attention (MHA) implementation supports GQA, Multi-Query Attention (MQA), and standard MHA, and leverages NVIDIA Tensor Cores during both the context and generation phases, ensuring optimal performance on NVIDIA GPUs.
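
One way to see why a single implementation can cover all three variants: in the grouped_query_attention sketch above, the KV-head count alone selects the behaviour. This usage example reuses that toy function and is an illustration, not TensorRT-LLM's actual kernel interface:

```python
import torch
# Reuses grouped_query_attention and the toy shapes from the GQA sketch above
q = torch.randn(1, 8, 16, 32)             # 8 query heads
for n_kv in (8, 1, 2):                    # 8 -> MHA, 1 -> MQA, 2 -> GQA
    k = torch.randn(1, n_kv, 16, 32)
    v = torch.randn(1, n_kv, 16, 32)
    print(n_kv, grouped_query_attention(q, k, v).shape)  # always (1, 8, 16, 32)
```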

Despite the reduced memory footprint, TensorRT-LLM's AWQ implementation maintains accuracy above 95%, making efficient use of GPU compute resources and reducing operational costs.

These advancements are set to be incorporated into upcoming releases (v0.7 and v0.8) of TensorRT-LLM.
