PyTorch Enables Llama 2 & 3 to Run on Smartphones with Zero Code

The update boosts performance for small-batch-size inference, achieving speeds up to 1.94 times faster than the original Triton implementation.


Keeping up with its focus on delivering AI models to edge devices, the PyTorch team announced on 1st May that Llama 2 and 3 models can now run on smartphones without requiring any coding.

The researchers presented TK-GEMM, an optimised Triton FP8 GEMM (General Matrix-Matrix Multiply) kernel that leverages SplitK parallelisation.

This enhancement boosts performance for small-batch-size inference, running up to 1.94 times faster than the original Triton setup, 1.87 times faster than cuBLAS FP8, and 1.71 times faster than cuBLAS FP16 for Llama3-70B inference on NVIDIA H100 GPUs.

SplitK parallelisation creates additional work units along the k dimension of the matrix multiplication: each work unit computes a partial product over a slice of k, and the partial results are then reduced. Decomposing the work this way keeps more of the GPU busy and reduces latency, especially for matrices with small M values, i.e. small batch sizes, as the sketch below illustrates.
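For intuition, here is a minimal NumPy sketch of the SplitK decomposition. It is illustrative only: the name splitk_matmul is hypothetical, and in the actual TK-GEMM Triton kernel the partial products run in parallel across thread blocks rather than in a Python loop.

```python
import numpy as np

def splitk_matmul(A, B, split_k=4):
    """Hypothetical illustration of SplitK: partition the shared k
    dimension into split_k chunks, compute the partial products
    independently (on the GPU these run in parallel), then reduce."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and K % split_k == 0
    chunk = K // split_k
    # Each partial product corresponds to one SplitK work unit.
    partials = [A[:, i * chunk:(i + 1) * chunk] @ B[i * chunk:(i + 1) * chunk, :]
                for i in range(split_k)]
    # On the GPU this reduction is typically done with atomic adds
    # or a separate fix-up pass.
    return sum(partials)

# Small-M case typical of low-batch-size inference: M=1, K=N=4096.
A = np.random.rand(1, 4096).astype(np.float32)
B = np.random.rand(4096, 4096).astype(np.float32)
assert np.allclose(splitk_matmul(A, B), A @ B, rtol=1e-4)
```

With a small M, a conventional tiling along M and N alone leaves most of the GPU idle; splitting along k is what restores enough work units to fill the machine.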

Additionally, leveraging CUDA graphs reduces CPU launch overhead, yielding up to a 6.4x speedup for a single attention layer in Llama3-70B models. Together, these optimisations demonstrate significant performance gains and pave the way for further enhancements in FP8 inference.
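As a rough illustration of the CUDA-graph technique, the sketch below uses PyTorch's public torch.cuda.CUDAGraph API to capture a sequence of kernel launches and replay it with a single CPU-side call. The matmul here merely stands in for the attention-layer kernels; it is not the TK-GEMM kernel itself.

```python
import torch

device = torch.device("cuda")
x = torch.randn(1, 8192, device=device, dtype=torch.float16)
w = torch.randn(8192, 8192, device=device, dtype=torch.float16)

# Warm up on a side stream before capture, as the PyTorch docs require.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = x @ w
torch.cuda.current_stream().wait_stream(s)

# Capture: kernels launched inside the context are recorded, not run.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = x @ w

# Replay the whole captured sequence with a single CPU-side launch,
# refreshing the static input buffer in place first.
x.copy_(torch.randn_like(x))
g.replay()
torch.cuda.synchronize()
```

Because the replay bypasses per-kernel Python and driver launch costs, the benefit is largest exactly where TK-GEMM operates: short kernels at small batch sizes, where launch overhead would otherwise dominate.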

With the ability to run Llama models on mobile devices, developers can now create apps that harness the power of these advanced language models without the need for extensive coding knowledge. Developers can create intelligent virtual assistants, personalised language learning apps, real-time translation tools, and much more.

The updates also come with instructions for running Llama 2 and 3 on both iOS and Android devices. 

The PyTorch team has utilised the new FP8 datatype, introduced jointly by Nvidia, Arm, and Intel as a successor to 16-bit floating point types. FP8 comes in two formats: E4M3 (four exponent bits, three mantissa bits, favouring precision) and E5M2 (five exponent bits, two mantissa bits, favouring dynamic range), both of which provide significant throughput improvements over their predecessors for Transformer networks.
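Recent PyTorch builds expose both formats as tensor dtypes, so the precision/range trade-off can be inspected directly. A minimal sketch, assuming a PyTorch version with FP8 support (torch.float8_e4m3fn and torch.float8_e5m2):

```python
import torch

x = torch.randn(4, 4, dtype=torch.float16)

# E4M3: 4 exponent bits, 3 mantissa bits -- more precision, less range.
x_e4m3 = x.to(torch.float8_e4m3fn)
# E5M2: 5 exponent bits, 2 mantissa bits -- more range, less precision.
x_e5m2 = x.to(torch.float8_e5m2)

# Round-trip back to FP16 to compare the quantisation error of each format.
print((x - x_e4m3.to(torch.float16)).abs().max())
print((x - x_e5m2.to(torch.float16)).abs().max())
```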

They have also identified potential future optimisation paths, such as leveraging the TMA (Tensor Memory Accelerator) hardware unit and improving Tensor Core utilisation. These optimisations could lead to even greater performance gains in the future.
