PyTorch Enables Llama 2 & 3 to Run on Smartphones with Zero Code

The update boosts performance for small-batch-size inference, achieving speeds up to 1.94 times faster than the original Triton implementation.


Keeping up with its focus on delivering AI models to edge devices, the PyTorch team announced on 1st May that Llama 2 and 3 models can now run on smartphones without requiring any coding.

The researchers presented TK-GEMM, an optimised Triton FP8 GEMM (General Matrix-Matrix Multiply) kernel that leverages SplitK parallelisation.

This enhancement boosts performance for small-batch-size inference, running up to 1.94 times faster than the original Triton setup, 1.87 times faster than cuBLAS FP8, and 1.71 times faster than cuBLAS FP16 for Llama3-70B inference on NVIDIA H100 GPUs.

SplitK parallelisation creates additional work units along the k dimension of the matrix multiplication: each work unit computes a partial product over a slice of k, and the partial results are then reduced. Decomposing the work this way keeps more of the GPU busy and reduces latency, especially for matrices with small M values, i.e. small batch sizes, as the sketch below illustrates.
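For intuition, here is a minimal NumPy sketch of the SplitK decomposition. It is illustrative only: the name splitk_matmul is hypothetical, and in the actual TK-GEMM Triton kernel the partial products run in parallel across thread blocks rather than in a Python loop.

```python
import numpy as np

def splitk_matmul(A, B, split_k=4):
    """Hypothetical illustration of SplitK: partition the shared k
    dimension into split_k chunks, compute the partial products
    independently (on the GPU these run in parallel), then reduce."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and K % split_k == 0
    chunk = K // split_k
    # Each partial product corresponds to one SplitK work unit.
    partials = [A[:, i * chunk:(i + 1) * chunk] @ B[i * chunk:(i + 1) * chunk, :]
                for i in range(split_k)]
    # On the GPU this reduction is typically done with atomic adds
    # or a separate fix-up pass.
    return sum(partials)

# Small-M case typical of low-batch-size inference: M=1, K=N=4096.
A = np.random.rand(1, 4096).astype(np.float32)
B = np.random.rand(4096, 4096).astype(np.float32)
assert np.allclose(splitk_matmul(A, B), A @ B, rtol=1e-4)
```

With a small M, a conventional tiling along M and N alone leaves most of the GPU idle; splitting along k is what restores enough work units to fill the machine.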

Additionally, leveraging CUDA graphs reduces CPU launch overhead, yielding up to a 6.4x speedup for a single attention layer in Llama3-70B models. Together, these optimisations demonstrate significant performance gains and pave the way for further enhancements in FP8 inference.
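As a rough illustration of the CUDA-graph technique, the sketch below uses PyTorch's public torch.cuda.CUDAGraph API to capture a sequence of kernel launches and replay it with a single CPU-side call. The matmul here merely stands in for the attention-layer kernels; it is not the TK-GEMM kernel itself.

```python
import torch

device = torch.device("cuda")
x = torch.randn(1, 8192, device=device, dtype=torch.float16)
w = torch.randn(8192, 8192, device=device, dtype=torch.float16)

# Warm up on a side stream before capture, as the PyTorch docs require.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = x @ w
torch.cuda.current_stream().wait_stream(s)

# Capture: kernels launched inside the context are recorded, not run.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = x @ w

# Replay the whole captured sequence with a single CPU-side launch,
# refreshing the static input buffer in place first.
x.copy_(torch.randn_like(x))
g.replay()
torch.cuda.synchronize()
```

Because the replay bypasses per-kernel Python and driver launch costs, the benefit is largest exactly where TK-GEMM operates: short kernels at small batch sizes, where launch overhead would otherwise dominate.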

With the ability to run Llama models on mobile devices, developers can now create apps that harness the power of these advanced language models without the need for extensive coding knowledge. Developers can create intelligent virtual assistants, personalised language learning apps, real-time translation tools, and much more.

The updates also come with instructions for running Llama 2 and 3 on both iOS and Android devices. 

The PyTorch team has utilised the new FP8 datatype, introduced jointly by Nvidia, Arm, and Intel as a successor to 16-bit floating point types. FP8 comes in two formats: E4M3 (four exponent bits, three mantissa bits, favouring precision) and E5M2 (five exponent bits, two mantissa bits, favouring dynamic range), both of which provide significant throughput improvements over their predecessors for Transformer networks.
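Recent PyTorch builds expose both formats as tensor dtypes, so the precision/range trade-off can be inspected directly. A minimal sketch, assuming a PyTorch version with FP8 support (torch.float8_e4m3fn and torch.float8_e5m2):

```python
import torch

x = torch.randn(4, 4, dtype=torch.float16)

# E4M3: 4 exponent bits, 3 mantissa bits -- more precision, less range.
x_e4m3 = x.to(torch.float8_e4m3fn)
# E5M2: 5 exponent bits, 2 mantissa bits -- more range, less precision.
x_e5m2 = x.to(torch.float8_e5m2)

# Round-trip back to FP16 to compare the quantisation error of each format.
print((x - x_e4m3.to(torch.float16)).abs().max())
print((x - x_e5m2.to(torch.float16)).abs().max())
```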

They have also identified potential future optimisation paths, such as leveraging the TMA (Tensor Memory Accelerator) hardware unit and improving Tensor Core utilisation. These optimisations could lead to even greater performance gains in the future.
