PyTorch Releases ExecuTorch Alpha for Deploying LLMs on Edge Devices

The cutting-edge tool is designed to deploy LLMs on devices like smartphones and smart glasses.


PyTorch yesterday announced the release of ExecuTorch alpha, a new tool focused on deploying large language models (LLMs) and other large ML models to edge devices. The release, which comes just a few months after the 0.1 preview built in collaboration with partners at Arm, Apple, and Qualcomm Technologies, Inc., aims to stabilise the API surface and improve the installation process.

ExecuTorch alpha brings several key features that allow running LLMs efficiently on mobile devices, which are highly constrained in compute, memory, and power. It supports 4-bit post-training quantisation using GPTQ and provides broad CPU device support through dynamic shape support and new dtypes in XNNPACK.
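For readers new to the stack, the typical workflow lowers a PyTorch model into a portable .pte program that the on-device ExecuTorch runtime can load, optionally delegating supported operators to the XNNPACK CPU backend. The sketch below is illustrative only: module paths such as executorch.exir.to_edge and XnnpackPartitioner follow the alpha-era API and may differ in later releases, and the TinyModel is a placeholder.

```python
import torch
from torch.export import export

# Assumed alpha-era ExecuTorch import paths; they may differ in newer releases.
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner


class TinyModel(torch.nn.Module):
    """Stand-in model; any torch.export-able nn.Module follows the same flow."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))


model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# Capture the graph, lower it to the Edge dialect, delegate supported
# operators to the XNNPACK CPU backend, and serialise a .pte program
# that the on-device ExecuTorch runtime can load.
exported = export(model, example_inputs)
edge_program = to_edge(exported).to_backend(XnnpackPartitioner())
executorch_program = edge_program.to_executorch()

with open("tiny_model_xnnpack.pte", "wb") as f:
    f.write(executorch_program.buffer)
```

The resulting .pte file is what the C++ runtime loads and executes on the phone or headset itself.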

These improvements enable running models like Llama 2 7B, with early support for Llama 3 8B, on a range of edge devices, including the iPhone 15 Pro, iPhone 15 Pro Max, and Samsung Galaxy S22, S23, and S24 phones.

The release also expands the list of supported models across NLP, vision, and speech, with traditional models expected to function seamlessly out of the box. The ExecuTorch SDK has been enhanced with better debugging and profiling tools, allowing developers to map from operator nodes back to original Python source code for efficient anomaly resolution and performance tuning.
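The profiling flow is roughly: bundle debug metadata (an ETRecord) at export time, collect an ETDump from the on-device runtime, and then correlate the two with the SDK's Inspector. The snippet below is a rough sketch; the executorch.sdk import path and constructor arguments follow the alpha-era SDK documentation and may have changed, and the file names are placeholders.

```python
# Assumed alpha-era SDK import path; the module may have moved in later releases.
from executorch.sdk import Inspector

# Correlate runtime profiling data (an ETDump collected on device) with the
# export-time debug metadata (an ETRecord) so operator-level timings can be
# traced back to the originating Python source. File names are placeholders.
inspector = Inspector(
    etdump_path="etdump.etdp",
    etrecord="etrecord.bin",
)

# Print a per-operator table of latencies and source-level debug information.
inspector.print_data_tabular()
```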

PyTorch’s collaborations with partners such as Arm, Apple, Qualcomm Technologies, Google, and MediaTek have been crucial in bringing ExecuTorch to fruition. The framework has already seen production usage, with Meta using it for hand tracking on Meta Quest 3, various models on Ray-Ban Meta Smart Glasses, and integration with Instagram and other Meta products.

PyTorch also recently released version 2.3, which introduces several features and improvements for the performance and usability of large language models and sparse inference. The release enables tensor manipulations across GPUs and hosts, integrating with FSDP (Fully Sharded Data Parallel) for efficient 2D parallelism.
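As a rough illustration of that 2D parallelism workflow, the sketch below combines the tensor-parallel API with FSDP over a 2D device mesh. The mesh dimension names, GPU counts, and the toy FeedForward module are assumptions made for the example, not code from the release notes.

```python
# Run under torchrun, e.g.: torchrun --nproc_per_node=8 tp_fsdp_sketch.py
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class FeedForward(nn.Module):
    """Toy MLP block standing in for a transformer feed-forward layer."""

    def __init__(self, dim=1024, hidden=4096):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))


# 2D device mesh over 8 GPUs: the outer "dp" dimension is used for FSDP,
# the inner "tp" dimension for tensor parallelism.
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

model = FeedForward().cuda()

# Shard the two linear layers column- and row-wise across the "tp" sub-mesh.
model = parallelize_module(
    model,
    mesh_2d["tp"],
    {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
)

# Wrap the tensor-parallel model with FSDP over the "dp" sub-mesh,
# giving data parallelism across groups of tensor-parallel GPUs.
model = FSDP(model, device_mesh=mesh_2d["dp"], use_orig_params=True)
```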

K L Krithika

K L Krithika is a tech journalist at AIM. Apart from writing tech news, she enjoys reading sci-fi and pondering impossible technologies, trying not to confuse them with reality.