Meta is Training Llama 3 on 24k NVIDIA H100 Clusters

Meta’s goal is to further expand its infrastructure footprint, encompassing 350,000 NVIDIA H100s.


Meta has unveiled details of its cutting-edge hardware infrastructure, built specifically for AI training and, as Yann LeCun pointed out, used to train Llama 3. The company shared insights into its two 24,576-GPU data-centre-scale clusters, which support current and forthcoming AI models, including Llama 3, the successor to Llama 2.

Representing a significant investment in AI hardware, Meta’s clusters underscore the pivotal role of infrastructure in shaping the future of AI. These clusters are designed to power Meta’s long-term vision of creating AGI in an open and responsible manner, aiming for widespread accessibility.

In the latest development, Meta has deployed two variants of its 24,576-GPU clusters, each equipped with distinct network fabric solutions. One cluster utilises a remote direct memory access (RDMA) over converged Ethernet (RoCE) network fabric, while the other features an NVIDIA Quantum2 InfiniBand fabric. Both solutions boast 400 Gbps endpoints, enabling seamless interconnectivity for large-scale training tasks.
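To put those fabric numbers in perspective, here is a back-of-the-envelope sketch using only the figures quoted above, and assuming one 400 Gbps endpoint per GPU (the article does not state the exact endpoint-to-GPU mapping):

```python
# Rough scale of one 24,576-GPU cluster's network fabric.
# Assumption: one 400 Gbps endpoint per GPU.
gpus = 24_576
endpoint_gbps = 400

aggregate_tbps = gpus * endpoint_gbps / 1_000  # Gbps -> Tbps
print(f"Aggregate endpoint bandwidth: {aggregate_tbps:,.1f} Tbps")
```

Nearly ten petabits per second of aggregate endpoint bandwidth per cluster is why the co-design work on RoCE and InfiniBand described below matters.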

Notably, the company’s AI Research SuperCluster (RSC), introduced in 2022, featuring 16,000 NVIDIA A100 GPUs, has been pivotal in advancing open and responsible AI research, facilitating the development of advanced AI models such as Llama and Llama 2.

Through meticulous network, software, and model architecture co-design, Meta has successfully harnessed the capabilities of both RoCE and InfiniBand clusters, mitigating network bottlenecks in large-scale AI workloads. This includes ongoing training sessions of Llama 3 on Meta’s RoCE cluster, demonstrating the effectiveness of the infrastructure in supporting advanced AI training tasks.

Looking ahead to the end of 2024, Meta aims to further expand its infrastructure footprint to include 350,000 NVIDIA H100s. This expansion is part of a broader compute portfolio that, counting other accelerators, is intended to deliver computational capability equivalent to nearly 600,000 H100s.
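The two headline figures above imply a split between H100s and other accelerators; a minimal sketch of that arithmetic, using only the numbers Meta stated:

```python
# Meta's stated end-of-2024 targets.
h100_count = 350_000            # planned NVIDIA H100 GPUs
h100_equivalent_total = 600_000  # total compute, in H100-equivalents

non_h100_equiv = h100_equivalent_total - h100_count
h100_share = h100_count / h100_equivalent_total
print(f"Other accelerators: ~{non_h100_equiv:,} H100-equivalents "
      f"(H100s are {h100_share:.0%} of the stated total)")
```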


Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.