Last updated January 28, 2022

Behind Meta’s claim of building world’s fastest AI Supercomputer



Published on January 28, 2022

by Sreejani Bhattacharyya

Meta has unveiled the AI Research SuperCluster (RSC), calling it one of the fastest AI supercomputers currently running in the world. Meta says RSC will help its researchers build better AI models that work across hundreds of languages and analyse text, images and video together.

Mark Zuckerberg, while introducing RSC, said, “Meta has developed what we believe is the world’s fastest AI supercomputer. We’re calling it RSC for AI Research SuperCluster. The experiences we’re building for the metaverse require enormous compute power (quintillions of operations/second!) and RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages, and more.”

https://twitter.com/MetaAI/status/1485658757245947914

RSC to play a key role in Metaverse

Realising the full benefits of self-supervised learning and transformer-based models requires training increasingly large, complex and adaptable models. Speech recognition has to work effectively even in challenging scenarios with a lot of background noise. NLP has to understand more languages and dialects.

Meta said that RSC can more quickly train models that use multimodal signals to determine whether an action, sound or image is harmful or benign. It added that in its next phase RSC will grow bigger and more capable as the groundwork for the metaverse is built. Meta’s researchers have already started using RSC to train large models in NLP and computer vision.

Research infrastructure from NVIDIA

Meta has collaborated with NVIDIA to build the AI Research SuperCluster. It uses 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 NVIDIA A100 GPUs linked on an NVIDIA Quantum 200Gb/s InfiniBand network to deliver 1,895 petaflops of TF32 performance. Penguin Computing is the NVIDIA Partner Network delivery partner for RSC.
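
As a rough sanity check on these figures, the short Python sketch below multiplies out the node count and derives the per-GPU throughput implied by the quoted total. It assumes eight A100 GPUs per DGX A100 system (NVIDIA's standard configuration); the variable names are illustrative only, and the result is simple arithmetic, not a benchmark.

```python
# Back-of-the-envelope check of the cluster figures quoted above.
GPUS_PER_DGX_A100 = 8  # assumption: standard DGX A100 configuration

dgx_nodes = 760
total_gpus = dgx_nodes * GPUS_PER_DGX_A100

quoted_tf32_petaflops = 1_895
# 1 petaflop = 1,000 teraflops
implied_teraflops_per_gpu = quoted_tf32_petaflops * 1_000 / total_gpus

print(total_gpus)                           # 6080
print(round(implied_teraflops_per_gpu, 1))  # 311.7
```

The implied figure of about 312 teraflops per GPU is in line with the peak TF32 number NVIDIA lists for an A100 with structured sparsity, which suggests the 1,895-petaflop total is a theoretical peak, not a measured result.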

Penguin also provided managed services and AI-optimised infrastructure for Meta, consisting of 46 petabytes of cache storage built on its Altus systems. Pure Storage FlashBlade and FlashArray//C provide the scalable all-flash storage needed to power RSC.


This is the second time Meta has chosen NVIDIA as the base for its research infrastructure. In 2017, Meta built the first generation of its AI research infrastructure with 22,000 NVIDIA V100 Tensor Core GPUs. That system could handle 35,000 AI training jobs a day.

Meta’s early benchmarks show that RSC can train large NLP models three times faster and run computer vision jobs 20 times faster than the previous system. Later this year, in the second phase, RSC will expand to 16,000 GPUs. Meta expects it to then deliver five exaflops of mixed-precision AI performance.
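
To put the phase-two target in perspective, the sketch below compares the planned 16,000 GPUs with the 6,080 in the current system and works out the per-GPU throughput that five exaflops would imply. The variable names are illustrative, and the calculation assumes the projected total is spread evenly across all GPUs.

```python
# Rough scaling arithmetic for RSC's planned second phase.
phase1_gpus = 6_080
phase2_gpus = 16_000
phase2_exaflops = 5  # Meta's projected mixed-precision performance

gpu_growth = phase2_gpus / phase1_gpus
# 1 exaflop = 1,000,000 teraflops
implied_teraflops_per_gpu = phase2_exaflops * 1_000_000 / phase2_gpus

print(round(gpu_growth, 2))       # 2.63
print(implied_teraflops_per_gpu)  # 312.5
```

Note that the five-exaflop projection is quoted for mixed precision while the current 1,895-petaflop figure is quoted for TF32, so the two totals are not a like-for-like comparison.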

Privacy and security

Meta says that RSC has been built with privacy and security as prime focus areas:

RSC is isolated from the larger internet. It has no direct inbound or outbound connections, and traffic flows only from Meta’s production data centres.
The entire data path from the storage systems to the GPUs is end-to-end encrypted. Meta claims it has the tools and processes needed to verify that these requirements are met at all times.
Before data is imported to RSC, it goes through a privacy review process to confirm it has been correctly anonymised. It is then encrypted before it can be used to train AI models. The decryption keys are deleted regularly so that older data is no longer accessible.
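
The idea behind deleting decryption keys can be illustrated with a small Python sketch. This is not Meta's implementation, whose details have not been published: the `KeyStore` class, the period labels and the toy XOR cipher are all hypothetical, and the cipher is for illustration only and must not be used to protect real data. The point is simply that once a period's key is gone, the ciphertext from that period can no longer be read.

```python
import hashlib
import secrets


def _keystream(key, length):
    """Toy keystream from SHA-256(key + counter). Illustration only, NOT secure."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_cipher(key, data):
    """XOR data with the keystream; applying it twice restores the input."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


class KeyStore:
    """Hypothetical store holding one random key per time period."""

    def __init__(self):
        self._keys = {}

    def create(self, period):
        self._keys[period] = secrets.token_bytes(32)
        return self._keys[period]

    def get(self, period):
        return self._keys[period]  # raises KeyError once the key is deleted

    def delete(self, period):
        del self._keys[period]


store = KeyStore()
record = b"anonymised training record"
ciphertext = xor_cipher(store.create("2022-01"), record)

# While the key exists, the data can be decrypted for training.
recovered = xor_cipher(store.get("2022-01"), ciphertext)

# After the scheduled deletion, the same ciphertext is unreadable.
store.delete("2022-01")
try:
    store.get("2022-01")
    recoverable = True
except KeyError:
    recoverable = False
```

In a real system the keys would live in a dedicated key management service and the encryption would use a vetted algorithm, but the access pattern is the same: data retention is enforced by key retention.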
