Today, NVIDIA announced a multi-year collaboration with Microsoft to build one of the most powerful AI supercomputers in the world. The undertaking will pair Microsoft Azure’s advanced supercomputing infrastructure with tens of thousands of NVIDIA GPUs.
The array will comprise NVIDIA’s A100 and H100 GPUs, along with their Quantum-2 400Gb/s InfiniBand networking solution. Most importantly, this will be the first public cloud to include NVIDIA’s advanced AI tech stack, which will enable companies to train and deploy AI at scale.
The collaboration is interesting for a multitude of reasons, but primarily for opening up one of the biggest clusters of compute power to enterprises. This will not only allow organisations to train and deploy AI at a scale that was previously prohibitively expensive, but also let them do so far more efficiently, thanks to optimisations introduced in the latest ‘Hopper’ generation of NVIDIA GPUs.
The Hopper generation utilises a technology known as the ‘Transformer Engine’ to accelerate AI workloads. It does this by using custom Tensor Cores to convert existing FP16 and FP32 operations to FP8 operations on the fly. In addition, the generation brings other optimisations, such as better memory bandwidth utilisation and improved interconnect technology.
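The core idea behind that conversion — rescale values into FP8’s narrow range, then round away precision — can be illustrated with a toy sketch. This is a pure-software simulation of E4M3-style scale-and-round quantization (the function names and values are ours), not the actual Tensor Core hardware path:

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format

def quantize_fp8_e4m3(x, scale):
    """Toy simulation of FP8 E4M3 quantization with a per-tensor scale.
    Real Tensor Cores do this in hardware; this only illustrates the
    scale-then-round idea."""
    scaled = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x * scale))
    if scaled == 0.0:
        return 0.0
    # E4M3 keeps 3 mantissa bits: round to 4 significant binary digits
    exp = math.floor(math.log2(abs(scaled)))
    step = 2.0 ** (exp - 3)
    return round(scaled / step) * step

def dequantize(q, scale):
    return q / scale

# pick a scale so the tensor's largest magnitude lands near FP8's max
values = [0.0012, -0.75, 3.14159]
scale = FP8_E4M3_MAX / max(abs(v) for v in values)
roundtrip = [dequantize(quantize_fp8_e4m3(v, scale), scale) for v in values]
```

The round trip loses a little precision on mid-range values (a few percent), which is why FP8 training in practice relies on careful per-tensor scaling to keep values in the representable range.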
Secondarily, the cluster will leverage Microsoft’s DeepSpeed deep learning optimization software. DeepSpeed is a software suite that allows for scaling and speeding up deep learning training and inference tasks. It allows developers to not only scale to thousands of GPUs, but also enables them to run heavy tasks on resource-constrained GPU systems.
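In practice, those scaling and resource-saving behaviours are driven by a JSON-style configuration. The sketch below is a hypothetical minimal config (the batch size and offload choices are illustrative, not from the article) of the kind passed to `deepspeed.initialize`:

```python
# Hypothetical minimal DeepSpeed configuration sketch. In a real job this
# dict (or an equivalent JSON file) is handed to deepspeed.initialize().
ds_config = {
    "train_batch_size": 256,
    "fp16": {"enabled": True},  # mixed-precision training
    "zero_optimization": {
        "stage": 2,  # ZeRO-2: partition optimizer states and gradients
        # spill optimizer state to CPU RAM on resource-constrained GPUs
        "offload_optimizer": {"device": "cpu"},
    },
}
```

ZeRO’s staged partitioning is what lets the same training script span thousands of GPUs or squeeze onto a single small one — the config, not the model code, decides where state lives.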
DeepSpeed achieves this through training innovations such as ZeRO and parallelism strategies, and backs them up with high-performance custom kernels for inference and compression techniques that reduce system load.

As part of their partnership, Microsoft will work with NVIDIA to further optimise the GPUs to work better with DeepSpeed. They will also research advances in generative AI to train and deploy unsupervised, self-learning models like the Megatron-Turing NLG 530B. Azure’s customer base will gain access to NVIDIA’s full stack of AI workflows and SDKs, which will be optimised to run on Azure.
Finally, access to AI training resources at this scale has never before been opened up to enterprises through a cloud on the level of Microsoft Azure. This is especially timely, considering the current wave of AI research and innovation at companies all over the world.
Even though NVIDIA has an in-house supercomputer, the partnership shows that they have taken notice of the enormous computing demands of modern algorithms.
A study has found that the compute requirements for large-scale AI models have been doubling every 10.7 months between 2016 and 2022. Leveraging Azure’s architecture and scalability will not only allow them to serve a larger number of enterprise clients, but is also likely to drive AI innovation for existing customers.
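To put that doubling rate in perspective, a quick back-of-the-envelope calculation shows what it implies over the six-year window:

```python
# Implied compute growth from a 10.7-month doubling period, 2016-2022
months = (2022 - 2016) * 12      # 72 months
doublings = months / 10.7        # roughly 6.7 doublings
growth_factor = 2 ** doublings   # roughly a 100x increase in compute
```

In other words, a model at the frontier in 2022 demanded on the order of a hundred times the compute of its 2016 counterpart — growth that quickly outruns any single in-house cluster.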
This advancement marks a partnership between two of the biggest companies in the AI space. Microsoft is no stranger to the field, as seen in their partnership with OpenAI and their commitment to developing safe and responsible AI. NVIDIA, on the other hand, has been one of the cornerstones of AI research and development for the past decade, thanks to their powerful GPUs and accompanying tech stack, including CUDA and Tensor Cores.