The deep learning community, especially those working on natural language problems, had a great run in 2019. Top players like Google, NVIDIA and Microsoft set new benchmarks with every release. The models keep getting larger, yet training times, surprisingly, keep coming down.
What really turned heads was NVIDIA’s world record for training a state-of-the-art BERT-Large model in just 47 minutes, a job that usually takes about a week.
This record was set using 1,472 V100 SXM3-32GB 450W GPUs, 8 Mellanox InfiniBand compute adapters per node, and PyTorch with Automatic Mixed Precision to accelerate throughput.
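For readers who want to see what mixed-precision training looks like in practice, here is a minimal sketch of a single training step. NVIDIA’s own BERT scripts at the time relied on the Apex AMP extension; this sketch uses the equivalent torch.cuda.amp API that is now built into PyTorch, and the model, loss and optimiser are stand-ins rather than NVIDIA’s actual training code.

```python
# Minimal mixed-precision training step in PyTorch (sketch only; the model,
# optimiser and loss are placeholders, not NVIDIA's BERT training code).
import torch

model = torch.nn.Linear(1024, 1024).cuda()              # stand-in for BERT-Large
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                    # scales losses to avoid FP16 underflow

def train_step(batch, target):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                      # run eligible ops in FP16
        loss = torch.nn.functional.mse_loss(model(batch), target)
    scaler.scale(loss).backward()                        # backward pass on the scaled loss
    scaler.step(optimizer)                               # unscales gradients, then steps
    scaler.update()                                      # adjusts the loss scale for the next step
    return loss.item()
```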
“Training with large batch sizes is a straightforward way to scale to multiple GPUs but numerical instability can pop up. That’s where our implementations of layer-wise adaptive optimizers such as LAMB can be of help”, said Swetha Mandava, a Senior Deep Learning Engineer at Nvidia.
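To make that concrete, below is a heavily simplified, single-tensor sketch of the LAMB update rule (You et al., 2019) the quote refers to: an Adam-style step rescaled per layer by a “trust ratio” of weight norm to update norm, which is what keeps very large batches numerically stable. The hyperparameters are illustrative, bias correction is omitted, and this is not NVIDIA’s production implementation (a fused version ships in Apex).

```python
# Simplified LAMB-style update for a single parameter tensor (illustrative only;
# omits bias correction and the clamping used in production implementations).
import torch

def lamb_step(param, grad, m, v, lr=2e-3, betas=(0.9, 0.999), eps=1e-6, wd=0.01):
    # Adam-style running estimates of the first and second moments of the gradient.
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    update = m / (v.sqrt() + eps) + wd * param           # Adam direction + decoupled weight decay
    # Layer-wise trust ratio: scale the step so it stays proportional to the weight norm,
    # which is what keeps huge-batch training stable.
    trust_ratio = param.norm() / (update.norm() + eps)
    param.add_(update, alpha=-(lr * trust_ratio.item()))
```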
Timeline of BERT training records:
- Amazon Web Services uses 8 NVIDIA V100 GPUs and reduces training time from several days to slightly over 60 minutes.
- Google AI claims that BERT training time can be reduced from 3 days to just 76 minutes by increasing the batch size to the memory limit of a TPUv3 Pod.
- NVIDIA uses 1,472 NVIDIA V100 GPUs and cuts down the typical training time for BERT-Large from several days to just 53 minutes.
NVIDIA has also trained BERT-Large on a single NVIDIA DGX-2 system in 2.8 days.
How NVIDIA Set a New World Record
BERT-Large demands a massive amount of memory, and each of NVIDIA’s DGX-2H nodes provides 0.5TB of high-bandwidth GPU memory (16 V100s with 32GB each); across the 92 nodes behind the record, that works out to a total of 46TB.
This combination of computing power and high-bandwidth access to large amounts of GPU memory makes the NVIDIA data centre platform optimal for BERT.
NVIDIA’s scaling results demonstrate efficient scaling: the time to train BERT-Large keeps falling as the number of nodes, and with it the number of GPUs, increases.
The developers at NVIDIA observed that scaling up to 8.3 billion parameters, the size of their Megatron-LM model, resulted in a noticeable improvement in accuracy compared to smaller models.
However, the 8.3 billion parameter model begins to overfit after six epochs of training. This can be mitigated by moving to even larger scale problems and datasets.
To tackle the enormous size of these transformer models, NVIDIA opted for model parallelism. They implemented intra-layer parallelism to reduce synchronisation, making a few targeted modifications to existing PyTorch transformer implementations; a simplified sketch of the idea follows below.
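The sketch shows what intra-layer model parallelism does to a single linear layer: each GPU holds only a slice of the weight matrix and computes its share of the output. This is an illustration in the spirit of Megatron-LM’s column-parallel split, not NVIDIA’s actual code; in particular, a real implementation would use autograd-aware collectives so gradients flow through the gather.

```python
# Sketch of intra-layer (tensor) model parallelism: each GPU holds only a slice of
# a linear layer's weight matrix and computes its share of the output.
# Simplified illustration, not the Megatron-LM implementation.
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Splits the output dimension of a linear layer across model-parallel ranks."""
    def __init__(self, in_features, out_features, world_size):
        super().__init__()
        assert out_features % world_size == 0
        # Each rank stores only its shard of the full weight matrix.
        self.shard = nn.Linear(in_features, out_features // world_size, bias=False)

    def forward(self, x, gather=True):
        local_out = self.shard(x)                    # [batch, out_features // world_size]
        if not gather:
            return local_out                         # leave sharded for a following row-parallel layer
        # Gather the shards from all ranks to reconstruct the full activation.
        outs = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(outs, local_out)
        return torch.cat(outs, dim=-1)
```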
Running on a DGX SuperPOD, 64 nodes achieved an 88% scaling efficiency versus 16 nodes when training BERT-Large.
NVIDIA’s hardware also played a crucial role in Microsoft’s latest mega NLP model, Turing NLG.
T-NLG was trained on the same type of data as NVIDIA’s Megatron-LM models and had a maximum learning rate of 1.5×10^-4.
To train large ML models more efficiently with fewer GPUs, Microsoft used DeepSpeed and trained T-NLG on 256 NVIDIA GPUs, compared with the 1,024 NVIDIA GPUs required by Megatron-LM.
Any model with more than a billion parameters, the researchers observe, cannot fit into a single GPU, so the model itself must be parallelised across multiple GPUs.
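As a sketch of how that is wired up on the Microsoft side, the snippet below wraps a stand-in PyTorch model with DeepSpeed, whose ZeRO optimiser partitioning spreads optimiser state across GPUs. The model and configuration values here are illustrative assumptions, not T-NLG’s actual settings; only the 1.5×10^-4 learning rate is taken from the figures above.

```python
# Hedged sketch: wrapping a stand-in PyTorch model with DeepSpeed's ZeRO partitioning.
# The model and config values are illustrative, not Microsoft's T-NLG configuration.
import deepspeed
import torch.nn as nn

model = nn.TransformerEncoderLayer(d_model=1024, nhead=16)    # placeholder model
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},                         # partition optimiser states across GPUs
    "optimizer": {"type": "Adam", "params": {"lr": 1.5e-4}},
}
# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler).
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```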
Is Big Always Better?
Mikhail Belkin and his peers, who pioneered the work on “double descent” that OpenAI later extended, challenged the very popular claim that “bigger models are always better.”
Standard statistical machine learning theory predicts that bigger models should be more prone to overfitting, but Belkin et al., in their seminal paper, discovered that the familiar bias-variance tradeoff derails once model capacity hits the “interpolation threshold”: test error rises, then descends again as models keep growing.
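For reference, the classical picture being upended is the textbook bias-variance decomposition of expected test error (a standard identity, not a formula from the Belkin paper itself):

$$
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\big(f(x) - \mathbb{E}[\hat f(x)]\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}
+ \sigma^2
$$

Classical theory expects the variance term, and hence test error, to keep climbing once a model is big enough to fit the training data exactly; the double descent results show test error coming back down beyond that point.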
A chart from OpenAI’s work on deep double descent illustrates how transformers trained on a language-translation task with no added label noise move towards lower test error as the number of samples increases. However, a peak in test error also appears, because increasing the number of samples shifts the interpolation threshold. This work provided evidence for the existence of the double descent phenomenon across a wide spectrum of models and datasets and drew attention to why overparameterisation is not so straightforward.
Human-like language ability has, for the better part of a century, remained an elusive goal for AI researchers. Models like BERT and GPT-2 (Generative Pretrained Transformer 2) have changed the way we deal with language understanding by making rapid progress on difficult tasks.
These models have also proved to work on massive unlabelled datasets, which has made them a hub of innovation for modern NLP and, by extension, a strong choice for the coming wave of intelligent assistants and conversational AI applications across many use cases.
Platforms like NVIDIA’s, with hardware optimisations combined with software libraries, give developers a seamless end-to-end experience.
According to Juniper Research, AI services powered by natural language understanding are expected to peak in this decade, and chatbots and assistants are anticipated to hit 8 billion within the next four years. With techniques such as NVIDIA’s making it possible to train ever larger models in shorter times, we can safely assume that we are on the verge of another significant NLP breakthrough!