Microsoft has recently introduced Turing Natural Language Generation (T-NLG), at 17 billion parameters the largest language model published to date, which outperformed other state-of-the-art models on a variety of language modeling benchmarks.
T-NLG is a Transformer-based generative language model and is a part of the ongoing Turing project of Microsoft.
Late last year Microsoft announced the launch of Project Turing, which is aimed at bringing AI-powered search to the enterprise.
The new Turing NLG model, according to the original post, can generate words to complete open-ended textual tasks and unfinished sentences. It can also, claims Microsoft, generate direct answers to questions and summaries of input documents.
Overview Of T-NLG
The team behind T-NLG emphasizes that the bigger the model, the better it performs, even with fewer training examples.
Generative models are important for NLP tasks where the goal is to respond as accurately and fluently as humans can in any given situation.
With T-NLG, developers can summarize or answer questions about a personal document or email thread in a more natural way.
The team believes that it is more efficient to train a large centralized multi-task model rather than train a new model for every task individually.
T-NLG has been trained on the same type of data that Nvidia’s Megatron-LM models were trained on and has a maximum learning rate of 1.5×10^-4.
To train large models more efficiently with fewer GPUs, Microsoft made use of DeepSpeed: T-NLG was trained on 256 NVIDIA GPUs, compared with the 1,024 NVIDIA GPUs used for Megatron-LM.
The researchers observe that any model with more than a billion parameters cannot fit on a single GPU, so the model itself must be parallelized across multiple GPUs.
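A back-of-envelope calculation makes the single-GPU limit concrete. Assuming standard mixed-precision Adam training (the accounting used in the ZeRO paper: fp16 weights and gradients, plus fp32 master weights and two fp32 optimizer states), each parameter costs roughly 16 bytes of GPU memory before activations are even counted:

```python
# Rough memory estimate for mixed-precision Adam training.
# Per parameter: fp16 weights (2 B) + fp16 gradients (2 B)
# + fp32 master weights, momentum, and variance (4 B each) = 16 B.
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # = 16

def training_memory_gb(num_params: float) -> float:
    """Approximate GPU memory (GB) needed just for model and optimizer state."""
    return num_params * BYTES_PER_PARAM / 1024**3

print(f"{training_memory_gb(1.0e9):.1f} GB for 1B parameters")   # ~14.9 GB
print(f"{training_memory_gb(17e9):.1f} GB for 17B parameters")   # ~253 GB
```

A 1-billion-parameter model already consumes about 15 GB for model state alone; once activations and temporary buffers are added, it overwhelms even a 32 GB GPU, and a 17-billion-parameter model is far beyond any single device.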
Microsoft has released DeepSpeed as an open-source library for large-model training with improved scale, speed, cost, and usability, unlocking the ability to train 100-billion-parameter models.
DeepSpeed is compatible with PyTorch and has a new parallelized optimizer that greatly reduces the resources needed for model and data parallelism. These features were key in making breakthroughs to create Turing Natural Language Generation (Turing-NLG).
“To train a model with 20 billion parameters, DeepSpeed requires three times fewer resources.”
The resulting T-NLG model has 78 Transformer layers with a hidden size of 4256 and 28 attention heads.
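These published dimensions line up with the 17-billion-parameter figure. Using the standard rough estimate for a decoder-only Transformer, where each layer contributes about 4h² attention weights plus 8h² feed-forward weights (ignoring biases, layer norms, and embeddings):

```python
# Back-of-envelope parameter count for T-NLG from its published dimensions.
# Each Transformer layer has roughly 12 * hidden_size**2 weights:
# 4*h^2 for attention projections + 8*h^2 for the feed-forward block.
layers, hidden = 78, 4256

per_layer = 12 * hidden**2
total = layers * per_layer
print(f"~{total / 1e9:.1f}B parameters")  # ~17.0B
```

The estimate lands at roughly 17 billion, matching the headline size of the model.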
Challenges Of Training Large Language Models
Training billions to trillions of parameters frequently runs up against fundamental hardware limitations:
- A model with more than 1 billion parameters runs out of memory even on GPUs with 32GB of memory. So, data parallelism does not help reduce the memory footprint per device.
- Model parallelism does not scale efficiently due to expensive communication.
- Model parallelism frameworks also frequently require extensive code integration.
For example, NVIDIA's Megatron-LM, with 8.3 billion parameters, scales very well when the model fits across multiple GPUs of a single node, but performance degrades when scaling across nodes.
Empowering The Future Of Search
Microsoft’s Turing project, as discussed earlier, was aimed at enabling large-scale, smart NLP-based search at the enterprise level.
“Our goal is to more plainly satisfy users’ information needs by responding directly to their question.”
T-NLG has the ability to directly answer a question with a complete sentence, which is crucial to offline search. For example, T-NLG can enable AI assistants to respond intelligently when a user asks a question about their personal data, such as emails or Word documents.
The researchers demonstrated this model’s performance on downstream tasks on 100,000 examples of “direct” answer question-passage-answer triples. T-NLG outperformed the LSTM baseline that was trained on multiple epochs of the same data.
State-of-the-art large models such as OpenAI GPT-2 and Google T5 have sizes of 1.5 billion and 11 billion parameters respectively. Microsoft’s ZeRO stage one in DeepSpeed provides system support to run models that are 10 times bigger, up to 100 billion parameters and with fewer resources.
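To give a sense of how ZeRO stage one is switched on in practice, a minimal DeepSpeed configuration file might look like the following. This is an illustrative sketch, not Microsoft's actual training configuration; the batch size is a made-up placeholder, while the learning rate matches the 1.5×10^-4 maximum cited earlier:

```json
{
  "train_batch_size": 512,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 1 },
  "optimizer": {
    "type": "Adam",
    "params": { "lr": 1.5e-4 }
  }
}
```

Stage one of ZeRO partitions only the optimizer states across data-parallel workers, which is where most of the 16-bytes-per-parameter training footprint lives, so memory savings come with minimal extra communication.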
Making large models work with existing solutions requires trade-offs between computation, communication, and development efficiency.
Since it is expensive to collect annotated supervised data, T-NLG’s success could translate into many profitable business applications.