Last updated February 12, 2020

Microsoft’s Turing NLG: The Largest Language Model Ever Is Released


Published on February 12, 2020

by Ram Sagar

Microsoft has recently introduced Turing Natural Language Generation (T-NLG), the largest model ever published at 17 billion parameters, and one which outperformed other state-of-the-art models on a variety of language modeling benchmarks.

T-NLG is a Transformer-based generative language model and is part of Microsoft's ongoing Project Turing.

Late last year, Microsoft announced the launch of Project Turing, which aims to enable AI-powered search for the enterprise.

The new Turing NLG model, according to the original post, can generate words to complete open-ended textual tasks and unfinished sentences. It can also, claims Microsoft, generate direct answers to questions and summaries of input documents.

Overview Of T-NLG


The team behind T-NLG emphasizes the notion that the bigger the model, the better it performs with fewer training examples.

Generative models are important for NLP tasks where the goal is to respond as accurately and fluently as humans can in any given situation.

With T-NLG, developers can summarize or answer questions about a personal document or email thread in a more natural way.

The team believes that it is more efficient to train a large centralized multi-task model rather than train a new model for every task individually.

T-NLG was trained on the same type of data as Nvidia's Megatron-LM models, with a maximum learning rate of 1.5×10^-4.

To train large models more efficiently with fewer GPUs, Microsoft used DeepSpeed, training on 256 NVIDIA GPUs compared with Megatron-LM's 1024 NVIDIA GPUs.

The researchers observe that any model with more than a billion parameters cannot fit into a single GPU, so the model itself must be parallelized across multiple GPUs.
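The memory pressure behind this observation can be sketched with some rough arithmetic. The snippet below is an illustration, not Microsoft's own accounting: it assumes mixed-precision training with Adam, where each parameter costs about 16 bytes of model state (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state). The function name and the 32 GB GPU size are chosen for the example.

```python
# Assumption: mixed-precision Adam training, roughly 16 bytes of model state
# per parameter (2 fp16 weights + 2 fp16 gradients + 12 fp32 optimizer state).
BYTES_PER_PARAM = 2 + 2 + 12
GPU_MEMORY_GB = 32  # illustrative single-GPU capacity


def model_state_gb(num_params: int) -> float:
    """Memory for weights, gradients and optimizer state, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM / 1e9


for name, params in [("1B model", 1_000_000_000), ("T-NLG (17B)", 17_000_000_000)]:
    gb = model_state_gb(params)
    print(f"{name}: {gb:.0f} GB of model state, "
          f"{gb / GPU_MEMORY_GB:.1f}x the capacity of a {GPU_MEMORY_GB} GB GPU")
```

Under these assumptions a 1-billion-parameter model already spends half of a 32 GB GPU on model state, before activations and temporary buffers are counted, while a 17-billion-parameter model needs several GPUs' worth of memory for model state alone.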

Microsoft has released DeepSpeed as an open-source library for large model training at improved scale, speed, cost, and usability, unlocking the ability to train 100-billion-parameter models.


DeepSpeed is compatible with PyTorch and has a new parallelized optimizer that greatly reduces the resources needed for model and data parallelism. These features were key in making breakthroughs to create Turing Natural Language Generation (Turing-NLG).
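As a rough illustration of what enabling these features can look like, DeepSpeed is driven by a JSON configuration. The sketch below builds such a configuration as a Python dictionary. The learning rate is the maximum quoted in the article; the batch size is a placeholder chosen for the example, not a setting reported here, and this should not be read as the configuration Microsoft actually used for T-NLG.

```python
import json

# A minimal sketch of a DeepSpeed-style configuration.
# Assumptions: the batch size is a placeholder; only the learning rate
# (1.5e-4) comes from the article.
ds_config = {
    "train_batch_size": 512,                 # placeholder value
    "fp16": {"enabled": True},               # mixed-precision training
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1.5e-4},            # maximum learning rate quoted above
    },
    "zero_optimization": {"stage": 1},       # partition optimizer state (ZeRO stage one)
}

# Serialize to the JSON form that a training script would hand to DeepSpeed.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

A PyTorch training script would typically pass a configuration like this to DeepSpeed's `deepspeed.initialize` entry point together with the model; that step is omitted here because it requires DeepSpeed and GPUs to run.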

“To train a model with 20 billion parameters, DeepSpeed requires three times fewer resources.”

The resulting T-NLG model has 78 Transformer layers with a hidden size of 4256 and 28 attention heads.
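These figures can be checked against the 17-billion-parameter headline with a back-of-the-envelope count. The sketch below assumes a standard GPT-style block, with four hidden-by-hidden attention projections and a feed-forward layer that expands to four times the hidden size. It ignores embeddings, biases and layer norms. The 4x expansion is an assumption, since the article does not give the feed-forward width.

```python
def approx_block_params(num_layers: int, hidden_size: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of the Transformer blocks only.

    Assumes a GPT-style block; embeddings, biases and layer norms are ignored.
    """
    attention = 4 * hidden_size ** 2           # Q, K, V and output projections
    mlp = 2 * mlp_ratio * hidden_size ** 2     # up-projection and down-projection
    return num_layers * (attention + mlp)


LAYERS, HIDDEN, HEADS = 78, 4256, 28           # figures quoted in the article

total = approx_block_params(LAYERS, HIDDEN)
head_dim = HIDDEN // HEADS                     # width of each attention head

print(f"approx. block parameters: {total / 1e9:.2f} billion")
print(f"per-head dimension: {head_dim}")
```

Under these assumptions the blocks alone account for just under 17 billion weights, which is consistent with the reported model size once embeddings are added.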

How to train a model with 10^11 parameters without running out of GPU memory?
Use DeepSpeed from Microsoft Research!
It's PyTorch compatible.
It partitions the network onto multiple processors automatically and efficiently. https://t.co/tudQvkqZtX
— Yann LeCun (@ylecun) February 10, 2020

Challenges Of Training Large Language Models

Training billions to trillions of parameters frequently runs up against fundamental hardware limitations:

A model with more than 1 billion parameters runs out of memory even on GPUs with 32GB of memory. Data parallelism does not help, because it replicates the full model on every device and so does not reduce the memory footprint per device.

Model parallelism does not scale efficiently due to expensive communication.

Also, model parallelism frameworks frequently require extensive code integration.

For example, NVIDIA's Megatron-LM, with 8.3 billion parameters, scales very well when the model fits across the multiple GPUs of a single node, but performance degrades when scaling across nodes.

Empowering The Future Of Search

Microsoft’s Project Turing, as discussed earlier, is aimed at enabling large-scale, smart, NLP-based search at the enterprise level.

“Our goal is to more plainly satisfy users’ information needs by responding directly to their question.”

T-NLG can answer a question directly with a complete sentence, which is crucial to offline search. For example, T-NLG can enable AI assistants to respond intelligently when a user asks a question about their personal data, such as emails or Word documents.

The researchers demonstrated the model’s performance on a downstream task using 100,000 examples of “direct” answer question-passage-answer triples. T-NLG outperformed an LSTM baseline that was trained on multiple epochs of the same data.

State-of-the-art large models such as OpenAI's GPT-2 and Google's T5 have 1.5 billion and 11 billion parameters, respectively. ZeRO stage one in Microsoft’s DeepSpeed provides system support to run models that are 10 times bigger, up to 100 billion parameters, with fewer resources.
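The saving from ZeRO stage one can be sketched numerically. Stage one partitions the optimizer state across the data-parallel devices, while fp16 weights and gradients stay replicated. The example below assumes mixed-precision Adam (2 + 2 + 12 bytes per parameter) and uses an illustrative 7.5-billion-parameter model on 64 devices; neither figure comes from the article, and the function name is hypothetical.

```python
def per_device_model_state_gb(num_params: float, num_devices: int,
                              zero_stage_one: bool) -> float:
    """Per-device memory for model state, in GB (1e9 bytes).

    Assumes mixed-precision Adam: 2 bytes of fp16 weights, 2 bytes of fp16
    gradients and 12 bytes of fp32 optimizer state per parameter.
    """
    weights_and_grads = (2 + 2) * num_params   # replicated on every device
    optimizer_state = 12 * num_params
    if zero_stage_one:
        optimizer_state /= num_devices         # each device holds one shard
    return (weights_and_grads + optimizer_state) / 1e9


PARAMS, DEVICES = 7.5e9, 64                    # illustrative values, not from the article

baseline = per_device_model_state_gb(PARAMS, DEVICES, zero_stage_one=False)
stage_one = per_device_model_state_gb(PARAMS, DEVICES, zero_stage_one=True)

print(f"plain data parallelism: {baseline:.1f} GB per device")
print(f"ZeRO stage one:         {stage_one:.1f} GB per device")
```

In this sketch, partitioning only the optimizer state cuts the per-device footprint to roughly a quarter of the plain data-parallel figure, because the optimizer state dominates the 16 bytes per parameter.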

Making large models work with existing solutions means making trade-offs among computation, communication, and development efficiency.

Since it is expensive to collect annotated supervised data, T-NLG’s success could prove profitable for many businesses.


Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.
