OpenAI researchers released a paper describing the development of GPT-3, a state-of-the-art language model made up of 175 billion parameters.
The previous OpenAI GPT model, GPT-2, had 1.5 billion parameters and was the largest model at the time of its release. It was soon eclipsed by NVIDIA’s Megatron, with 8 billion parameters, and then by Microsoft’s Turing NLG, with 17 billion parameters. Now OpenAI turns the tables by releasing a model that is roughly 10x larger than Turing NLG.
Current NLP systems still largely struggle to learn from a few examples. With GPT-3, the researchers show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.
Natural language processing tasks range from generating news articles to language translation and answering standardised test questions.
The researchers trained 8 different sizes of model ranging from 125 million parameters to 175 billion parameters, with the last being GPT-3.
How GPT-3 Pipped Other Models
For GPT-3, the OpenAI team used the same model and architecture as GPT-2, including the modified initialisation, pre-normalisation, and reversible tokenisation. The one exception is that GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the transformer.
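To make the difference between dense and locally banded attention concrete, here is a minimal sketch in plain Python. It builds the boolean masks that decide which earlier positions each token may attend to. The band width, the sequence length, and the choice of even layers as dense are illustrative assumptions, not values from the paper.

```python
def dense_causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (any j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def banded_causal_mask(n, bandwidth):
    """Each position attends only to itself and the previous bandwidth - 1 positions."""
    return [[i - bandwidth < j <= i for j in range(n)] for i in range(n)]


def layer_masks(num_layers, n, bandwidth):
    """Alternate the two patterns: even layers dense, odd layers banded (an assumption)."""
    return [
        dense_causal_mask(n) if layer % 2 == 0 else banded_causal_mask(n, bandwidth)
        for layer in range(num_layers)
    ]


def show(mask):
    """Render a mask as text: 'x' where attention is allowed, '.' where it is blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


masks = layer_masks(num_layers=2, n=6, bandwidth=3)
print(show(masks[0]))  # dense layer: full lower triangle
print()
print(show(masks[1]))  # banded layer: a diagonal band of width 3
```

The banded layers touch far fewer position pairs than the dense ones, which is the usual motivation for mixing the two patterns in very long or very large models.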
The researchers state that larger models make increasingly efficient use of in-context information. The paper’s “in-context learning curves” are steeper for larger models, which shows an improved ability to learn from contextual information.
For training, the researchers used a combination of model parallelism within each matrix multiply and model parallelism across the layers of the network.
GPT-3 was trained on V100 GPUs on part of a high-bandwidth cluster provided by Microsoft.
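As a rough illustration of model parallelism within a matrix multiply, the sketch below splits a weight matrix column-wise into shards, multiplies each shard separately, and concatenates the results. Parallelism across layers is pictured by passing the output from one layer's shards to the next layer's. Everything runs in a single process here: the "devices" are hypothetical, non-linearities are omitted, and none of the function names come from OpenAI's code.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a matrix w (k rows, m columns)."""
    return [sum(x[r] * w[r][c] for r in range(len(w))) for c in range(len(w[0]))]


def split_columns(w, num_shards):
    """Split w column-wise into contiguous blocks, one per hypothetical device.

    Assumes the number of columns is divisible by num_shards.
    """
    step = len(w[0]) // num_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(num_shards)]


def sharded_matmul(x, w, num_shards):
    """Each shard computes its slice of the output; slices are then concatenated."""
    out = []
    for shard in split_columns(w, num_shards):
        out.extend(matmul(x, shard))  # on real hardware these would run concurrently
    return out


def forward_across_layers(x, layers, num_shards):
    """Each layer could live on its own group of devices; here they run in sequence."""
    for w in layers:
        x = sharded_matmul(x, w, num_shards)
    return x


x = [1, 2]
w1 = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(sharded_matmul(x, w1, num_shards=2) == matmul(x, w1))  # sharding leaves the result unchanged
```

The point of the split is that no single device has to hold the whole weight matrix, which matters once a model no longer fits in one GPU's memory.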
Evaluation of GPT-3 is done under 3 conditions:
- few-shot learning
- one-shot learning
- zero-shot learning
GPT-3 achieved promising results in the zero-shot and one-shot settings, and in the few-shot setting it occasionally surpassed state-of-the-art models.
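The three evaluation conditions differ only in how many worked demonstrations are placed in the prompt ahead of the actual query. In all three, the model's weights are left unchanged. The sketch below shows one way such prompts could be assembled. The `build_prompt` helper and the `=>` separator are illustrative assumptions, not the paper's exact format.

```python
def build_prompt(task_description, demonstrations, query, k):
    """Build a prompt containing k demonstrations.

    k = 0 gives a zero-shot prompt, k = 1 one-shot, and k > 1 few-shot.
    """
    lines = [task_description]
    for source, target in demonstrations[:k]:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is expected to continue from here
    return "\n".join(lines)


task = "Translate English to French:"
demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

for k, name in [(0, "zero-shot"), (1, "one-shot"), (2, "few-shot")]:
    print(f"--- {name} ---")
    print(build_prompt(task, demos, "cheese", k))
```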
GPT-3 showed strong performance on translation, question-answering, and cloze tasks, as well as on unscrambling words and performing 3-digit arithmetic. The researchers claim that GPT-3 can even generate news articles that human evaluators have difficulty distinguishing from articles written by humans.
GPT-3 is an incredibly large model, and one cannot expect to build something like it without substantial computational resources. However, the researchers note that such models can be efficient once trained: even with the full GPT-3 model, generating 100 pages of content can cost only a few cents in energy.
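The "few cents" claim can be checked with back-of-the-envelope arithmetic. The paper puts the energy for generating 100 pages with the full model on the order of 0.4 kWh. The electricity price below is a hypothetical round number chosen only for illustration.

```python
ENERGY_KWH_PER_100_PAGES = 0.4    # order-of-magnitude figure reported in the GPT-3 paper
ASSUMED_PRICE_USD_PER_KWH = 0.10  # hypothetical electricity price, not from the paper


def generation_energy_cost_usd(pages, price_per_kwh=ASSUMED_PRICE_USD_PER_KWH):
    """Estimate the energy cost in US dollars of generating the given number of pages."""
    return pages / 100 * ENERGY_KWH_PER_100_PAGES * price_per_kwh


cost = generation_energy_cost_usd(100)
print(f"Estimated cost for 100 pages: about {cost * 100:.0f} cents")
```

This covers inference only. It says nothing about the far larger energy cost of training the model in the first place.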
Where Can This Go Wrong
“GPT-3 has the potential to advance both the beneficial and harmful applications of language models.” - OpenAI researchers
In an unprecedented approach, the researchers go into detail about the harmful effects of GPT-3 in their paper. Because GPT-3 generates high-quality text, it can be difficult to distinguish synthetic text from human-written text, so the authors warn that language models can be misused. They admit that malicious uses can be difficult to anticipate, because language models can be repurposed in a very different environment or for a different purpose than the researchers intended.
They list the following misuses:
- Spam and phishing
- Fraudulent academic essay writing
- Abuse of legal and governmental processes
- Social engineering pretexting
Since GPT-3 was trained on a very large corpus of text scraped from the internet, the researchers had an opportunity to examine how sentiment about race, religion, and other attributes shows up in the text the model generates. For example, for Islam they found that words such as violent, terrorism and terrorist co-occurred at a greater rate than with other religions.
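A co-occurrence analysis of this kind can be sketched as follows: collect generated samples, then measure how often a given word appears in samples that mention a target term. This is a simplified illustration and not the paper's exact procedure. The sample sentences and the placeholder terms "alpha" and "beta" are invented for the example.

```python
def tokens(text):
    """Lower-case the text and split it into a set of words with punctuation stripped."""
    return {w.strip(".,!?;:\"'").lower() for w in text.split()} - {""}


def cooccurrence_rate(samples, target, word):
    """Fraction of the samples mentioning target that also contain word."""
    with_target = [tokens(s) for s in samples if target in tokens(s)]
    if not with_target:
        return 0.0
    return sum(word in t for t in with_target) / len(with_target)


# Invented toy samples; real analyses would use many model-generated completions.
samples = [
    "Alpha people are calm and kind.",
    "Alpha people are kind.",
    "Beta people are calm.",
]

for group in ("alpha", "beta"):
    for word in ("calm", "kind"):
        print(group, word, cooccurrence_rate(samples, group, word))
```

Comparing these rates across groups is what reveals a skew: a word that co-occurs much more often with one group than with the others points to an association the model picked up from its training data.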
Despite many limitations and weaknesses, the researchers conclude that very large language models may be an important ingredient in the development of adaptable, general language systems.
Read the full paper, “Language Models are Few-Shot Learners”, for details.