Recently, researchers from UC Berkeley, the University of Maryland, and UC Irvine showed that the accuracy of GPT-3, the world’s largest language model, can be highly unstable across different prompts. They also developed a contextual calibration method that improves the accuracy of GPT-3 by up to 30%.
GPT-3 by OpenAI broke new ground in natural language processing (NLP). From writing a fake blog to posting Reddit comments and roasting Elon Musk’s tweets, the autoregressive language model with 175 billion parameters has shown its immense potential.
Why This Research
Few-shot learning is a crucial aspect of artificial intelligence: the ability to learn a task from only a handful of examples. Language models like GPT-3 can perform numerous tasks when given a few examples in a natural language prompt. GPT-3 performs few-shot “in-context” learning, meaning the model learns the task from the prompt alone, without any parameter updates.
Few-shot learning has several practical advantages over the standard approach of fine-tuning:
- Few-shot learning allows practitioners to prototype NLP models rapidly.
- It provides a fully natural language interface to a machine learning model, allowing users to create NLP systems without any technical expertise in the field.
- Since in-context learning reuses the same model for each task, few-shot learning reduces memory requirements and system complexity when serving many different tasks.
However, despite these advantages, language models like GPT-3 can still be highly unstable across different prompts. A prompt has three components:
- A format
- A set of training examples
- A permutation (ordering) for those examples
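Concretely, these three components are assembled into a single prompt string that the model completes. A minimal sketch of this assembly (the template, label names, and examples below are illustrative, not taken from the paper):

```python
# Sketch of how a few-shot prompt is built from its three components:
# a format (template), a set of training examples, and a permutation
# (ordering) of those examples.

def build_prompt(examples, permutation, test_input,
                 template="Input: {text}\nSentiment: {label}\n"):
    """Format the training examples in the given order, then append
    the unlabeled test input for the model to complete."""
    parts = [template.format(text=examples[i][0], label=examples[i][1])
             for i in permutation]
    parts.append("Input: {}\nSentiment:".format(test_input))
    return "\n".join(parts)

examples = [("I loved this film.", "Positive"),
            ("A complete waste of time.", "Negative")]

# The same examples in two different orders yield two different prompts,
# which, as the researchers found, can lead to very different accuracy.
prompt_a = build_prompt(examples, [0, 1], "An instant classic.")
prompt_b = build_prompt(examples, [1, 0], "An instant classic.")
```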
The researchers experimented with three sizes of GPT-3 (2.7 billion, 13 billion, and 175 billion parameters) and with GPT-2 (1.5 billion parameters). The findings showed that the accuracy of GPT-3 varies across different training examples, permutations, and prompt formats.
Firstly, the accuracy of GPT-3 depends heavily on both the selection and the ordering of the training examples. In this experiment, the researchers used a fixed prompt format and chose different random sets of training examples. For each set, they evaluated the accuracy under all possible permutations.
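This permutation sweep can be sketched as follows. The scoring function is a stand-in, since actually querying GPT-3 against a validation set is beyond the scope of this sketch:

```python
import itertools

def accuracy_for_ordering(ordered_examples):
    # Placeholder: in the real experiment, this would build a prompt
    # from ordered_examples, query the model on a validation set, and
    # return the resulting accuracy. A dummy score keeps the sketch runnable.
    return 0.0

examples = ["ex1", "ex2", "ex3", "ex4"]  # one random set of training examples

# Evaluate every possible ordering of this set (4! = 24 permutations).
scores = {perm: accuracy_for_ordering(perm)
          for perm in itertools.permutations(examples)}
print(len(scores))  # 24 orderings for 4 examples
```

The spread between the best- and worst-scoring permutations is what the paper reports as instability.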
Secondly, the accuracy depends heavily on the prompt format. Here, the researchers kept the set of training examples and their permutation fixed but varied the prompt format. The formats included question-answer templates, conversation templates, prompts that resemble web pages, and variations on the label names.
While analysing why GPT-3’s accuracy varies across training examples, permutations, and prompt formats, the researchers found that the variance arises because language models are biased towards outputting answers that:
- appear frequently in the prompt (majority label bias),
- appear near the end of the prompt (recency bias), and
- are common in the pre-training data (common token bias).
The Tech Behind
The researchers introduced contextual calibration, a simple method that makes language models better few-shot learners. The idea is to first estimate the model’s bias towards each answer by feeding it a content-free input (such as “N/A”), and then rescale the model’s output probabilities so that the content-free input would receive a uniform prediction.
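Based on the paper’s description, the calibration step can be sketched with NumPy. The probability values below are made up for illustration; only the rescaling rule follows the paper:

```python
import numpy as np

def contextual_calibration(p, p_cf):
    """Calibrate class probabilities p using the probabilities p_cf that
    the model assigns to the same answers for a content-free input
    (e.g. "N/A"). This follows the paper's diagonal rescaling:
    q is proportional to diag(p_cf)^(-1) @ p."""
    q = p / p_cf          # divide out the model's bias towards each answer
    return q / q.sum()    # renormalize to a probability distribution

# Suppose that, given a content-free input, the model already leans
# 70/30 towards "Positive" -- that is its contextual bias.
p_cf = np.array([0.7, 0.3])   # P(Positive), P(Negative) for "N/A"

# Raw (uncalibrated) prediction for a real test input:
p = np.array([0.6, 0.4])

q = contextual_calibration(p, p_cf)
# After calibration the prediction flips to "Negative": the raw 0.6 for
# "Positive" is weaker than the model's built-in 0.7 bias towards it.
```

Because the final prediction is just the argmax of the calibrated scores, this rescaling is cheap: it requires only one extra query with the content-free input.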
The researchers evaluated the effectiveness of contextual calibration across all datasets and language models. They found that the method improved accuracy by up to 30%, reduced variance across prompts, and made tools like GPT-3 and GPT-2 more effective.
The researchers used datasets for three main tasks: text classification, fact retrieval, and information extraction, with a fixed prompt format for each dataset. Text classification was studied using six datasets:
- Sentiment analysis using SST-2
- 6-way question classification using TREC
- Textual entailment using the 3-way CB dataset
- Binary entailment using RTE from SuperGLUE
- Topic classification using the 4-way AGNews
- 14-way topic classification using DBPedia
The fact retrieval task was evaluated with the LAMA dataset, which consists of knowledge-base triples placed into templates with a missing object. For information extraction, the researchers used two slot-filling datasets: ATIS and MIT Movies trivia10k13.
Read the paper here.