Two years back, NYU professors Gary Marcus and Ernest Davis published an article in MIT Technology Review on GPT-3. The authors asked GPT-3 a series of questions to expose its poor grasp of reality: “Yesterday I dropped my clothes off at the dry cleaner’s, and I have yet to pick them up. Where are my clothes?” GPT-3 replied, “I have a lot of clothes.”
Clearly, large language models like GPT-3 are not good at multi-step reasoning. “Fundamentally, language is about relating sentences that you hear, and systems like GPT-3 never do that. You give them all the data in the world, and they are still not deriving the notion that language is about semantics,” said Gary Marcus. The question, then, is: how do we enable language models to perform reasoning tasks?
In a recent paper, “Chain of Thought Prompting Elicits Reasoning in Large Language Models,” Google introduced ‘chain of thought prompting’ to improve the reasoning abilities of language models. The method enables models to decompose multi-step problems into intermediate steps. The technique works on language models with more than 100 billion parameters.
Today, models like GPT-3 use the standard prompting method: the model is given input-output examples and asked to predict the answer for a test-time example. In comparison, chain of thought prompting nudges the model to produce intermediate reasoning steps before giving the final answer to a multi-step problem. A model-generated chain of thought tries to mimic an intuitive thought process. Further, Google said such a thought process can be elicited simply by including chain of thought examples in the prompt.
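The difference between the two prompting styles comes down to how the few-shot exemplars are written. The sketch below contrasts a standard exemplar (question and bare answer) with a chain of thought exemplar for the same question; the tennis-ball problem appears in Google's paper, but the exact prompt wording here is illustrative, not taken from their code.

```python
# Standard few-shot prompting: each exemplar pairs a question with
# only the final answer, then leaves the test question open-ended.
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"
)

# Chain of thought prompting: the same exemplar, but the answer now
# includes the intermediate reasoning steps before the final answer.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"
)
```

Given the chain of thought exemplar, a sufficiently large model tends to continue the test question with its own reasoning steps rather than jumping straight to an answer.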
In chain of thought reasoning, models decompose complex problems into intermediate steps that are solved individually, and the approach is language-based. Google researchers showed the method could improve performance on various reasoning tasks.
The method follows how humans naturally deliberate when presented with a multi-step reasoning problem. Google envisions language models analogously generating a coherent chain of thought before arriving at the answer. This improves performance across various reasoning tasks where standard few-shot prompting is insufficient, and the gains grow as language models scale.
The chain of thought prompting:
1. Allows models to decompose multi-step problems into intermediate steps, allowing additional computation to be allocated to problems requiring more reasoning steps.
2. Provides an interpretable window into the model’s behaviour to understand how it may have arrived at a particular answer. This allows developers to debug where the reasoning path went wrong.
3. Works for math word problems, symbolic manipulation, and commonsense reasoning. In principle, it applies to any task that humans can solve via language.
4. Can be readily elicited in sufficiently large off-the-shelf language models by including examples of chain of thought sequences into the exemplars of few-shot prompting.
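Point 2 above, interpretability, follows from the output format: because the model emits its reasoning as text ending in a final-answer sentence, the chain can be split into inspectable steps. Below is a minimal, hypothetical parser, assuming the paper's convention that completions end with "The answer is X."; the function name and splitting heuristic are my own, not Google's.

```python
import re

def parse_chain_of_thought(output: str):
    """Split a model completion into reasoning steps and a final answer.

    Assumes the exemplar convention that the chain ends with
    'The answer is X.' (a hypothetical but common format).
    """
    match = re.search(r"The answer is (.+?)\.", output)
    answer = match.group(1) if match else None
    # Naive step segmentation: one sentence per reasoning step.
    steps = [s.strip() for s in output.split(". ") if s.strip()]
    return steps, answer

output = ("Roger started with 5 balls. 2 cans of 3 tennis balls each is "
          "6 tennis balls. 5 + 6 = 11. The answer is 11.")
steps, answer = parse_chain_of_thought(output)
```

A developer can then scan `steps` to find the first incorrect intermediate statement when `answer` is wrong, which is exactly the debugging window the researchers describe.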
Google has tested chain of thought prompting on its Language Model for Dialogue Applications (LaMDA) and the Pathways Language Model (PaLM).
Google’s PaLM is built on the Pathways architecture. With Pathways, Google Research’s end goal is to build a single model that can generalise across domains and tasks while being highly efficient. PaLM achieved state-of-the-art few-shot performance across hundreds of language understanding and generation tasks.
In addition, the model demonstrated exceptional natural language understanding and generation capabilities on several BIG-bench tasks.
Google’s LaMDA aims to ensure a smooth conversing experience and more meaningful, lifelike conversations. Alphabet’s CEO, Sundar Pichai, said he spent some time with his son conversing with LaMDA (masquerading as the dwarf planet Pluto). It was magical as the model talked about the New Horizons spacecraft and the coldness of space, he added. Google said the architecture produces a model that can be trained to read many words, learn how they relate to each other, and predict what word comes next.
LaMDA and PaLM were tested on two arithmetic reasoning benchmarks, MultiArith and GSM8K, to evaluate their ability to solve multi-step math problems. The researchers manually composed chains of thought to include in the few-shot exemplars. Chain of thought prompting improved model performance, outperforming standard prompting at large model sizes.
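Scoring a model on benchmarks like GSM8K reduces to extracting the final number from each completion and comparing it against the gold answer. The sketch below is an illustrative scoring loop under that assumption; the function names and regex are hypothetical, not from the paper's evaluation code.

```python
import re

def extract_numeric_answer(completion: str):
    """Pull the final number from a completion ending in
    'The answer is N.' (assumed exemplar convention)."""
    m = re.search(r"The answer is \$?(-?[\d,]+(?:\.\d+)?)", completion)
    if not m:
        return None
    return float(m.group(1).replace(",", ""))

def accuracy(completions, golds):
    """Fraction of completions whose extracted answer matches gold."""
    correct = sum(
        extract_numeric_answer(c) == g for c, g in zip(completions, golds)
    )
    return correct / len(golds)

# Two toy completions: the first reasons correctly, the second makes
# an arithmetic slip in its final step.
preds = [
    "2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.",
    "23 - 20 = 3. 3 + 6 = 9. The answer is 8.",
]
score = accuracy(preds, [11.0, 9.0])
```

Because the reasoning is spelled out, the second completion's error is visible in its chain (the steps reach 9, but the stated answer is 8), which standard prompting would not reveal.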
For commonsense reasoning, Google evaluated the models on the CommonsenseQA and StrategyQA benchmarks. Model performance improved with scale, and the prompting led to small improvements. For instance, when asked: Is the following sentence plausible? “Joao Moutinho caught the screen pass in the NFC championship”, the model responded: “Joao Moutinho is a soccer player. The NFC championship is part of American football, not soccer. So the answer is no.”
Here’s another example of the chain of thought prompting for symbolic reasoning datasets:
Q: A coin is heads up. Maybelle flips the coin. Shalonda does not flip the coin. Is the coin still heads up?

A: The coin was flipped by Maybelle. So the coin was flipped 1 time, which is an odd number. The coin started heads up, so after an odd number of flips, it will be tails up. So the answer is no.
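The coin-flip task has a simple ground truth the model's chain can be checked against: the final state depends only on the parity of the number of flips. A minimal sketch of that ground-truth rule (my own helper, not from the paper):

```python
def coin_heads_up(start_heads: bool, flips: list) -> bool:
    """Ground truth for the coin-flip task: the coin ends heads up
    iff it started heads up and was flipped an even number of times
    (or started tails up and was flipped an odd number of times)."""
    n_flips = sum(flips)  # True means that person flipped the coin
    return start_heads == (n_flips % 2 == 0)

# Maybelle flips, Shalonda does not: one flip (odd), so not heads up,
# matching the model's chain of thought above.
result = coin_heads_up(True, [True, False])
```

This mirrors the reasoning in the model's answer: counting the flips, noting the parity, and applying it to the starting state.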
The arithmetic and commonsense reasoning experiments led the researchers to conclude that successful chain of thought reasoning is an emergent property of model scale: the technique yields gains only once models grow sufficiently large.