In 2016, OpenAI published a blog post, ‘Faulty Reward Functions in the Wild’, discussing an AI model that got creative and found a ‘counterintuitive’ way to optimise and reach its goal. The company realised the need to design a safe AI system to avoid misinterpretation of the specified goals.
Brian Christian’s book, The Alignment Problem, addresses this challenge: ensuring that ML models act in line with human intentions. Now, OpenAI has introduced a tool that it presents as a step towards a scalable solution to the alignment problem. According to the OpenAI team, such a solution “needs to work on tasks where model outputs are difficult or time-consuming for humans to evaluate.” To demonstrate the approach, the team tested a model on summarising entire books.
Introducing the tool
OpenAI’s tool combines recursive task decomposition with learning from human feedback. Models trained on smaller parts of the task assist humans in giving feedback on the broader task. The team collected demonstrations and comparisons from human labellers and fine-tuned GPT-3 on them, using behavioural cloning and reward modelling, so that the model could perform summarisation recursively.
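The article does not spell out how the comparisons are turned into a training signal. As a rough illustration only, the sketch below uses a pairwise logistic loss, a common formulation for learning a reward model from human comparisons; it is an assumption here, not a confirmed detail of OpenAI’s implementation. The function name `pairwise_loss` and its scalar inputs are hypothetical.

```python
import math


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Loss for one human comparison between two summaries.

    Assumption: the reward model assigns each summary a scalar score, and
    the loss is -log(sigmoid(score_preferred - score_rejected)). The loss is
    small when the summary the labeller preferred gets the higher score.
    """
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(diff)) == log(1 + exp(-diff)), written so that exp()
    # is only ever called on a non-positive number (avoids overflow).
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# The loss shrinks as the preferred summary is scored further above the other.
agree = pairwise_loss(2.0, 0.0)     # model agrees with the labeller
disagree = pairwise_loss(0.0, 2.0)  # model disagrees with the labeller
print(agree < disagree)
```

Under this assumed setup, behavioural cloning would correspond to ordinary supervised fine-tuning on the human demonstrations, while the comparisons feed a loss of this kind.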
How it works
At inference time, the model first summarises small sections of the book, then recursively summarises those summaries into higher-level summaries until a single summary of the entire book remains. “Our main result is a model that can be applied recursively to generate plausible summaries of entire books,” according to the research paper. Human labellers can supervise and evaluate the model’s output even if they have not read the books themselves.
Source: OpenAI’s paper
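The recursive procedure described above can be sketched in a few lines. This is a minimal illustration, not OpenAI’s code: `summarize` stands in for the fine-tuned model, `toy_summarize` is a hypothetical placeholder that simply keeps the first ten words, and splitting by character count is an assumption made to keep the example short.

```python
from typing import Callable, List


def chunk(text: str, max_chars: int) -> List[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def summarize_book(text: str, summarize: Callable[[str], str], max_chars: int) -> str:
    """Summarise sections, then summarise the summaries, until one remains."""
    if len(text) <= max_chars:
        # Short enough for a single call to the summariser.
        return summarize(text)
    section_summaries = [summarize(piece) for piece in chunk(text, max_chars)]
    combined = " ".join(section_summaries)
    if len(combined) >= len(text):
        # Guard against a summariser that does not shorten its input,
        # which would otherwise recurse forever.
        raise ValueError("summarize() must shorten the text")
    # The concatenated summaries become the input for the next level up.
    return summarize_book(combined, summarize, max_chars)


def toy_summarize(text: str) -> str:
    """Hypothetical stand-in for the model: keep the first ten words."""
    return " ".join(text.split()[:10])


book = "word " * 5000
print(summarize_book(book, toy_summarize, max_chars=500))
```

Because each level only ever sees inputs of bounded size, the same routine applies to a book of any length; only the depth of the recursion changes.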
Approach
Summarisation is one of the tasks that large pretrained models struggle with. OpenAI’s previous blog posts describe training a model with reinforcement learning from human feedback, a method that helped align the model’s summaries with human preferences.
Source: OpenAI
This is the structure of the algorithm used for shorter texts. To achieve the same results on an entire book, the team applied ‘recursive task decomposition’.
The process involves a human ‘decomposing’, or breaking up, a parent task into several subtasks. Each subtask is shorter and simpler than the parent task, and having the responses to the subtasks helps a human provide a training signal for the parent task.
This makes evaluation easier for humans: the person doesn’t need to have read the book beforehand, since they can refer to the shorter parts. It also makes it possible to trace the summary-writing process, to trace each part of a summary back to actual events in the book, and to apply the tool to books of unbounded length.
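One way to picture the traceability described above is as a tree in which every summary keeps a link to the pieces it was written from. The sketch below is an illustrative data structure, not something taken from OpenAI’s paper; `SummaryNode` and `source_passages` are hypothetical names, and the strings are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SummaryNode:
    """A summary plus the lower-level nodes it was written from."""
    summary: str
    source_text: str = ""  # filled in only for leaves (original book passages)
    children: List["SummaryNode"] = field(default_factory=list)


def source_passages(node: SummaryNode) -> List[str]:
    """Collect, in reading order, the book passages behind a summary."""
    if not node.children:
        return [node.source_text]
    passages: List[str] = []
    for child in node.children:
        passages.extend(source_passages(child))
    return passages


# Two leaf summaries feed one higher-level summary.
leaf_a = SummaryNode(summary="summary of passage 1", source_text="passage 1")
leaf_b = SummaryNode(summary="summary of passage 2", source_text="passage 2")
root = SummaryNode(summary="summary of both", children=[leaf_a, leaf_b])
print(source_passages(root))
```

A labeller checking `root` only needs to read the two child summaries, and can follow the links down to the original passages if a claim looks doubtful.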
An illustration of the process of breaking up the text of Alice’s Adventures in Wonderland to produce a short summary.
Results
The summaries contained the important events from the book and abstractly synthesised the details, but the team also admitted that the tool often leaves out important information or fails to grasp the broader context.
Still, the model significantly outperformed OpenAI’s behavioural cloning baseline. “A small number of summaries approach human-level quality,” the team noted. The most ‘sensible’ summaries received substantial ratings, in some cases matching the average quality of human-written summaries.
Humans who had read the book rated the model’s summaries 6/7 in 5% of cases and 5/7 in 15% of cases. The model achieved state-of-the-art results on the BookSum dataset for book-length summarisation. A zero-shot question-answering model can also use the summaries to obtain state-of-the-art results on the NarrativeQA dataset for book-length question answering.
The results show that combining recursive task decomposition with learning from human feedback can be a practical approach to scalable oversight for difficult long-document NLP tasks, broadening the scope for future models.
“Our current approach to this problem is to empower humans to evaluate machine learning model outputs using assistance from other models,” stated the blog. The team hopes to create similar and better tools in the future to enable large-scale empirical work on scaling alignment techniques.