OpenAI Might Just Have Solved The Issue Of Faulty Reward Functions In the Wild

Human labellers can supervise and evaluate the tool’s output even if they have not read the books themselves.

In 2016, OpenAI published a blog post, ‘Faulty Reward Functions in the Wild’, discussing an AI model that got creative and found a ‘counterintuitive’ way to optimise and reach its goal. The company realised the need to design a safe AI system to avoid misinterpretation of the specified goals. 

Brian Christian’s book, The Alignment Problem, addresses this problem: ensuring that ML models act in line with human intentions. Now, OpenAI has introduced a tool aimed at a scalable solution to the alignment problem. According to the OpenAI team, such a solution “needs to work on tasks where model outputs are difficult or time-consuming for humans to evaluate.” To demonstrate the approach, the team tested a model on summarising entire books.

Introducing the tool

OpenAI’s tool combines recursive task decomposition with learning from human feedback. Models trained on smaller parts of the task assist humans in giving feedback on the broader task. The team collected human demonstrations and comparisons and used them to fine-tune GPT-3 for summarisation, using behavioural cloning and reward modelling.
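The article does not state the training objective, but reward modelling from human comparisons is commonly framed as a pairwise logistic loss: the reward model is penalised when it scores the summary a labeller preferred below the one they rejected. The sketch below illustrates that idea under this assumption. The function names are hypothetical and the rewards are plain numbers standing in for the outputs of a learned model.

```python
import math


def preference_probability(reward_preferred: float, reward_other: float) -> float:
    """Modelled probability that the labeller picks the preferred summary.

    Assumes a logistic (Bradley-Terry style) model over the reward difference.
    """
    return 1.0 / (1.0 + math.exp(-(reward_preferred - reward_other)))


def comparison_loss(reward_preferred: float, reward_other: float) -> float:
    """Negative log-likelihood of the labeller's choice for one comparison."""
    return -math.log(preference_probability(reward_preferred, reward_other))


# The loss is small when the reward model agrees with the labeller,
# and large when it ranks the two summaries the other way round.
agrees = comparison_loss(reward_preferred=2.0, reward_other=0.0)
disagrees = comparison_loss(reward_preferred=0.0, reward_other=2.0)
```

Minimising this loss over many labelled comparisons pushes the reward model to score summaries in the order humans prefer, and that score can then serve as the training signal for the summarisation policy.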



How it works

At inference time, the model first summarises small sections of the book, then recursively summarises those summaries into higher-level summaries until a single summary of the entire book remains. “Our main result is a model that can be applied recursively to generate plausible summaries of entire books,” according to the research paper. Human labellers can supervise and evaluate the model’s output even if they have not read the books themselves.

Source: OpenAI’s paper
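The recursive procedure can be sketched in a few lines of Python. This is an illustration of the idea only, not OpenAI’s implementation: `chunk_text`, `summarise_recursively` and the placeholder `first_words` summariser are hypothetical names, and a real system would call a trained model where `first_words` simply keeps the opening words of a passage.

```python
from typing import Callable, List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def summarise_recursively(
    text: str, summarise: Callable[[str], str], chunk_size: int = 50
) -> str:
    """Summarise each chunk, join the summaries, and repeat until one chunk remains.

    Assumes summarise() returns something shorter than its input;
    otherwise the recursion would not terminate.
    """
    chunks = chunk_text(text, chunk_size)
    if len(chunks) <= 1:
        return summarise(text)
    combined = " ".join(summarise(chunk) for chunk in chunks)
    return summarise_recursively(combined, summarise, chunk_size)


def first_words(passage: str, keep: int = 10) -> str:
    """Placeholder summariser (an assumption): keeps the first few words."""
    return " ".join(passage.split()[:keep])


# A stand-in "book" of 1,000 words, reduced level by level to a 10-word summary.
book = " ".join(f"w{i}" for i in range(1000))
book_summary = summarise_recursively(book, first_words)
```

Because every level only ever sees a passage of bounded size, the same summariser can in principle be applied to a book of any length.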


One of the challenges faced by large pretrained models is summarisation. In earlier blog posts, OpenAI described training a model with reinforcement learning from human feedback, a method that helped align the model’s summaries with human preferences.

Source: OpenAI

That is the approach OpenAI used for shorter texts. To achieve similar results on an entire book, the team applied ‘recursive task decomposition’.

The process involves a human ‘decomposing’, or breaking up, a parent task into several subtasks. Each subtask is shorter and simpler than the parent task, and having the responses to the subtasks helps a human provide a training signal for the parent task.

This makes evaluation easier for humans: the evaluator doesn’t need to have read the book beforehand, since they can refer to the shorter parts. It also makes it easier to trace the summary-writing process back to actual events in the book, and to apply the tool to books of unbounded length.
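To see why decomposition makes summaries traceable, it helps to keep the subtasks as a tree. In the hedged sketch below, every node stores its summary and the nodes it was built from, so any higher-level summary can be followed back to the original passages. `SummaryNode`, `build_tree`, `leaf_sources` and the `fan_in` parameter are hypothetical names chosen for this illustration, and the one-word summariser in the usage lines is a stand-in for a model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SummaryNode:
    summary: str
    children: List["SummaryNode"] = field(default_factory=list)
    source: str = ""  # set only on leaves: the original passage


def build_tree(
    passages: List[str], summarise: Callable[[str], str], fan_in: int = 2
) -> SummaryNode:
    """Summarise each passage, then merge groups of fan_in nodes until one root remains."""
    if not passages:
        raise ValueError("need at least one passage")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = [SummaryNode(summary=summarise(p), source=p) for p in passages]
    while len(level) > 1:
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        level = [
            SummaryNode(
                summary=summarise(" ".join(node.summary for node in group)),
                children=group,
            )
            for group in groups
        ]
    return level[0]


def leaf_sources(node: SummaryNode) -> List[str]:
    """Collect the original passages that a summary ultimately depends on."""
    if not node.children:
        return [node.source]
    return [src for child in node.children for src in leaf_sources(child)]


# Usage with a stand-in summariser that keeps only the first word.
passages = ["a b c", "d e f", "g h i"]
root = build_tree(passages, lambda text: text.split()[0])
```

An evaluator checking `root.summary` only needs to read the summaries of its direct children, and `leaf_sources` on any child shows which part of the book that child covers.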

An illustration of how the text of Alice’s Adventures in Wonderland is broken up to produce a short summary.



The summaries contained the important events from the book and abstractly synthesised the details, but the team also admitted that the tool often left out important information or failed to grasp the broader context.

Still, the model significantly outperformed OpenAI’s behavioural cloning baseline. “A small number of summaries approach human-level quality,” the team noted. The ‘sensible’ summaries received high ratings, in some cases matching the average quality of human-written summaries.

Humans who had read the book gave the model’s summaries a rating of 6/7 in 5% of cases and a rating of 5/7 in 15% of cases. The model achieved state-of-the-art results on the BookSum dataset for book-length summarisation. A zero-shot question-answering model can also use the summaries to obtain state-of-the-art results on the NarrativeQA dataset for book-length question answering.

The results proved that combining recursive task decomposition with learning from human feedback can be a practical approach to scalable oversight for difficult long-document NLP tasks, broadening the scope for future models. 

“Our current approach to this problem is to empower humans to evaluate machine learning model outputs using assistance from other models,” stated the blog. The team hopes to create similar and better tools in the future to enable large-scale empirical work on scaling alignment techniques.


Avi Gopani
Avi Gopani is a technology journalist who seeks to analyse industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories that are curated with a focus on the evolving technologies of artificial intelligence and data analytics.
