
DeepMind Wants to Take Humans Out of RLHF

The algorithm toggles between generating synthetic training data in the Grow step and optimising policies using filtered data in the Improve step.


DeepMind, the king of reinforcement learning, has introduced the Reinforced Self-Training (ReST) algorithm, a technique poised to redefine the landscape of large language models (LLMs). ReST is a notable innovation in reinforcement learning from human feedback (RLHF), aiming to remove humans from the loop and drive self-learning agents.

The paper, titled ‘Reinforced Self-Training (ReST) for Language Modeling’, is available on arXiv.

Central to the ReST approach is the decoupling of two distinct stages, Grow and Improve, which together bring data efficiency and stability. The algorithm alternates between generating synthetic training data in the Grow step and optimising policies on filtered data in the Improve step. This break from the traditional online RL process proves instrumental in improving the model’s output quality.
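To make the two-step structure concrete, here is a minimal sketch of how such a Grow/Improve loop could look. The helper names (sample_from_policy, reward_model, pick_threshold, finetune_offline) are illustrative assumptions, not the paper’s actual interfaces.

```python
# Illustrative sketch of a ReST-style training loop (helper functions are
# hypothetical placeholders, not DeepMind's implementation).

def rest_training(policy, prompts, reward_model,
                  grow_steps=2, improve_steps=4, samples_per_prompt=4):
    for _ in range(grow_steps):
        # Grow: build a synthetic dataset by sampling from the current policy
        # and scoring every sample with a (learned or metric-based) reward.
        dataset = [
            (x, y, reward_model(x, y))
            for x in prompts
            for y in sample_from_policy(policy, x, n=samples_per_prompt)
        ]

        for _ in range(improve_steps):
            # Improve: keep only high-reward samples and fine-tune offline on them.
            threshold = pick_threshold(dataset)  # e.g. a reward quantile
            filtered = [(x, y) for x, y, r in dataset if r >= threshold]
            policy = finetune_offline(policy, filtered)

    return policy
```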

According to the researchers, ReST offers a safeguard against “reward hacking,” a scenario where models exploit vulnerabilities in learned reward models.

Starting from an initial LLM policy, ReST creates a dataset by generating samples from that base policy. These samples are then used to improve the LLM policy via offline RL algorithms. ReST’s key advantage over conventional online RLHF techniques is efficiency: because the dataset is produced offline, it can be reused across training.
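The data recycling falls out of this offline setup: one batch of Grow samples can be filtered again and again, for instance at progressively stricter reward thresholds, with no new generation in between. The toy dataset, reward values, and the finetune_offline helper below are assumptions for illustration only.

```python
# Toy illustration of data reuse: one Grow dataset, several Improve passes.
# Each triple is (source, model translation, reward); the values are made up,
# and finetune_offline is a hypothetical placeholder.

grow_dataset = [
    ("Guten Morgen.", "Good morning.", 0.92),
    ("Guten Morgen.", "Good tomorrow.", 0.31),
    ("Wie geht es dir?", "How are you?", 0.88),
    ("Wie geht es dir?", "How goes it to you?", 0.45),
]

for threshold in (0.4, 0.6, 0.8):  # progressively stricter filtering
    subset = [(src, out) for src, out, r in grow_dataset if r >= threshold]
    policy = finetune_offline(policy, subset)  # same samples, reused each pass
```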

Although ReST is a versatile recipe for many generative learning settings, the paper’s primary focus is its application to machine translation.

Unlike conventional approaches that solely maximise likelihood, ReST uses human preferences to align model outputs with what people actually want. In doing so, it sidesteps limitations intrinsic to online RLHF methods, such as the computationally expensive need to generate new samples continually during training.
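The contrast shows up in the loss being optimised: standard fine-tuning maximises the likelihood of every reference output, while a ReST-style Improve step computes that likelihood only on model samples that a preference-trained reward model scores above a threshold. A rough sketch, where nll and reward_model are assumed placeholders:

```python
# Rough sketch of the two objectives. `nll(model, src, out)` stands for the
# negative log-likelihood of `out` given `src`; both it and `reward_model`
# are hypothetical placeholders.

def mle_loss(model, reference_pairs):
    # Conventional fine-tuning: maximise likelihood of every reference output.
    return sum(nll(model, src, ref) for src, ref in reference_pairs) / len(reference_pairs)

def rest_improve_loss(model, sampled_pairs, reward_model, threshold=0.8):
    # Preference-aligned fine-tuning: likelihood only on the model's own samples
    # that the reward model (trained from human preferences) rates highly enough.
    kept = [(src, out) for src, out in sampled_pairs
            if reward_model(src, out) >= threshold]
    return sum(nll(model, src, out) for src, out in kept) / max(len(kept), 1)
```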

Evaluation methods

ReST’s efficacy was put to the test on the challenging task of machine translation, where the goal is to convert input sequences into target output sequences. The task was formulated as a Markov Decision Process (MDP), with established metrics such as BLEU, BLEURT, and MetricX used to score translation quality and supply the reward.
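In that MDP framing, the state is the source sentence plus the tokens generated so far, an action appends the next token, and the reward arrives once the translation is complete. As a small, runnable stand-in for the scoring step, the snippet below uses sentence-level BLEU from the sacrebleu package; the paper also relies on learned metrics such as BLEURT and MetricX, whose APIs are not shown here.

```python
# Terminal reward for a completed translation, scored with sentence-level BLEU.
# Requires: pip install sacrebleu
import sacrebleu

def terminal_reward(hypothesis: str, reference: str) -> float:
    """Reward given only at the end of an episode, i.e. a finished translation."""
    return sacrebleu.sentence_bleu(hypothesis, [reference]).score  # 0-100 scale

print(terminal_reward("The cat sat on the mat.", "The cat sat on the mat."))   # high
print(terminal_reward("A feline rests on a rug.", "The cat sat on the mat."))  # low
```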

Notably, on well-known translation benchmarks including IWSLT 2014, WMT 2020, and Web Domain, ReST outperformed conventional online RL with Proximal Policy Optimization (PPO) given an equivalent amount of data.

Underlying ReST’s results is its decoupling strategy, which brings three benefits. First, the periodic generation of new data during the Grow phase enhances data efficiency and enables iterative policy refinement. Second, inspecting the data generated during Grow makes it easier to identify alignment issues and potential reward hacking. Lastly, offline RL losses reduce the risk of reward hacking compared with continuous online optimisation.

The paper highlights that the best results were achieved with a simple supervised training loss. This underscores how difficult offline RL is in large discrete action spaces, and points to the need for more effective offline RL algorithms for language tasks.

Despite its achievements, ReST does expose a discrepancy between automated metrics and human evaluations, indicating that learned reward models still fall short of fully representing human judgment. This misalignment underscores the continuing importance of integrating human preferences into the algorithm through annotation data.

Last month, Google DeepMind also introduced RT-2, the first vision-language-action (VLA) model, which controls robots more effectively than any model before it. Aptly named ‘robotics transformer’, or RT, the advancement is set to change the way robots interact with their environment and execute tasks with precision.

RT-2 is a quick learner: it grows smarter over time, understands both words and images, and can tackle tricky challenges it has never faced or been trained on.
