DeepMind, the king of reinforcement learning, has introduced the Reinforced Self-Training (ReST) algorithm, a technique poised to redefine the landscape of large language models (LLMs). ReST emerges as a formidable innovation in reinforcement learning from human feedback (RLHF), aiming to take humans out of the loop and drive self-learning agents.
Central to the ReST approach is the decoupling of two distinct stages, Grow and Improve, which together deliver data efficiency and stability. The algorithm alternates between generating synthetic training data in the Grow step and optimising the policy on filtered data in the Improve step. This break from the traditional online RL process proves instrumental in refining the model's output quality.
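The alternating Grow/Improve structure can be sketched as a simple loop. The following is a toy illustration only: the policy, reward, and offline-update functions below are hypothetical stand-ins built on made-up numeric "quality" scores, not DeepMind's actual implementation.

```python
import random

class ToyPolicy:
    """Stand-in for an LLM policy: samples outputs with some noise."""
    def __init__(self, quality=0.5):
        self.quality = quality

    def generate(self, prompt, rng):
        # A sample's "quality" is drawn around the policy's current level.
        return min(1.0, max(0.0, self.quality + rng.uniform(-0.3, 0.3)))

def reward(sample):
    # Toy reward model: score the sample directly by its quality.
    return sample

def train_offline(policy, filtered):
    # Toy offline update: nudge the policy toward the kept samples' mean.
    if filtered:
        target = sum(filtered) / len(filtered)
        policy.quality += 0.5 * (target - policy.quality)
    return policy

def rest(policy, prompts, grow_steps=3, improve_steps=2, threshold=0.5, seed=0):
    rng = random.Random(seed)
    for _ in range(grow_steps):
        # Grow: generate a synthetic dataset from the current policy.
        dataset = [policy.generate(p, rng) for p in prompts]
        t = threshold
        for _ in range(improve_steps):
            # Improve: keep only high-reward samples, fine-tune offline,
            # then raise the filtering bar for the next pass.
            filtered = [s for s in dataset if reward(s) >= t]
            policy = train_offline(policy, filtered)
            t += 0.1
    return policy

p = rest(ToyPolicy(), prompts=range(50))
```

The key point the sketch captures is that the expensive sampling (Grow) happens only periodically, while several offline Improve passes reuse the same dataset with progressively stricter filtering.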
According to the researchers, ReST offers a safeguard against “reward hacking,” a scenario where models exploit vulnerabilities in learned reward models.
Starting from an initial LLM policy, ReST operates by creating a dataset of samples generated from that base policy. These samples are subsequently used to enhance the LLM policy via offline RL algorithms. The distinct advantage of ReST over conventional online RLHF techniques is its efficiency: because the dataset is produced offline, the same data can be reused across training passes.
Although ReST offers a versatile solution for various generative learning scenarios, the primary focus lies in its application within the context of machine translation.
Unlike conventional approaches that solely maximise likelihood, ReST employs human preference signals to align model outputs with what people actually want. In doing so, it overcomes limitations intrinsic to online RLHF methods, such as the computationally expensive need to draw fresh samples throughout training.
Evaluation methods
ReST’s efficacy was put to the test on the challenging task of machine translation, where the goal is to convert input sequences into target output sequences. The task was formulated as a Markov Decision Process (MDP), leveraging well-established metrics such as BLEU, BLEURT, and MetricX to score translation quality.
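In the MDP framing, the state is the source sentence plus the partial translation, each emitted token is an action, and the metric score serves as the terminal reward. A toy illustration, using a crude unigram-overlap function as a stand-in for real sequence-level metrics like BLEU or BLEURT (none of these names come from the paper):

```python
def unigram_overlap(hypothesis, reference):
    """Toy terminal reward: fraction of hypothesis tokens found in the
    reference. A crude stand-in for metrics such as BLEU or BLEURT."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    return sum(1 for tok in hyp if tok in ref) / len(hyp)

# One episode: the "state" is (source, partial output); each token is an action.
source = "le chat est noir"
actions = ["the", "cat", "is", "black"]   # tokens chosen step by step
hypothesis = " ".join(actions)            # output at the terminal state
score = unigram_overlap(hypothesis, "the cat is black")
```

Because the reward arrives only at the end of the sequence, scoring generated samples after the fact (as in the Grow step) is natural in this formulation.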
Notably, benchmarking on renowned translation quality benchmarks including IWSLT 2014, WMT 2020, and Web Domain, ReST demonstrated its prowess by outperforming the conventional online Proximal Policy Optimization (PPO) RL with an equivalent amount of data.
Underlying ReST’s remarkable results is its unique decoupling strategy, which serves a trifecta of benefits. First, the periodic generation of new data during the Grow phase enhances data efficiency, enabling iterative policy refinements. Second, real-time monitoring during Grow facilitates the identification of alignment issues and potential reward hacking. Lastly, the adoption of offline RL losses minimizes the risk of reward hacking compared to continuous online optimization methods.
The paper highlights that the best results were achieved with a simple supervised training loss for ReST. This underscores the complexity of offline RL in extensive discrete action spaces, emphasizing the need for further exploration into more effective offline RL algorithms for language tasks.
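Concretely, that simple supervised loss amounts to ordinary negative log-likelihood computed only over the samples whose reward clears the filtering threshold. A minimal sketch with made-up numbers; a real setup would use an LLM's token-level likelihoods rather than a single probability per sample:

```python
import math

def filtered_nll(samples, threshold):
    """Mean negative log-likelihood over reward-filtered samples only.

    Each sample is a (model_probability, reward) pair -- a toy stand-in
    for token-level log-likelihoods from an actual LLM.
    """
    kept = [p for p, r in samples if r >= threshold]
    if not kept:
        return 0.0
    return -sum(math.log(p) for p in kept) / len(kept)

samples = [(0.9, 0.8), (0.2, 0.3), (0.6, 0.9), (0.1, 0.1)]
# Only the two high-reward samples (rewards 0.8 and 0.9) contribute.
loss = filtered_nll(samples, threshold=0.5)
```

Low-reward samples simply drop out of the objective, so the model is cloned toward its own best outputs rather than pushed by an explicit RL loss.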
Despite its achievements, ReST does expose a discrepancy between automated metrics and human evaluations, indicating that learned reward models still fall short of fully representing human judgment. This misalignment underscores the continuing importance of integrating human preferences into the algorithm through annotation data.
Last month, Google DeepMind also introduced RT-2, the first vision-language-action (VLA) model, which is more efficient at robot control than any model before it. Aptly named “robotics transformer”, or RT, this advancement is set to change the way robots interact with their environment and execute tasks with precision.
RT-2 is a quick learner. The model grows smarter over time and readily understands both words and images, enabling it to tackle tricky challenges it has never faced before or been trained on.