
Reinforcement Learning Craves Less Human, More AI

Google Research has proposed reinforcement learning from AI feedback (RLAIF) to train models with less reliance on human labels


Human feedback has proven essential in training machine learning algorithms. Back in 2019, the director of research at IBM emphasised that while reinforcement learning holds promise, relying solely on this approach can be challenging due to its heavy dependence on trial and error.

Reinforcement learning is Google DeepMind’s preferred technique for training its well-known systems such as AlphaGo and AlphaStar, and its human-in-the-loop variant, reinforcement learning from human feedback (RLHF), has proven particularly effective in aligning large language models (LLMs) with human preferences.

Furthermore, most large language models, including OpenAI’s ChatGPT, have been trained with this reward-based approach. Even Meta’s PyTorch recently upgraded its RLHF components so that developers with limited RL knowledge can more easily build an RLHF training loop.
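To make the idea concrete, below is a minimal, illustrative sketch of what a reward-driven policy update looks like in plain PyTorch. It is not PyTorch’s actual RLHF tooling: the `policy`, `reward_model` and `tokenizer` objects are assumed to be a HuggingFace-style causal LM, a learned scoring model and its tokenizer, and the update shown is a simple REINFORCE-style step rather than the PPO objective used in production systems such as ChatGPT.

```python
import torch
import torch.nn.functional as F

def rlhf_step(policy, reward_model, tokenizer, prompts, optimizer):
    """One simplified RLHF update: sample, score with a reward model, update the policy."""
    # 1. Sample responses from the current policy (assumes a HF-style generate()).
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    responses = policy.generate(**inputs, max_new_tokens=64, do_sample=True)

    # 2. Score the sampled sequences with the learned reward model (hypothetical module).
    rewards = reward_model(responses).squeeze(-1)              # shape: [batch]

    # 3. Log-probability of the sampled tokens under the current policy.
    #    (For simplicity, prompt tokens are included; real setups mask them out.)
    logits = policy(responses).logits[:, :-1, :]
    labels = responses[:, 1:]
    logp = F.log_softmax(logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    seq_logp = logp.sum(dim=-1)                                 # shape: [batch]

    # 4. REINFORCE-style loss: raise the likelihood of high-reward generations.
    loss = -(rewards.detach() * seq_logp).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```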

However, a prime hurdle lies in gathering high-quality human preference labels. This is where reinforcement learning from AI feedback (RLAIF) comes into the picture: a framework from Google Research for training models with reduced reliance on human annotation. The researchers found that the two approaches perform similarly, with RLHF holding a slight but not significant edge.
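As a rough sketch of how AI feedback replaces the human annotator, the snippet below asks an off-the-shelf LLM to pick the better of two candidate summaries and collects the results as preference pairs; those pairs can then train a reward model exactly as in RLHF. The `ask_llm` helper and the prompt wording are illustrative assumptions, not the paper’s actual setup.

```python
PREFERENCE_PROMPT = """A good summary is concise, accurate and coherent.
Text: {text}
Summary A: {a}
Summary B: {b}
Which summary is better? Answer with "A" or "B"."""

def ai_preference_label(ask_llm, text, summary_a, summary_b):
    """Return 0 if the LLM labeler prefers summary A, 1 if it prefers B."""
    answer = ask_llm(PREFERENCE_PROMPT.format(text=text, a=summary_a, b=summary_b))
    return 0 if answer.strip().upper().startswith("A") else 1

def build_preference_dataset(ask_llm, examples):
    """examples: iterable of (text, summary_a, summary_b) triples."""
    dataset = []
    for text, a, b in examples:
        choice = ai_preference_label(ask_llm, text, a, b)
        dataset.append({
            "text": text,
            "chosen": a if choice == 0 else b,
            "rejected": b if choice == 0 else a,
        })
    return dataset
```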

To Human or Not

One of the areas where ChatGPT has not been able to thrive yet is summarising documents. Since the chatbot debuted late last year, researchers have been exploring new methods for generating concise summaries via its more potent GPT-4-based paid version.

Interestingly, the latest Google study found that human evaluators preferred RLAIF-generated summaries over the baseline supervised fine-tuning (SFT) model 71% of the time and RLHF-generated summaries 73% of the time. In a head-to-head comparison, however, the two were equally favoured, with a roughly 50% win rate. Summaries from both methods were also preferred over human-written reference summaries, with RLAIF at 79% and RLHF at 80%.

Meanwhile, one important caveat is that both recipes tend to produce longer summaries than the SFT policy, which could contribute to the perceived quality gains. Further analysis, however, showed that even after controlling for length, both approaches, with and without AI feedback, still outperformed the SFT policy by a similar margin.

While the community continues to grapple with summarisation, RLAIF appears to be a viable alternative to RLHF that does not require human annotation. The researchers, however, acknowledge that experiments on a broader spectrum of natural language processing (NLP) tasks are needed to validate these findings, a path they intend to explore in future work.

Reinforced in Secret

Just a few weeks ago, Google DeepMind proposed another algorithm, reinforced self-training (ReST), for language modelling. It follows a similar philosophy of removing humans from the loop by letting the language model improve its own policy from its own generations. While ReST applies to various generative learning settings, it was demonstrated primarily on machine translation.

Comparing ReST with Online RL, a technique frequently used in RLHF, the results indicate that the latter performs on par with the former when only one “Grow” step is used.

However, when ReST incorporates multiple “Improve” steps, it clearly surpasses Online RL in terms of reward. The study also observed that Online RL suffered an 8-point drop in BLEU score on the validation set, hinting at potential reward hacking.

In contrast, ReST demonstrated an ability to improve reward model scores without adversely affecting other performance metrics, suggesting it may impose a lower “alignment tax” compared to Online RL methods.
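The Grow/Improve interplay described above can be condensed into a short, hedged sketch: one Grow step expands the training pool with the model’s own generations, and successive Improve steps filter that pool with a rising reward threshold and fine-tune on what survives. The helpers `sample_outputs`, `reward_fn` and `finetune_on` are placeholders for the paper’s components, not actual code from the study.

```python
def rest_training(model, prompts, sample_outputs, reward_fn, finetune_on,
                  grow_steps=1, improve_steps=3, thresholds=(0.5, 0.7, 0.9)):
    """Simplified ReST-style loop: Grow the dataset, then repeatedly filter and fine-tune."""
    for _ in range(grow_steps):
        # Grow: augment the dataset with samples drawn from the current model.
        pool = [(p, y) for p in prompts for y in sample_outputs(model, p, n=4)]

        for step in range(improve_steps):
            # Improve: keep only generations whose reward clears an increasing threshold...
            tau = thresholds[min(step, len(thresholds) - 1)]
            kept = [(p, y) for p, y in pool if reward_fn(p, y) >= tau]
            # ...and fine-tune the model on the surviving (prompt, output) pairs.
            model = finetune_on(model, kept)
    return model
```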

These research developments come in the wake of Project Gemini, which, according to DeepMind chief Demis Hassabis, is poised to unseat ChatGPT as the premier generative AI tool.

The model, anticipated for almost a year now, is reportedly being trained via the company’s pioneering technique: reinforcement learning. Even though few official details have reached the media yet, reinforcement learning with AI feedback is expected to play a huge role in the training process.

With the recent studies pointing towards the company’s interest in incorporating AI with reinforcement learning, we can’t wait to see what’s cooking in the research lab.


Tasmia Ansari

Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.