
Reinforcement Learning Craves Less Human, More AI

Google Research has proposed reinforcement learning from AI feedback (RLAIF) to train models with less reliance on human labels


Human feedback has proven essential in training machine learning algorithms. Back in 2019, the director of research at IBM emphasised that while reinforcement learning holds promise, relying solely on this approach can be challenging due to its heavy dependence on trial and error.

Reinforcement learning is Google DeepMind’s preferred technique for training its well-known systems such as AlphaGo and AlphaStar, and its human-in-the-loop variant, reinforcement learning from human feedback (RLHF), has proven particularly effective in aligning large language models (LLMs) with human preferences.

Furthermore, most large language models, including OpenAI’s ChatGPT, have been trained with this reward-based approach. Even Meta’s PyTorch recently upgraded its RLHF components so that developers with limited RL knowledge can more easily build an RLHF training loop.
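To make the idea concrete, below is a minimal, illustrative sketch of what a reward-driven policy update looks like in plain PyTorch. It is not PyTorch’s actual RLHF tooling: the `policy`, `reward_model` and `tokenizer` objects are assumed to be a HuggingFace-style causal LM, a learned scoring model and its tokenizer, and the update shown is a simple REINFORCE-style step rather than the PPO objective used in production systems such as ChatGPT.

```python
import torch
import torch.nn.functional as F

def rlhf_step(policy, reward_model, tokenizer, prompts, optimizer):
    """One simplified RLHF update: sample, score with a reward model, update the policy."""
    # 1. Sample responses from the current policy (assumes a HF-style generate()).
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    responses = policy.generate(**inputs, max_new_tokens=64, do_sample=True)

    # 2. Score the sampled sequences with the learned reward model (hypothetical module).
    rewards = reward_model(responses).squeeze(-1)              # shape: [batch]

    # 3. Log-probability of the sampled tokens under the current policy.
    #    (For simplicity, prompt tokens are included; real setups mask them out.)
    logits = policy(responses).logits[:, :-1, :]
    labels = responses[:, 1:]
    logp = F.log_softmax(logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    seq_logp = logp.sum(dim=-1)                                 # shape: [batch]

    # 4. REINFORCE-style loss: raise the likelihood of high-reward generations.
    loss = -(rewards.detach() * seq_logp).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```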

However, a prime hurdle lies in gathering high-quality human preference labels. This is where reinforcement learning from AI feedback (RLAIF) comes into the picture: a framework from Google Research for training models with reduced reliance on human annotation. The researchers found that the two approaches perform similarly, with RLHF holding a slight but not significant edge.
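As a rough sketch of how AI feedback replaces the human annotator, the snippet below asks an off-the-shelf LLM to pick the better of two candidate summaries and collects the results as preference pairs; those pairs can then train a reward model exactly as in RLHF. The `ask_llm` helper and the prompt wording are illustrative assumptions, not the paper’s actual setup.

```python
PREFERENCE_PROMPT = """A good summary is concise, accurate and coherent.
Text: {text}
Summary A: {a}
Summary B: {b}
Which summary is better? Answer with "A" or "B"."""

def ai_preference_label(ask_llm, text, summary_a, summary_b):
    """Return 0 if the LLM labeler prefers summary A, 1 if it prefers B."""
    answer = ask_llm(PREFERENCE_PROMPT.format(text=text, a=summary_a, b=summary_b))
    return 0 if answer.strip().upper().startswith("A") else 1

def build_preference_dataset(ask_llm, examples):
    """examples: iterable of (text, summary_a, summary_b) triples."""
    dataset = []
    for text, a, b in examples:
        choice = ai_preference_label(ask_llm, text, a, b)
        dataset.append({
            "text": text,
            "chosen": a if choice == 0 else b,
            "rejected": b if choice == 0 else a,
        })
    return dataset
```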

To Human or Not

One of the areas where ChatGPT has not been able to thrive yet is summarising documents. Since the chatbot debuted late last year, researchers have been exploring new methods for generating concise summaries via its more potent GPT-4-based paid version.

Interestingly, the latest Google study found that human evaluators preferred RLAIF-generated summaries over the baseline supervised fine-tuning (SFT) model 71% of the time and RLHF-generated summaries 73% of the time. In a head-to-head comparison, however, the two were equally favoured, with a roughly 50% win rate. Summaries from both methods were also preferred over human-written reference summaries, with RLAIF at 79% and RLHF at 80%.

Meanwhile, one important caveat is that both recipes tend to produce longer summaries than the SFT policy, which could contribute to the perceived quality gains. Further analysis, however, showed that even after controlling for length, both approaches, with and without AI feedback, still outperformed the SFT policy by a similar margin.

While the community continues to grapple with summarisation, RLAIF appears to be a viable alternative to RLHF that does not require human annotation. The researchers, however, acknowledge that experiments on a broader spectrum of natural language processing (NLP) tasks are needed to validate these findings, a path they intend to explore in future work.

Reinforced in Secret

Just a few weeks ago, Google DeepMind proposed another algorithm, reinforced self-training (ReST), for language modelling. It follows a similar philosophy of removing humans from the loop by letting the language model improve its own policy from its own generations. While ReST applies to various generative learning settings, it was demonstrated primarily on machine translation.

Comparing ReST with Online RL, a technique frequently used in RLHF, the results indicate that the latter performs on par with the former when only one “Grow” step is used.

However, when ReST incorporates multiple “Improve” steps, it clearly surpasses Online RL in terms of reward. The study also observed that Online RL suffered an 8-point drop in BLEU score on the validation set, hinting at potential reward hacking.

In contrast, ReST demonstrated an ability to improve reward model scores without adversely affecting other performance metrics, suggesting it may impose a lower “alignment tax” compared to Online RL methods.
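The Grow/Improve interplay described above can be condensed into a short, hedged sketch: one Grow step expands the training pool with the model’s own generations, and successive Improve steps filter that pool with a rising reward threshold and fine-tune on what survives. The helpers `sample_outputs`, `reward_fn` and `finetune_on` are placeholders for the paper’s components, not actual code from the study.

```python
def rest_training(model, prompts, sample_outputs, reward_fn, finetune_on,
                  grow_steps=1, improve_steps=3, thresholds=(0.5, 0.7, 0.9)):
    """Simplified ReST-style loop: Grow the dataset, then repeatedly filter and fine-tune."""
    for _ in range(grow_steps):
        # Grow: augment the dataset with samples drawn from the current model.
        pool = [(p, y) for p in prompts for y in sample_outputs(model, p, n=4)]

        for step in range(improve_steps):
            # Improve: keep only generations whose reward clears an increasing threshold...
            tau = thresholds[min(step, len(thresholds) - 1)]
            kept = [(p, y) for p, y in pool if reward_fn(p, y) >= tau]
            # ...and fine-tune the model on the surviving (prompt, output) pairs.
            model = finetune_on(model, kept)
    return model
```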

These research developments come in the wake of Project Gemini, which, according to DeepMind chief Demis Hassabis, is poised to unseat ChatGPT as the premier generative AI tool.

The model, anticipated for almost a year now, is reportedly being trained via the company’s pioneering technique: reinforcement learning. Even though few official details have reached the media yet, reinforcement learning with AI feedback is expected to play a huge role in the training process.

With the recent studies pointing towards the company’s interest in incorporating AI with reinforcement learning, we can’t wait to see what’s cooking in the research lab.


Tasmia Ansari

Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.