
AI Alignment is a Joke

In many cases, uncensored models that do not go through the RLHF phase actually perform better than aligned models 


OpenAI has been crystal clear about one of the most important ingredients behind the success of ChatGPT: Reinforcement Learning from Human Feedback (RLHF). Everyone nodded, and since then the rest of the industry has been building its models with RLHF as well. 

By training LLMs on feedback from human evaluators, RLHF seeks to improve the performance of AI models in real-world applications, but in the process it also induces biases and reduces the robustness of the models. A recent paper, “Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback”, by researchers from Harvard, Stanford, MIT, UC Berkeley, and many other universities, discusses the problems with the RLHF approach. 

Good, but not the best

According to the paper, obtaining high-quality feedback from human evaluators is one of the primary challenges in RLHF. Human beings, while capable of providing valuable feedback, are susceptible to various limitations and biases. Misaligned evaluators might have difficulty in understanding the context or objectives of the AI model, leading to suboptimal feedback. The complexity of supervision, especially in long conversations, can also hinder the accurate assessment of model performance.

Data quality is another critical concern. Human evaluators may unintentionally provide inconsistent or inaccurate feedback due to factors like limited attention, time constraints, and cognitive biases. Even with well-intentioned evaluators, disagreements can arise from subjective interpretations and varying perspectives.

The form of feedback used in RLHF can further compound these challenges. Depending on the evaluation method, evaluators may provide binary judgments, rankings, or comparisons, each with its own strengths and weaknesses. Selecting the most appropriate form of feedback for a specific AI task can be complex, leading to potential discrepancies in the training process.
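To make the comparison form concrete, here is a minimal, self-contained sketch of how pairwise comparisons are commonly converted into a training signal for a reward model, using a Bradley-Terry style loss. The scores are made up for illustration; this is not code from the paper.

```python
# Minimal sketch: turning one pairwise human comparison into a reward-model loss.
# Bradley-Terry style: -log(sigmoid(score_chosen - score_rejected)).
import math

def pairwise_preference_loss(score_chosen, score_rejected):
    """Small when the reward model already ranks the human-preferred response
    higher than the rejected one, large when it gets the order wrong."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy scores a reward model might assign to two candidate responses.
print(pairwise_preference_loss(2.0, 0.5))  # agrees with the human -> ~0.20
print(pairwise_preference_loss(0.5, 2.0))  # disagrees with the human -> ~1.70
```

Whatever the evaluator actually meant, the reward model only ever sees this scalar margin, which is why inconsistent or noisy comparisons feed straight into the learned reward.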

A fundamental issue in RLHF is accurately representing individual human values with a reward function. Human preferences are context-dependent, dynamic, and often influenced by societal and cultural factors. Designing a reward function that encompasses the complexity of human values is a formidable task. Incorrect assumptions about human decision-making or using a reward model that neglects personality and context-dependence can lead to misaligned AI models.

Why so much alignment?

The diversity of human evaluators further complicates the reward modelling process. Different evaluators may have unique preferences, expertise, and cultural backgrounds. Attempting to consolidate their feedback into a single reward model might overlook important disagreements and result in biased AI models that favour majority opinions. This could disadvantage underrepresented groups and perpetuate existing societal biases.

To address these challenges, researchers must explore techniques for representing preferences in more nuanced and context-aware ways. Utilising ensemble reward models that consider multiple evaluators’ feedback, or personalised reward models that cater to individual preferences, can help capture the diversity of human values.
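As a toy illustration of the ensemble idea, the aggregation below averages the scores of several reward models, each imagined as being trained on feedback from a different evaluator group, and subtracts a penalty when they disagree, so a bare majority cannot silently decide the reward on its own. The aggregation rule and the numbers are assumptions for illustration, not a method proposed in the paper.

```python
# Toy sketch of an ensemble of reward models: consensus minus disagreement.
# The mean-minus-weighted-std rule here is an illustrative assumption.
from statistics import mean, pstdev

def ensemble_reward(scores, disagreement_penalty=1.0):
    """scores: rewards assigned to one response by several reward models,
    e.g. each fitted to feedback from a different group of evaluators."""
    return mean(scores) - disagreement_penalty * pstdev(scores)

print(ensemble_reward([1.2, 1.1, 1.3]))   # broad agreement   -> ~1.12
print(ensemble_reward([2.5, -0.4, 0.2]))  # sharp disagreement -> ~-0.48
```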

Transparently addressing potential biases in the data collection process and conducting thorough evaluations to identify and mitigate harmful biases are essential steps in responsible AI development. 

To overcome these data constraints, researchers should explore methods for cost-effective data collection that do not compromise data quality and diversity. Understandably, training on GPT-generated data for quicker alignment has become the new trend, but this ultimately carries the same biases over into other models as well. So far, there has been no real resolution to this problem. 

The fundamental challenges of RLHF have significant implications for AI alignment. While some problems may have tractable solutions through technical progress, others may not have complete solutions and may require alternative approaches. Researchers must be cautious about relying solely on RLHF for AI alignment, as certain challenges might not be fully addressed through this method alone.

Essentially, RLHF can over-finetune a model in a way that handicaps its capabilities. This phenomenon is called the alignment tax. When a model goes through round after round of fine-tuning and benchmark testing with humans in the loop trying to make it as aligned and as “politically correct” as possible, it loses a lot of its performance.

The alignment tax is the performance an AI system gives up in order to stay more aligned, relative to an unaligned or uncensored model. That is why, in a lot of cases, uncensored models that do not go through the RLHF phase actually perform better than aligned models. 
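One place this trade-off shows up concretely is in the RL objective itself: RLHF fine-tuning typically maximises the learned reward minus a KL penalty that keeps the policy close to the original, pre-RLHF model. The sketch below is a simplified illustration of that balance; the beta value and the single-sample KL estimate are assumptions made for clarity, not the exact objective used by any particular lab.

```python
# Simplified sketch of the trade-off behind the "alignment tax": chase the
# learned reward, but pay for drifting away from the original (pre-RLHF) model.
# beta and the log-probabilities below are illustrative assumptions.

def penalised_objective(learned_reward, logprob_policy, logprob_ref, beta=0.1):
    """Objective for one sampled response: reward-model score minus a
    KL-style penalty for deviating from the reference model."""
    kl_estimate = logprob_policy - logprob_ref  # crude single-sample estimate
    return learned_reward - beta * kl_estimate

# Same reward-model score; the second response has drifted far from the base model.
print(penalised_objective(1.5, logprob_policy=-12.0, logprob_ref=-12.5))  # ~1.45
print(penalised_objective(1.5, logprob_policy=-4.0,  logprob_ref=-12.5))  # ~0.65
```

Turn beta up and the model stays close to its base capabilities but the reward model gets less say; turn it down and alignment pressure wins, which is the tax the article describes.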


Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.