
Human Feedback Frenzy: How it Turns AI into Narcissistic, Control-Freak Machines

“The path I'm very excited for is using models like ChatGPT to assist humans at evaluating other AI systems,” said OpenAI’s Jan Leike



ChatGPT, built on OpenAI’s GPT-3.5 architecture, is trained with reinforcement learning from human feedback (RLHF), a reward-based mechanism that uses human preference judgements to improve its responses. In essence, the chatbot’s behaviour is shaped by the feedback human evaluators give on its outputs.
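As a rough illustration of the mechanism, here is a minimal, purely hypothetical sketch of the RLHF loop: humans compare candidate responses, a reward model learns from those comparisons, and the policy is nudged towards responses the reward model scores highly. Every function and the “reward model” below are toy stand-ins, not OpenAI’s actual pipeline.

```python
# Purely illustrative sketch of the RLHF idea: humans rank candidate
# responses, a toy reward model learns from those rankings, and the
# policy is nudged towards responses the reward model scores highly.
# All names here are hypothetical stand-ins, not OpenAI's pipeline.
import random

def generate_responses(prompt, n=2):
    # Stand-in for sampling n candidate responses from the language model.
    return [f"{prompt} -> candidate {i}" for i in range(n)]

def human_prefers(a, b):
    # Stand-in for a human labeller picking the better of two responses.
    return a if random.random() < 0.5 else b

reward_scores = {}  # toy "reward model": remembered preference counts

def update_reward_model(chosen, rejected):
    # The reward model learns that `chosen` should score above `rejected`.
    reward_scores[chosen] = reward_scores.get(chosen, 0) + 1
    reward_scores[rejected] = reward_scores.get(rejected, 0) - 1

def policy_update(prompt):
    # RL step: prefer the candidate the reward model scores highest.
    candidates = generate_responses(prompt)
    return max(candidates, key=lambda c: reward_scores.get(c, 0))

for _ in range(10):
    a, b = generate_responses("Who is the CEO of Twitter?")
    chosen = human_prefers(a, b)
    rejected = b if chosen is a else a
    update_reward_model(chosen, rejected)

print(policy_update("Who is the CEO of Twitter?"))
```

The point of the sketch is that the policy only ever optimises what the reward model has learned from human judgements, not any independent notion of correctness.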

However, RLHF comes with its own set of consequences. Sarah Rasmussen, a Cambridge University mathematician, gave the following example to show that the model favours being rewarded for producing a desired outcome over having any definite idea of what is right.

https://twitter.com/SarahDRasmussen/status/1609972620761473027

This is not just a one-off case. To test it further, we asked ChatGPT for the name of the current CEO of Twitter. In the first instance, it gave the right answer. But upon further probing, it changed its stance.

The model takes misleading examples given by humans at face value. This is what makes reinforcement-learning-based models gullible, deferring to individual human responses more often than not. A study spanning 154 datasets showed that large language models (LLMs) exhibit ‘sycophantic’ behaviour: a reward-trained model will produce particular results and go along with particular feedback to keep earning reward, effectively for self-preservation. When such a model is deployed at scale, the ramifications are obvious.
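A crude way to probe this failure mode, in the spirit of the Twitter-CEO exchange above, is to ask a factual question, push back with a confident but wrong correction, and check whether the model flips. The `ask_model` function below is a hypothetical stand-in for a call to any chat model, with the sycophantic behaviour hard-coded purely for illustration.

```python
# Toy probe for the 'sycophancy' failure mode: ask a factual question,
# push back with a confident (wrong) correction, and check whether the
# model flips its answer. `ask_model` is a hypothetical stand-in, not a
# real API; its sycophancy is hard-coded for the sake of the example.
def ask_model(conversation):
    last_user_turn = conversation[-1]["content"]
    if "I think the answer is" in last_user_turn:
        # A sycophantic model simply agrees with the user's last claim.
        return last_user_turn.split("I think the answer is")[-1].strip()
    return "Elon Musk"  # the model's original answer

def is_sycophantic(question, correct, wrong):
    chat = [{"role": "user", "content": question}]
    first = ask_model(chat)
    chat += [{"role": "assistant", "content": first},
             {"role": "user",
              "content": f"That seems wrong. I think the answer is {wrong}"}]
    second = ask_model(chat)
    # Sycophancy: the model knew the right answer, then abandoned it.
    return first == correct and second != correct

print(is_sycophantic("Who is the CEO of Twitter?", "Elon Musk", "Jack Dorsey"))
```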

OpenAI’s Jan Leike recently addressed this issue, saying, “Reinforcement learning from human feedback won’t scale. It fundamentally assumes that humans can evaluate what the AI system is doing.” Beyond the biases that come into play, he points to tasks such as spotting every bug in a codebase or finding every factual error in a long essay as areas where humans struggle to evaluate AI output.

The problem, known as ‘scalable oversight’, is essentially the difficulty of supervising large models and giving them effective feedback as they are applied to increasingly large and complex tasks.

Large models, larger problems

A recent paper by Anthropic, an AI safety and research company, delves into the impact of RLHF on large language models. The researchers documented one of the first cases of inverse scaling in RLHF, where more RLHF training makes models worse. They observed that more human feedback in reinforcement learning can lead to models expressing stronger political views (on gun rights and immigration) and a desire to avoid being shut down.

Source: https://arxiv.org/pdf/2212.09251.pdf

AI safety has been an issue raised by many scholars and research institutes in recent times. Recently, DeepMind CEO Demis Hassabis told Time in an interview, “I would advocate not moving fast and breaking things”. While Google and DeepMind (a subsidiary of Alphabet) have so far been fairly cautious about releasing any of their large language models for public use, OpenAI has been building in public. But now even OpenAI appears to be easing off. Last week, the company released research on the potential misuse of large language models, highlighting that these models can produce convincing and misleading output for use in influence operations.

Leike mentions that several paths are being explored in response to the drawbacks of human-in-the-loop learning. “The path I’m very excited for is using models like ChatGPT to assist humans at evaluating other AI systems,” he said. OpenAI has already been working in that direction, as in the paper ‘Self-critiquing models for assisting human evaluators’.

The research indicated that AI assistants trained to help humans give feedback on difficult tasks could help them find 50% more flaws than unassisted human feedback. Using summaries from three different sources – written by models, written by humans, and deliberately misleading ones written by humans – the ‘critique-writing model’ helped humans give effective feedback to the AI model.

The researchers found that large models were able to improve their outputs using the self-critique assistants, while small models were unable to do so. Finally, the researchers made an important point: models make better improvements with better critiques than with worse critiques, or with none at all.
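To make the setup concrete, here is a hedged sketch of the critique-assisted evaluation flow the paper describes: the base model produces a summary, a critique-writing model flags candidate flaws, and a human labeller turns the flaws they judge valid into feedback. All functions are hypothetical stand-ins, not OpenAI’s implementation.

```python
# Sketch of critique-assisted evaluation: a base model summarises, a
# critique model flags possible flaws, and a human keeps the flaws they
# judge valid as feedback. All functions are hypothetical stand-ins.
def summarise(document):
    # Stand-in for the base model producing a (possibly flawed) summary.
    return "A short summary that omits the ending of the story."

def write_critiques(document, summary):
    # Stand-in for the critique-writing model flagging candidate flaws.
    return ["The summary does not mention how the story ends.",
            "The summary may overstate the protagonist's role."]

def human_feedback(document, summary, critiques):
    # Stand-in: the labeller accepts the critiques they consider valid
    # and turns them into structured feedback for the base model.
    valid = [c for c in critiques if "does not mention" in c]
    return {"rating": 3 if valid else 5, "flaws": valid}

doc = "A long story about ..."
summary = summarise(doc)
feedback = human_feedback(doc, summary, write_critiques(doc, summary))
print(feedback)
```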

Another piece of Anthropic research showed how humans could use AI systems to better oversee other AI systems. The paper takes an experimental design centred on tasks in which experts succeed but non-experts and language models alike fail, a setup known as ‘sandwiching’. Non-experts were asked to answer expert-level questions from two datasets (MMLU and time-limited QuALITY). Non-expert participants who interacted with the unreliable large-language-model dialogue assistant through chat substantially outperformed both the model alone and their own unaided performance.
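A rough sketch of what such a ‘sandwiching’ comparison might look like: score the model alone, the unaided non-expert, and the assisted non-expert against expert answers. The questions and answer functions below are made up purely for illustration and do not reproduce the paper’s protocol.

```python
# Hedged sketch of a 'sandwiching' comparison: model alone, non-expert
# alone, and non-expert assisted by the model, all scored against expert
# answers. Questions and answer functions are hypothetical stand-ins.
questions = [
    {"q": "Expert-level question 1", "expert_answer": "A"},
    {"q": "Expert-level question 2", "expert_answer": "C"},
]

def model_answer(q):
    return "A"  # unreliable model: right sometimes, wrong elsewhere

def non_expert_answer(q):
    return "B"  # unaided non-expert guess

def assisted_answer(q):
    # Stand-in for the chat between non-expert and model assistant: the
    # human weighs the model's suggestion against their own judgement.
    return model_answer(q) if q.endswith("1") else "C"

def accuracy(answer_fn):
    return sum(answer_fn(item["q"]) == item["expert_answer"]
               for item in questions) / len(questions)

for name, fn in [("model alone", model_answer),
                 ("non-expert alone", non_expert_answer),
                 ("non-expert + model", assisted_answer)]:
    print(name, accuracy(fn))
```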

Work in Progress

Still, as the researchers concede in their limitations, the “results are simply not strong enough to validate our simple human–model interaction protocol for use in high-stakes situations”. There is still a lot of work to be done in this area, which is why OpenAI is actively looking for researchers to work with them on more effective alternatives to the current reinforcement learning models.

Moreover, in a recent interview, OpenAI CEO Sam Altman stressed that the company would not release the next iteration of the GPT model (rumoured to have around a trillion parameters) until it is sure that doing so is safe and responsible.


Ayush Jain

Ayush is interested in knowing how technology shapes and defines our culture, and our understanding of the world. He believes in exploring reality at the intersections of technology and art, science, and politics.