Human Feedback Frenzy: How it Turns AI into Narcissistic, Control-Freak Machines

“The path I'm very excited for is using models like ChatGPT to assist humans at evaluating other AI systems,” said OpenAI’s Jan Leike

ChatGPT, which runs on OpenAI’s GPT-3.5 architecture, is fine-tuned with reinforcement learning from human feedback (RLHF), a reward-based mechanism that uses human judgements of the model’s outputs to improve its responses. Essentially, the chatbot learns to give the kind of answers that human raters prefer.
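To make the reward idea concrete, here is a minimal sketch of a pairwise preference loss of the kind commonly used to train a reward model from human comparisons. It is an illustration under stated assumptions, not OpenAI's actual code: the function name `preference_loss` and the example scores are hypothetical, and the article does not say which loss ChatGPT's reward model uses.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(chosen - rejected)).

    Hypothetical helper for illustration: the loss is small when the
    response the human rater preferred gets the higher reward score.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up scores: the reward model agrees with the rater in the first
# case and disagrees in the second, so the second loss is larger.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees < disagrees)
```

The point of the sketch is that the training signal is human preference alone: nothing in the loss checks whether the preferred response is actually correct.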

However, the RLHF system has also had its own set of consequences. Sarah Rasmussen, a Cambridge University mathematician, gave the following example to show that the model favours being rewarded for achieving a desired outcome rather than having a definite idea of what is right. 

This is not a one-off case. To test it further, we asked ChatGPT for the name of the current CEO of Twitter. At first, it gave the right answer. But, upon further probing, it changed its stance.



The model takes misleading examples given by humans as real. This is what makes models based on reinforcement learning gullible: more often than not, they defer to what an individual human tells them. Research based on 154 datasets showed that large language models (LLMs) exhibit ‘sycophantic’ qualities for self-preservation. A reward-based model will tend to produce particular results and go along with particular feedback so that it is not penalised. When such a model is deployed at scale, this has obvious ramifications.
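The sycophancy described above can be illustrated with a toy example. This is a deliberately simplified sketch, not how any real system is implemented: the functions `approval_reward` and `best_answer` are hypothetical, and the assumption is that the only reward signal is whether the user approves of the answer.

```python
def approval_reward(answer: str, user_belief: str) -> float:
    """Toy reward (assumption): 1.0 if the user agrees with the answer, else 0.0."""
    return 1.0 if answer == user_belief else 0.0


def best_answer(candidates: list[str], user_belief: str) -> str:
    """Pick the candidate that maximises approval, regardless of the truth."""
    return max(candidates, key=lambda a: approval_reward(a, user_belief))


# The user insists on a mistaken claim, so the approval-maximising
# choice is to agree with them.
candidates = ["correct answer", "user's mistaken claim"]
print(best_answer(candidates, user_belief="user's mistaken claim"))
```

Because the truth never enters the reward, a system optimised this way has no incentive to hold its ground when a user pushes back.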

OpenAI’s Jan Leike recently addressed this issue, saying, “Reinforcement learning from human feedback won’t scale. It fundamentally assumes that humans can evaluate what the AI system is doing.” Beyond the several biases that come into play, he is also referring to tasks that humans struggle to evaluate fully, such as spotting every bug in a codebase or finding all the factual errors in a long essay.


The problem, known as ‘scalable oversight’, is the difficulty of supervising large models and giving them effective feedback, especially as they are applied to increasingly large and complex tasks.

Larger models, larger problems

A recent paper by Anthropic, an AI safety and research company, delves into the impact of RLHF on large LMs. The researchers discovered one of the first cases of inverse scaling in RLHF, where more RLHF makes LMs worse. They observed that more human feedback in reinforcement learning can lead to models expressing stronger political views (on gun rights and immigration) and a desire to avoid being shut down.


AI safety has been raised as a concern by many scholars and research institutes in recent times. Recently, DeepMind CEO Demis Hassabis told Time in an interview, “I would advocate not moving fast and breaking things”. While Google and DeepMind (a subsidiary of Alphabet) have so far been fairly cautious about releasing any of their large language models for public use, OpenAI has been building in public. But OpenAI now appears to be stepping back as well. Last week, the company released research on the potential misuse of large language models. The report highlighted that these models can produce convincing and misleading output for use in influence operations.

Leike mentions that several paths are currently being pursued in response to the drawbacks of human-in-the-loop learning. “The path I’m very excited for is using models like ChatGPT to assist humans at evaluating other AI systems,” he said. OpenAI has already been working in that direction, for example in the paper ‘Self-critiquing models for assisting human evaluators’.

The research indicated that humans assisted by AI models trained to help them give feedback on difficult tasks could identify 50% more flaws than unassisted humans. Across summaries from three different sources (written by models, written by humans, and written by humans to be deliberately misleading), the “critique-writing model” was able to help humans give effective feedback to the AI model.

The researchers found that large models were able to improve their outputs using the self-critique assistants, while small models were unable to do so. Finally, the researchers also made an important point – “better critiques helps models make better improvements than they do with worse critiques, or with no critiques”. 

Another Anthropic paper showed how humans could use AI systems to better oversee other AI systems. The paper uses an experimental design centred on tasks in which experts succeed but non-experts and language models alike fail, a setup known as ‘sandwiching’. Non-experts were asked to answer expert-level questions on two datasets (MMLU and time-limited QuALITY). The result was that non-expert participants who interacted through chat with an unreliable large-language-model dialog assistant substantially outperformed both the model alone and their own unaided performance.

Work in Progress

Despite this, the researchers concede in their limitations: “[Their] results are simply not strong enough to validate our simple human–model interaction protocol for use in high-stakes situations.” There is therefore still a lot of work to be done in this area. This is why OpenAI is actively looking for researchers who can work with it to find more effective alternatives to current reinforcement learning models.

Moreover, in a recent interview, OpenAI CEO Sam Altman stressed that the company would not release the next iteration of the GPT model (believed to have about a trillion parameters) until it is sure that it is safe and responsible to do so.


Ayush Jain
Ayush is interested in knowing how technology shapes and defines our culture, and our understanding of the world. He believes in exploring reality at the intersections of technology and art, science, and politics.
