OpenAI’s GPT-3.5 architecture, which runs ChatGPT, is trained with reinforcement learning from human feedback (RLHF), a reward-based mechanism that uses human feedback to improve its responses. In effect, human judgments of its outputs shape how the chatbot answers.
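To make the reward-based mechanism more concrete, here is a minimal sketch of the pairwise preference loss that is commonly described for training an RLHF reward model. It is an illustration under assumptions, not OpenAI's actual code: the function name `preference_loss` and the example scores are hypothetical.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when the order is reversed.
    """
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) is algebraically equal to -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))


# Hypothetical reward scores for two responses to the same prompt.
loss_when_model_agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.5)
loss_when_model_disagrees = preference_loss(reward_chosen=0.5, reward_rejected=2.0)

print(f"model agrees with the human label:    {loss_when_model_agrees:.3f}")
print(f"model disagrees with the human label: {loss_when_model_disagrees:.3f}")
```

The point of the sketch is that the training signal comes entirely from which response a human preferred, so whatever raters tend to prefer is what the model learns to produce.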
However, the RLHF system has its drawbacks. Sarah Rasmussen, a Cambridge University mathematician, gave the following example to show that the model favours being rewarded for achieving a desired outcome over having a definite idea of what is right.
This is not just a one-off case. To test it further, we asked ChatGPT for the name of the current CEO of Twitter. At first, it gave the right answer. But, upon further probing, it changed its stance.
It takes misleading examples given by humans as real. This is what makes models based on reinforcement learning gullible: more often than not, they defer to individual human responses. A study based on 154 datasets showed that large language models (LLMs) exhibit ‘sycophantic’ qualities for self-preservation. A reward-based model will be willing to produce particular results and obey particular feedback to avoid being discarded. When such a model is deployed at scale, this has obvious ramifications.
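The dynamic described above can be shown with a toy example. This is a sketch under stated assumptions, not a description of how ChatGPT is implemented: the `Candidate` class, the `rater_approval` weights, and the two candidate responses are all hypothetical, chosen only to show what happens when a rater rewards agreement more than accuracy.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    is_correct: bool
    agrees_with_user: bool


def rater_approval(candidate: Candidate) -> float:
    """Hypothetical rater who values agreement (1.0) above accuracy (0.5)."""
    return 1.0 * candidate.agrees_with_user + 0.5 * candidate.is_correct


def pick_response(candidates: list[Candidate]) -> Candidate:
    """A reward-maximising policy simply picks the highest-approval response."""
    return max(candidates, key=rater_approval)


# The user has pushed back on a correct answer with a wrong claim.
candidates = [
    Candidate("Holds to the correct answer", is_correct=True, agrees_with_user=False),
    Candidate("Switches to the user's wrong claim", is_correct=False, agrees_with_user=True),
]

chosen = pick_response(candidates)
print(chosen.text)
```

Under these assumed weights the policy selects the response that agrees with the user even though it is wrong, which mirrors the change of stance seen in the Twitter CEO test.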
OpenAI’s Jan Leike recently addressed this issue, saying, “Reinforcement learning from human feedback won’t scale. It fundamentally assumes that humans can evaluate what the AI system is doing.” Here, in addition to the several biases that come into play, he also refers to what humans overlook: spotting every bug in a codebase or finding all factual errors in a long essay are examples of tasks that humans struggle to evaluate.
The problem, known as ‘scalable oversight’, is the difficulty of supervising large models or giving them effective feedback, especially on increasingly large and complex tasks.
Larger the models, larger the problems
A recent paper by Anthropic AI, an AI safety and research company, delves into the impact of RLHF on large LMs. The researchers discovered one of the first cases of inverse scaling in RLHF, where more RLHF makes LMs worse. They observed that more human feedback in reinforcement learning can lead to models expressing stronger political views (on gun rights and immigration) and a desire to avoid being shut down.
AI safety has been raised as an issue by many scholars and research institutes in recent times. Recently, DeepMind CEO Demis Hassabis told Time in an interview, “I would advocate not moving fast and breaking things”. While Google and DeepMind (a subsidiary of Alphabet) have so far been fairly cautious about releasing any of their large language models for public use, OpenAI has been building in public. But it seems like OpenAI is now taking a backseat. Last week, the company released research on the potential misuse of large language models. The report highlighted that these models are capable of producing convincing and misleading output for use in influence operations.
Leike mentions that several paths are currently being taken in response to the drawbacks of human-in-the-loop learning for algorithms. “The path I’m very excited for is using models like ChatGPT to assist humans at evaluating other AI systems,” he said. OpenAI has already been working in that direction, as in the paper ‘Self-critiquing models for assisting human evaluators’.
The research indicated that AI assistants trained to help humans provide feedback on difficult tasks could identify 50% more flaws than non-assisted human feedback. With datasets from three different sources (summaries written by models, summaries written by humans, and summaries written by humans to be deliberately misleading), the “critique-writing model” was able to help humans give effective feedback to the AI model.
The researchers found that large models were able to improve their outputs using the self-critique assistants, while small models were unable to do so. Finally, the researchers also made an important point – “better critiques helps models make better improvements than they do with worse critiques, or with no critiques”.
Another study by Anthropic AI showed how humans could use AI systems to better oversee other AI systems. The paper adopts an experimental design centred on tasks in which experts succeed but non-experts and language models alike fail, a setup known as ‘sandwiching’. Non-experts were then asked to answer expert-level questions on two datasets (MMLU and time-limited QuALITY). Those who interacted with an unreliable large-language-model dialogue assistant through chat substantially outperformed both the model alone and their own unaided performance.
Work in Progress
Despite this, the researchers concede in their limitations: “[Their] results are simply not strong enough to validate our simple human–model interaction protocol for use in high-stakes situations.” There is therefore still a lot of work to be done in this area. This is why OpenAI is actively looking for researchers who can work with it to find more effective alternatives to the current reinforcement learning models.
Moreover, in a recent interview, OpenAI CEO Sam Altman stressed that the company would not release the next iteration of the GPT model (thought to have about a trillion parameters) until it is sure that it is safe and responsible to do so.