OpenAI introduced custom instructions for ChatGPT yesterday. The feature lets users add standing requirements that the chatbot considers in every conversation going forward, so they don’t have to repeat themselves. Will this update address the recent criticism of the chatbot’s poor responses?
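Conceptually, custom instructions work like a preamble prepended to every conversation. The sketch below illustrates the idea using the message format of the OpenAI Chat Completions API; the `build_messages` helper and the instruction text are hypothetical, not OpenAI’s actual implementation.

```python
# Sketch: custom instructions as a persistent system message that is
# prepended to every conversation. Helper and wording are illustrative.

CUSTOM_INSTRUCTIONS = (
    "I am a Python developer. Keep answers concise and include code samples."
)

def build_messages(user_prompt: str) -> list[dict]:
    """Prepend the saved instructions so the user never repeats them."""
    return [
        {"role": "system", "content": CUSTOM_INSTRUCTIONS},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("How do I reverse a list?")
```

Each new conversation would start from the same two-part structure, so the user’s preferences carry over without being retyped.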
For months, users have taken to different platforms to complain about the dipping performance of GPT-4. There is a continuous discussion on OpenAI’s forum over the many ways that GPT’s performance has dropped. The company’s VP of product, Peter Welinder, however, dismissed these claims and tweeted, “No we haven’t made GPT-4 dumber. Quite the opposite.”
Regardless, the poor performance could be causing a decline in the number of users on the platform.
Researchers test these claims
To study these assertions systematically, researchers at Stanford University and UC Berkeley explored how ChatGPT’s behaviour has changed over time. They published a paper on Tuesday confirming that GPT’s responses to questions have indeed changed. The paper assesses the chatbot’s abilities in maths, code generation, answering sensitive questions and visual reasoning at two time points only a few months apart, in March and June this year. To the surprise of few, the findings corroborate that GPT-4’s performance has decreased in most of these areas.
In maths, the accuracy of GPT-4’s responses dropped from 97.6% to 2.4%. In code generation, the share of directly executable outputs fell from 52% to 10%, with new errors appearing in June. The share of sensitive questions answered also declined, from 21% in March to 5% in June. Only in visual reasoning did overall performance improve slightly, with the exact-match rate rising by 2% from March to June.
Interestingly, GPT-3.5 improved in maths even as its successor declined. On the whole, GPT-3.5 also improved in answering sensitive questions and in visual reasoning from its previous benchmark.
Response to the paper
The study documents how the model’s behaviour varied over a short period but does not explain why. The paper notes that GPT-4 no longer benefits from the popular chain-of-thought technique, in which the model is prompted to work through intermediate steps before answering, a practice that usually improves accuracy significantly. In June, GPT-4 tended to skip the intermediate steps and answer incorrectly.
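For readers unfamiliar with the technique, the contrast below sketches a direct prompt versus a chain-of-thought prompt. The exact wording is an assumption for illustration, not taken from the paper.

```python
# Illustrative contrast: a direct prompt vs. a chain-of-thought prompt.
# The phrasing here is hypothetical, not the paper's exact test prompts.

direct_prompt = "Is 17077 a prime number? Answer yes or no."

cot_prompt = (
    "Is 17077 a prime number? Think step by step: "
    "first check divisibility by small primes, "
    "then explain your reasoning before giving a final yes or no."
)
```

The chain-of-thought variant asks the model to show intermediate steps; the study found the June version of GPT-4 often skipped those steps even when asked.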
Experts assume that OpenAI is continuously pushing changes and fine-tuning the models, with no public method to evaluate whether the models are improving or regressing as a result. Meanwhile, others point to the inversely proportional relationship between alignment and usefulness, arguing that increased alignment, along with attempts to make the models faster and cheaper, is contributing to the errors.
Only behaviour, not GPT-4’s capabilities
One group of experts questioned the very basis of the paper. Simon Willison tweeted that he found the paper relatively unconvincing. He further said, “A decent portion of their criticism involves whether or not code output is wrapped in Markdown backticks.” He also finds other problems with the paper’s methodology. “It looks to me like they ran temperature 0.1 for everything,” he said. “It makes the results slightly more deterministic, but very few real-world prompts are run at that temperature, so I don’t think it tells us much about real-world use cases for the models.”
Arvind Narayanan, a CS professor at Princeton, also argues that the paper is being misread: to say GPT-4 has degraded over time is an oversimplification of what it found. He questioned the researchers’ methods too, noting that a model’s capabilities are not the same as its behaviour.
At the end of Arvind’s analysis, he says, “In short, the new paper doesn’t show that GPT-4 capabilities have degraded. But it is a valuable reminder that the kind of fine tuning that LLMs regularly undergo can have unintended effects, including drastic behaviour changes on some tasks. Finally, the pitfalls we uncovered are a reminder of how hard it is to quantitatively evaluate language models.”
Assessing language models becomes even harder when companies like OpenAI take a closed approach to AI. Sam Altman refuses to reveal the training data, code, neural network weights or even the architecture of the model, leaving the rest of us to speculate and piece together results from anonymous sources. This leaves researchers groping in the dark to define the properties of the system they are trying to evaluate.
Learn to Prompt Better
Good prompts are the antidote to whatever ails GPT. The model arguably may have gotten worse over time, but the surest way to get the responses you need is to give it the right prompts. Multiple online courses walk you through prompts for specific tasks, and understanding how each language model is trained improves your odds of formulating better ones. Simple but effective practices include being specific, asking for a step-by-step explanation, including (limited) context, and specifying tone, style and examples.
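The practices above can be assembled mechanically. This minimal sketch builds a prompt from a task, optional context, a tone and an example; the `build_prompt` helper and its fields are hypothetical, purely for illustration.

```python
# Sketch: assembling a prompt from the practices above (specificity,
# step-by-step request, limited context, tone, example). Hypothetical helper.

def build_prompt(task, context="", tone="neutral", example=""):
    parts = [f"Task: {task}", "Explain your answer step by step."]
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Tone: {tone}")
    if example:
        parts.append(f"Example of the desired output: {example}")
    return "\n".join(parts)

prompt = build_prompt(
    task="Summarise this changelog in three bullet points",
    context="Audience: non-technical managers",
    tone="plain and friendly",
    example="- Fixed login bug affecting mobile users",
)
```

A template like this keeps each prompt specific and consistently structured, which is most of what the courses teach.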