OpenAI's chatbot ChatGPT excels at a plethora of tasks, such as script writing, explaining complex topics, debugging and code explanation, but performs poorly when it comes to maths.
Stanford University and the University of California, Berkeley, recently published a research paper stating that large language models (LLMs) can perform simple maths operations when the numbers are small, but struggle with large numbers, suggesting that LLMs have not learned the underlying rules needed to perform these arithmetic operations. It further mentioned that even with GPT-4's improvements on the MATH dataset, errors largely occur due to arithmetic and calculation mistakes.
Rival company Google has acknowledged the issue in LLMs and stepped in to teach models like ChatGPT to reason better algorithmically. The work by Google researchers, titled ‘Teaching language models to reason algorithmically’, takes the in-context learning approach and introduces a prompting strategy better suited to algorithmic reasoning.
In-context learning is akin to teaching someone a new skill by guiding them through the process step by step instead of overwhelming them with all the instructions upfront. The term refers to a model’s ability to perform a task after seeing a few examples of it within the prompt context.
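As a concrete illustration, a few-shot prompt simply places worked examples ahead of the new query, and the model is expected to infer the task from the pattern. The sketch below shows only the prompt construction, with no model call; the helper name and the Q/A format are illustrative, not taken from either paper:

```python
def build_few_shot_prompt(examples, query):
    """Format (input, output) demonstration pairs followed by the new query."""
    lines = [f"Q: {q}\nA: {a}" for q, a in examples]
    lines.append(f"Q: {query}\nA:")  # the model completes this last answer
    return "\n\n".join(lines)

# Two in-context demonstrations, then an unseen query.
examples = [("2 + 3", "5"), ("7 + 6", "13")]
prompt = build_few_shot_prompt(examples, "48 + 29")
print(prompt)
```

Nothing in the prompt states "add the numbers"; the task is conveyed entirely by the examples in context.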
They also presented a prompting technique that enables general-purpose language models to generalise strongly to maths problems more difficult than the ones in the prompt. The technique builds upon other rationale-augmented approaches (e.g., scratchpad and chain-of-thought). Lastly, they demonstrated that a model can ‘reliably execute algorithms on out-of-distribution examples with the right prompts’.
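The rationale-augmented idea can be made concrete with addition: a scratchpad-style prompt spells out the computation digit by digit with explicit carries rather than asking for the answer in one leap. The sketch below generates such a trace in Python; the exact wording of each step is our assumption, not the prompt format used in the Google paper:

```python
def addition_trace(a: int, b: int):
    """Return (sum, steps): digit-by-digit addition with explicit carries,
    the kind of worked trace that scratchpad-style prompts demonstrate."""
    xs, ys = str(a)[::-1], str(b)[::-1]  # least-significant digit first
    carry, digits, steps = 0, [], []
    for i in range(max(len(xs), len(ys))):
        d1 = int(xs[i]) if i < len(xs) else 0
        d2 = int(ys[i]) if i < len(ys) else 0
        total = d1 + d2 + carry
        steps.append(f"{d1} + {d2} + carry {carry} = {total}: "
                     f"write {total % 10}, carry {total // 10}")
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))  # leading carry, if any
    return int("".join(reversed(digits))), steps

result, steps = addition_trace(487, 958)
print(result)  # 1445
for s in steps:
    print(s)
```

Serialising these intermediate steps into the prompt is what lets the model follow the carrying rule instead of guessing the sum outright.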
A Below Average Student
ChatGPT has become worse at performing certain basic maths operations, even as it gets better at other things. The same study highlighted that the high-profile chatbot’s performance had deteriorated compared with March.
Researchers said the deterioration is due to an AI phenomenon known as drift, where attempts to improve one part of complex models make other parts of the models worse.
To track performance, James Zou, a Stanford professor affiliated with the school’s AI lab, and his colleagues Matei Zaharia and Lingjiao Chen fed ChatGPT 1,000 different numbers. In March, the paid GPT-4 version correctly identified whether a number was prime 84% of the time. By June, the success rate had dropped to 51%.
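Grading such a test is mechanical: primality can be checked by trial division, and accuracy is the fraction of labels that match the ground truth. A minimal sketch of that grading logic (the helper names are ours, not the study's code):

```python
def is_prime(n: int) -> bool:
    """Trial division up to sqrt(n); ample for numbers of this size."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def accuracy(model_labels, numbers):
    """Fraction of is-it-prime answers (True/False) that are correct."""
    correct = sum(lab == is_prime(n) for lab, n in zip(model_labels, numbers))
    return correct / len(numbers)

print(is_prime(10007))                             # True
print(accuracy([True, False, False], [7, 9, 15]))  # 1.0
```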
Apart from getting the answers wrong, ChatGPT also got a thumbs down for its attempts to show the researchers how it arrived at its conclusions. As part of the study, the researchers asked the chatbot to lay out its “chain of thought”, the term for when a chatbot explains its reasoning. In March it did so, but by June it had stopped showing its step-by-step reasoning.
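Chain-of-thought output can be elicited explicitly; in its simplest zero-shot form, the prompt just appends an instruction to reason before answering. A tiny sketch, with an invented question and no model call:

```python
# Zero-shot chain-of-thought prompting: append an instruction that
# elicits intermediate reasoning before the final answer.
question = "A train covers 60 km in 45 minutes. What is its speed in km/h?"
cot_prompt = question + "\nLet's think step by step."
print(cot_prompt)
```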
The recent Google study tries to tackle this issue with its in-context learning approach. The findings suggest that exploring longer contexts and prompting for more informative explanations could be valuable research directions.
The Wolfram Headmasters
A pioneer in fusing technology with maths education, Wolfram Research has been working with ChatGPT’s parent OpenAI to bring better maths capabilities to AI models. “We have seen some interesting results with our LLM. I tried to run a British ‘A’ level maths exam, one students take before university, and ChatGPT alone got 43%, which is quite impressive, but Wolfram plus ChatGPT got 96%,” company cofounder Conrad Wolfram revealed in an interview with AIM.
“Game over for humans on that one,” he quipped.
Notably, ChatGPT versions 3.5 and 4 and the Wolfram plugin were all thrown the same maths teaser: what is the smallest integer greater than 95,555 that contains four identical digits? Only the plugin got it right on the first attempt.
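The teaser itself is easy to settle by brute force, which makes the chatbots' misses all the more notable. A short search (the function name is ours) confirms the answer:

```python
from collections import Counter

def smallest_with_four_identical_digits(above: int) -> int:
    """First integer greater than `above` whose decimal form
    contains the same digit at least four times."""
    n = above + 1
    while max(Counter(str(n)).values()) < 4:
        n += 1
    return n

print(smallest_with_four_identical_digits(95555))  # 95999
```

The answer is 95,999 (the digit 9 appears four times), not 96,666 as a hasty digit-pattern guess might suggest.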
The Wolfram + ChatGPT plugin not only solves maths problems step by step but can also present the solutions visually if specifically prompted to do so. Based on the prompts, it can go a step further and represent the data in different forms, such as graphs, charts and histograms.
The plugin can turn natural-language queries into well-formed mathematical equations. It can do so because it combines ChatGPT’s human-mimicking technology with Wolfram’s strong foundation in symbolic programming, which focuses on expressing ideas in computational form.
On one hand, Wolfram is making strides with its plugin; on the other, researchers show model performance worsening. In the current landscape, Google’s latest in-context learning approach could help AI chatbots become above-average students.