Do Large Machine Learning Models Struggle At Maths?

In 1960, the Nobel Prize-winning physicist Eugene Wigner wrote about the ‘unreasonable effectiveness of mathematics in the natural sciences’. Mathematics is called the language of nature for a reason, which is why the ‘Is math invented or discovered?’ debate never gets old. Mathematics exerts its influence on nearly every field.

Mathematics is also the building block of machine learning models. ML practitioners use mathematics to analyse a problem, pick out better heuristics, and combine the two to generate an answer. Despite the critical role mathematics plays in machine learning, even state-of-the-art models struggle at maths.

In a new study, researchers at the University of California, Berkeley, have introduced the MATH dataset. The team said the dataset provides a detailed assessment of a model’s mathematical ability across difficulty levels and subjects.

What Is the MATH Dataset?

The MATH dataset consists of 12,500 problems taken from various high school mathematics competitions. It measures the problem-solving ability of large, general-purpose language models. Given a problem from MATH, a model generates a sequence of text that encodes its final answer.
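As a rough illustration of how a generated sequence can be scored, the sketch below pulls the final answer out of a solution string and compares it with the reference. It assumes the final answer is wrapped in a LaTeX `\boxed{...}` command and that scoring is a plain string match after trimming whitespace. The function names `extract_boxed` and `is_correct` are hypothetical and are not taken from the researchers' code.

```python
from typing import Optional


def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None  # no final answer marked in this solution
    i = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)  # matching close brace found
        chars.append(ch)
        i += 1
    return None  # braces never balanced


def is_correct(generated: str, reference: str) -> bool:
    """Exact-match comparison of the two final answers (an assumed metric)."""
    predicted = extract_boxed(generated)
    expected = extract_boxed(reference)
    if predicted is None or expected is None:
        return False
    return predicted.strip() == expected.strip()
```

Matching braces by depth, rather than with a simple regular expression, keeps nested LaTeX such as `\frac{1}{2}` intact inside the boxed answer.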

MATH problems are labelled from 1 to 5 by difficulty level and span seven subjects: prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus. For geometry problems, diagrams can be specified in the Asymptote language.

Since step-by-step solutions also accompany the problems, language models can learn to answer questions they haven’t been exposed to before. The step-by-step approach allows models to perform intermediate computations instead of giving the final answer immediately. 

Recognising the need to train models on the fundamentals of maths before exposing them to MATH, which covers advanced problem-solving techniques, the team also released Auxiliary Mathematics Problems and Solutions (AMPS). This ‘pretraining corpus’ has over 100,000 problems from Khan Academy with solutions, along with 5 million problems generated using Mathematica scripts based on 100 hand-designed modules.


When large language models, including GPT-3, were tested on the MATH dataset, their accuracies were found to be abysmally low, ranging from 2.9 percent to 6.9 percent. The models did better on the easiest difficulty level, reaching up to 15 percent accuracy. Among human test-takers, a PhD student with no specialisation in mathematics attained 40 percent, while a three-time Olympiad gold medalist scored 90 percent.
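A breakdown by difficulty, like the one reported above, can be computed along the following lines. This is a minimal sketch that assumes each evaluated problem has been reduced to a `(level, correct)` pair. The helper name `accuracy_by_level` and that record layout are illustrative assumptions, not the researchers' evaluation code.

```python
from collections import defaultdict


def accuracy_by_level(results):
    """Map each difficulty level to the fraction of problems answered correctly.

    `results` is an iterable of (level, correct) pairs, where `level` is an
    integer from 1 to 5 and `correct` is a bool.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, correct in results:
        totals[level] += 1
        hits[level] += int(correct)
    # Only levels that actually appear in the results are reported.
    return {level: hits[level] / totals[level] for level in sorted(totals)}
```

Reporting accuracy per level, rather than as a single average, is what makes it visible that models do noticeably better on level 1 problems than on the dataset as a whole.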

Further, having the models generate a step-by-step solution before producing the final answer reduced accuracy. This was because, while many of these steps were related to the question, they were not logical. 

The researchers found that simply increasing training time and the number of parameters proved extremely costly, although it did improve performance in a few cases. They have open-sourced both MATH and AMPS to encourage and facilitate further research in this direction.


OpenAI recently introduced GPT-f, an automated prover and proof assistant for the Metamath formalisation language. Metamath is a language that expresses theorems in abstract mathematics along with proofs that a computer program can validate. 

Last year, Facebook built an AI system that can solve complex mathematical problems using symbolic reasoning. The team developed a way to represent mathematical expressions as a kind of language and then treated finding solutions as a translation problem for sequence-to-sequence neural networks.
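One common way to turn an expression into a ‘sentence’ that a sequence-to-sequence model can read is to write its expression tree in prefix (Polish) notation. The sketch below shows that idea under stated assumptions: expressions are nested tuples of the form `(operator, operand, ...)`, and the function name `to_prefix` is hypothetical. It is an illustration of the general technique, not Facebook's implementation.

```python
def to_prefix(expr):
    """Flatten a nested-tuple expression tree into a list of prefix tokens.

    An internal node is a tuple (operator, child, child, ...); anything else
    is treated as a leaf (a number or a variable name).
    """
    if isinstance(expr, tuple):
        operator, *children = expr
        tokens = [operator]
        for child in children:
            tokens.extend(to_prefix(child))
        return tokens
    return [str(expr)]


# 2 + 3 * x, written as a tree and then as a token sequence.
tokens = to_prefix(("+", 2, ("*", 3, "x")))
```

Prefix notation needs no parentheses, so every expression maps to one unambiguous token sequence, which suits models that consume and emit flat sequences.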

Wrapping Up

‘While most other text-based tasks are already nearly solved by enormous Transformers, MATH is notably different. We showed that accuracy is slowly increasing and, if trends continue, the community will need to discover conceptual and algorithmic breakthroughs to attain strong performance on MATH,’ the researchers stated.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at
