Why Is Grade School Level Maths So Difficult For AI?

OpenAI has released an AI system capable of completing mathematics problems at a grade school level

OpenAI has developed an AI system that solves mathematics word problems at a grade school level. The system solves about 90 per cent as many problems as a small sample of 9-to-12-year-olds: the kids scored 60% on a test drawn from the research dataset, while the AI system scored 55%.

Why it’s hard for models to solve math problems

The research paper, “Training Verifiers to Solve Math Word Problems,” says that even the largest models fail when multi-step mathematical reasoning is needed. A big challenge is that mathematical reasoning is highly sensitive to individual mistakes: one wrong step can derail the whole solution. Autoregressive models have no mechanism for correcting their own errors once they have generated them.

The research adds that solutions that “veer off-course” quickly become unrecoverable. Relying purely on generative methods and extrapolating from current trends is not feasible either, as reaching good performance that way would require an exorbitant parameter count.

What OpenAI suggests

  • The paper proposes training verifiers to evaluate the correctness of model-generated solutions.
  • At test time, the researchers sample a fixed number of candidate solutions and select the one ranked highest by the verifier.
  • OpenAI has released GSM8K, a dataset of 8.5K high-quality problems at the grade school math level. In building the dataset, the focus was on high quality, high diversity and moderate difficulty. The problems do not require any concepts beyond early algebra, and the majority can be solved without explicitly defining a variable.
  • Each problem takes two to eight steps to solve. The steps are a sequence of elementary calculations using basic arithmetic operations.
  • Solutions are written in natural language, which makes them more readily interpretable by humans. The team instructed problem writers to explain their work as much as possible, but allowed them to write solutions in their own linguistic styles.

Finetuning and Verification

The researchers work with two methods: finetuning and verification. For both methods, they used models from the GPT-3 family as initialization, focusing mainly on the 175B and 6B model sizes.

Finetuning uses the same language modelling objective as the generative pretraining in GPT-3. At test time, the researchers evaluate performance by autoregressively sampling a single low-temperature solution and checking whether the final answer is correct.
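As a rough illustration of what "low-temperature" sampling means, the sketch below rescales a set of next-token scores by a temperature before turning them into probabilities. The scores and temperature values are made up for the example; the article does not say which temperatures the researchers used.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical scores for three candidate next tokens (illustrative only).
logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.1)   # nearly always picks token 0
high = softmax_with_temperature(logits, 1.5)  # spreads probability more evenly
token = sample_token(logits, 0.1, random.Random(0))
```

At a low temperature almost all of the probability goes to the top-scoring token, so a single sample is close to the model's most likely answer. A higher temperature yields more varied samples.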

Verification consists of sampling multiple high-temperature solutions, assigning each solution a score, and then outputting the highest-ranked solution. The verifiers are trained to evaluate the correctness of solutions, and the training signal is determined by whether or not the solution reached the correct final answer.
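The verification procedure can be sketched in a few lines. This is a minimal illustration and not OpenAI's implementation: `label_solutions`, `select_best` and the toy scoring table are hypothetical names, and a real verifier would be a trained model rather than a lookup.

```python
from typing import Callable, List, Tuple

def label_solutions(solutions: List[Tuple[str, str]], correct_answer: str) -> List[int]:
    """Training signal: 1 if a sampled solution's final answer is correct, else 0.

    Each solution is a (solution_text, final_answer) pair.
    """
    return [1 if final == correct_answer else 0 for _, final in solutions]

def select_best(candidates: List[str], verifier: Callable[[str], float]) -> str:
    """Score every candidate with the verifier and return the highest-ranked one."""
    return max(candidates, key=verifier)

# Toy stand-in for a trained verifier (assumption: it returns a score in [0, 1]).
toy_scores = {"answer: 18": 0.91, "answer: 20": 0.12, "answer: 16": 0.40}

def toy_verifier(solution: str) -> float:
    return toy_scores.get(solution, 0.0)

best = select_best(list(toy_scores), toy_verifier)
labels = label_solutions([("work A", "18"), ("work B", "20")], correct_answer="18")
```

Note that the labels depend only on the final answer, so a solution with flawed reasoning that happens to reach the right number still counts as correct.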

The models often fail to perform calculations accurately. To address this, the team trained all models to use a calculator by injecting calculation annotations into the training set. To solve a new problem at test time, the research team generated 100 candidate solutions and selected the one ranked highest by the verifier.
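To make the calculator idea concrete, here is a small sketch that finds calculation annotations in a solution and overwrites each result with the value a calculator would give. The `<<expression=result>>` syntax and all function names are assumptions made for illustration; the article does not describe the annotation format or how it is applied during sampling.

```python
import ast
import operator
import re

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expression: str) -> float:
    """Safely evaluate +, -, *, / arithmetic without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression.strip(), mode="eval").body)

def format_number(value) -> str:
    """Print whole numbers without a trailing '.0'."""
    value = float(value)
    return str(int(value)) if value.is_integer() else str(round(value, 6))

# Assumed annotation format: <<expression=result>>
ANNOTATION = re.compile(r"<<([^=<>]+)=([^<>]*)>>")

def apply_calculator(text: str) -> str:
    """Replace the result in every annotation with the computed value."""
    def fix(match):
        return f"<<{match.group(1)}={format_number(evaluate(match.group(1)))}>>"
    return ANNOTATION.sub(fix, text)

# The model wrote 25, but the calculator computes 24.
fixed = apply_calculator("Half of 48 is <<48/2=25>> clips.")
```

The point of the design is that the model only has to write down the right expression, while the arithmetic itself is delegated to a deterministic tool.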

What OpenAI found out

The researchers at OpenAI found a strong boost in performance from verification when the dataset is large enough. With small datasets, the team found that the verifiers do not learn useful properties of mathematical reasoning and instead overfit by memorizing the final answers in the training set.

On the full training set, 6B-parameter verification slightly outperforms a finetuned 175B-parameter model. Token-level verifiers are also less prone to overfitting than solution-level verifiers.

Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at sreejani.bhattacharyya@analyticsindiamag.com
