Active Hackathon

After grade school level math, OpenAI now tackles high school Math Olympiad problems

OpenAI said that it had achieved a new state-of-the-art (41.2 per cent vs 29.3 per cent) on the miniF2F benchmark.

Innovation leader OpenAI has come out with a neural theorem prover for Lean that can solve a range of difficult high-school olympiad problems. Some of these problems are from AMC12 and AIME competitions, with two being adapted from the International Math Olympiad (IMO). OpenAI said that the prover “uses a language model to find proofs of formal statements.” Every time it finds new proof, it is used as new training data. This improves the neural network and helps to find solutions to harder statements.

miniF2F benchmark

OpenAI said that it had achieved a new state-of-the-art (41.2 per cent vs 29.3 per cent) on the miniF2F benchmark. It consists of 488 problem statements taken from the AIME, AMC, and IMO and also material from high-school and undergraduate mathematics courses. Due to the scarcity of libraries specific to these types of problems, this dataset fits perfectly well for the study.


Sign up for your weekly dose of what's up in emerging technology.

Completing mathematics problems at a grade school level

It is not the first for OpenAI to come up with algorithms to look into the problem of solving mathematics problems. A few months ago, it had developed an AI system capable of completing mathematics problems at a grade school level. It was able to solve almost as many problems as a sample of 9-12-year-olds (90 per cent). The kids scored 60 per cent on a test from the research dataset, while the AI system scored 55 per cent.

In the paper titled “Formal Mathematics Statement Curriculum Learning“, the research team says that dealing with formal mathematics has two main issues. 

Infinite action space

What this means is that for every step of a proof search, a model has to choose from an infinite set of tactics. This may include “exogenous mathematical terms that have to be generated.”

How OpenAI deals with it

This model solves this issue by sampling actions from a language model while undergoing a search. OpenAI says, “Language models have the capability to generate the tactic calls as well as the original mathematical terms often required as arguments.”

No direct self-play setup

The paper says that in formal mathematics, a prover plays against a set of statements to prove. When faced with a hard statement, there is no reframing as such of the setup which can allow the prover to generate intermediary easier statements to work on first.

OpenAI adds, “Our basis for addressing the lack of self-play is the observation that the key role of self-play in 2-player games is to provide an unsupervised curriculum.”

Auxiliary set of problem statements

It proposes to set up an auxiliary set of problem statements that do not require proofs of varying difficulty. The team shows that if the difficulty of these auxiliary problems is varied enough, the procedure can solve a curriculum of increasingly difficult problems.


The team has built “lean-gym”, an REPL (read-eval-print-loop) interface for interacting with the Lean theorem prover. 

  • They have proposed an expert iteration methodology for GPT-f (Polu & Sutskever, 2020). The team showed that “at fixed compute budget, expert iteration outperforms proof search only.”
  • The team also presented a synthetic inequality generator and studied how expert iteration solves increasingly difficult problems from a set of generated statements of various difficulties.
  • OpenAI also presented a manually curated set of formalised problem statements and used it to achieve state-of-the-art performance on the miniF2F benchmark. 


Image: OpenAI

As per OpenAI, the team observed that the “capability to generate original mathematical terms required as arguments of tactics, which cannot be done without a neural language model, emerges from our training procedure.” This cannot be done without a neural language model that comes out from its training procedure.

The proof step uses n + 1 proposes that are generated by the research models to use n + 1 as a solution. The rest of it depends on the ring_exp tactic to verify the validity.

More Great AIM Stories

Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at

Our Upcoming Events

Conference, in-person (Bangalore)
Machine Learning Developers Summit (MLDS) 2023
19-20th Jan, 2023

Conference, in-person (Bangalore)
Rising 2023 | Women in Tech Conference
16-17th Mar, 2023

Conference, in-person (Bangalore)
Data Engineering Summit (DES) 2023
27-28th Apr, 2023

Conference, in-person (Bangalore)
MachineCon 2023
23rd Jun, 2023

3 Ways to Join our Community

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Telegram Channel

Discover special offers, top stories, upcoming events, and more.

Subscribe to our newsletter

Get the latest updates from AIM