Finally, a language model that does Maths

Minerva is built on the Pathways Language Model (PaLM), with extended training on a 118GB dataset of scientific papers from arXiv and 38.5B tokens of mathematical data derived from web pages.

A raft of large language models burst onto the scene this year, with applications ranging from automated code generation to text-to-image generation. However, these LLMs have come up short on the quantitative reasoning front.

Google has broken this barrier with its latest language model, Minerva. Named after the Roman goddess of wisdom, it is trained on a high-quality scientific and mathematical dataset.

Minerva in a nutshell

Minerva processes scientific and mathematical questions posed in natural language and generates step-by-step solutions using LaTeX notation, MathJax, or other mathematical typesetting formats.

The model comes in three baseline sizes: 8B, 62B, and 540B parameter pretrained models. Along with extended pretraining on mathematical data, Minerva incorporates prompting and evaluation techniques such as chain-of-thought prompting, scratchpads, and majority voting to produce step-by-step solutions and select the best result from a sample of multiple candidate solutions.
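The majority-voting idea described above is simple to sketch. The snippet below is an illustrative Python sketch, not Minerva's actual implementation: the model samples many step-by-step solutions to the same question, and the final answer that appears most often is chosen. The solution format and function names here are assumptions for illustration.

```python
from collections import Counter

def majority_vote(solutions):
    """Pick the most common final answer from sampled solutions.

    Each solution is assumed to be a (reasoning, final_answer) tuple.
    Returns the winning answer and the fraction of samples agreeing
    with it, which can serve as a rough confidence signal.
    """
    answers = [answer for _reasoning, answer in solutions]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

# Five sampled solutions to one question; three agree on "42".
samples = [
    ("step-by-step derivation A", "42"),
    ("step-by-step derivation B", "42"),
    ("step-by-step derivation C", "41"),
    ("step-by-step derivation D", "42"),
    ("step-by-step derivation E", "7"),
]
print(majority_vote(samples))  # → ('42', 0.6)
```

Sampling many solutions and voting tends to outperform taking a single greedy answer, since independent reasoning errors rarely converge on the same wrong result.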

STEM benchmarks

Minerva’s quantitative reasoning capabilities were tested on STEM benchmarks drawn largely from the United States education system, with difficulty ranging from grade school problems to graduate-level coursework.

  • MATH: A dataset of 12K middle school and high school math problems written in LaTeX. The models are prompted with a fixed four-shot prompt consisting of four random examples whose ground-truth targets are not too long.
  • MMLU-STEM: Focused on STEM, this subset of the Massive Multitask Language Understanding benchmark covers topics such as engineering, chemistry, math, and physics at the high school and college levels. In this test, a combination of five-shot prompting, chain-of-thought prompting, and a multiple-choice version of the MATH prompt is used to tackle various problems.
  • GSM8k: Grade school-level math problems involving basic arithmetic operations that should all be solvable by a talented middle school student. The model is evaluated using chain-of-thought prompting, and no external tools are used for calculations.
  • OCWCourses: A set of undergraduate-level STEM problems covering topics such as differential equations, solid-state chemistry, astronomy, and special relativity, created from publicly available MIT OpenCourseWare materials. Only problems with automatically verifiable solutions (either numeric or symbolically verifiable via SymPy) were included.
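The few-shot prompting used across these benchmarks can be sketched in a few lines: worked examples are concatenated ahead of the new question so the model continues the pattern. The "Problem:"/"Solution:" delimiters below are illustrative assumptions, not the exact format used in the paper.

```python
def build_few_shot_prompt(examples, question):
    """Assemble a k-shot prompt from (problem, worked solution) pairs.

    Each example pairs a problem with its step-by-step solution; the
    new question is appended last with an empty solution slot, so the
    model's continuation becomes its answer.
    """
    parts = []
    for problem, solution in examples:
        parts.append(f"Problem: {problem}\nSolution: {solution}")
    parts.append(f"Problem: {question}\nSolution:")
    return "\n\n".join(parts)

shots = [
    ("What is 2 + 3?", "Adding 2 and 3 gives 5. The answer is 5."),
    ("What is 4 * 6?", "Multiplying 4 by 6 gives 24. The answer is 24."),
]
prompt = build_few_shot_prompt(shots, "What is 7 - 2?")
print(prompt)
```

With chain-of-thought examples like these, the model is nudged to emit its own intermediate reasoning before the final answer rather than guessing directly.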

Minerva outperformed state-of-the-art results by a wide margin.

Table: comparison between Minerva, PaLM, and davinci-002 (source: arxiv.org)

Figure: accuracy of Minerva and other LMs on MATH and MMLU subtopics

The Minerva models were also tested on Poland’s 2021 national math exam. The 62B baseline model scored 57%, matching the national average that year, while the 540B variant scored 65%.

What’s the big deal?

Quantitative reasoning is the ability to use mathematics and information to solve real-world problems. In 2021, OpenAI’s GPT-3 could complete only 2.9% to 6.9% of problems from a dataset of over 12,500. OpenAI later launched GPT-f, an automated prover and proof assistant for the Metamath formalization language, and the first machine learning-based system to contribute proofs adopted by a formal mathematics community.

Guillaume Lample and François Charton at Facebook AI Research have trained a neural network capable of symbolic reasoning over differential and integral equations.

Minerva is trained on a large dataset that combines natural language understanding with the correct use of formal mathematical language (equations and diagrams). By increasing data quality and model size, the model sets a new baseline on quantitative reasoning benchmarks.

Figure: sample problem solved by Minerva (source: arxiv.org)

The researchers said one of the direct applications could be the creation of an accessible and affordable AI-based math tutor.

Not a perfect model

Despite training on an extensive mathematical dataset, Minerva is far from a perfect problem solver. Analysis of a sample of problems the model got wrong revealed a pattern.

  1. About half of the errors were calculation mistakes, while the other half were solution steps that did not follow a logical chain of thought.
  2. In some cases the model arrived at the right answer through incorrect reasoning; such cases were referred to as ‘false positives’. However, the rate of false positives was relatively low.

The model has no access to external tools such as a calculator or a Python interpreter, limiting its ability to handle tasks that require complicated numerical calculations. Check out the demo of the Minerva explorer.
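To illustrate the gap, the sketch below shows what delegating arithmetic to an interpreter could look like in a hypothetical tool-augmented system. Minerva itself does no such delegation, and every name here is illustrative; the sketch safely evaluates plain arithmetic expressions by walking a restricted Python AST instead of asking the model to multiply large numbers "in its head".

```python
import ast
import operator

# Whitelisted operators: anything else in the expression is rejected.
OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expr):
    """Evaluate a plain arithmetic expression via a restricted AST walk."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("1234 * 5678"))  # → 7006652
```

A system with this kind of tool access would never make the pure calculation errors that account for about half of Minerva's mistakes, though it would not help with the flawed-reasoning half.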

Kartik Wali
A writer by passion, Kartik strives to get a deep understanding of AI, Data analytics and its implementation on all walks of life. As a Senior Technology Journalist, Kartik looks forward to writing about the latest technological trends that transform the way of life!
