AI Researchers Make A Case For Better Benchmarks In AI

Stanford University recently released the 2021 AI Index, highlighting major trends and advancements in artificial intelligence. The fourth edition of the report covers the technology’s impact on society, education, and policy, and outlines the progress made in AI subdomains such as deep learning, object detection, and NLP.

The highlights from the 2021 report included AI research citations, AI startup funding, and the growing conversation around AI ethics. One of the more significant observations made in the report was the need for more and better benchmarks in AI and related fields such as ethics, NLP, and computer vision.

“We’re running out of tests as fast as we can build them,” said Jack Clark, head of an OECD group working on algorithm impact assessment and former policy director for OpenAI.

What Are Benchmarks?

Benchmarks check whether a system is fit to be deployed in real-world situations. They provide a reliable, transparent, standardised way to gauge how a system performs, across different parameters, when handling a workload.

So while a task and the metrics associated with a model can be thought of as an abstraction of the problem at hand, benchmark datasets provide fixed representations of tasks to be solved by a model.
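As a minimal sketch of that idea, a benchmark can be modelled as a frozen set of examples paired with a metric, so that every model is scored on exactly the same data. Everything here is hypothetical and for illustration only: the `Benchmark` class, the `toy-sentiment` data, and the two toy models are assumptions, not part of the AI Index or any real benchmark suite.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


@dataclass(frozen=True)
class Benchmark:
    """A fixed representation of a task: frozen examples plus a metric."""
    name: str
    examples: Tuple[Tuple[str, int], ...]  # (input text, label) pairs
    metric: Callable[[Sequence[int], Sequence[int]], float]


def evaluate(benchmark: Benchmark, model: Callable[[str], int]) -> float:
    """Score a model on the benchmark's fixed examples."""
    y_true = [label for _, label in benchmark.examples]
    y_pred = [model(text) for text, _ in benchmark.examples]
    return benchmark.metric(y_true, y_pred)


# Hypothetical toy data: 1 = positive sentiment, 0 = negative sentiment.
TOY_SENTIMENT = Benchmark(
    name="toy-sentiment",
    examples=(
        ("great product", 1),
        ("terrible service", 0),
        ("really great", 1),
        ("not good", 0),
    ),
    metric=accuracy,
)


def keyword_model(text: str) -> int:
    """Naive model: positive if a positive keyword appears anywhere."""
    return 1 if "great" in text or "good" in text else 0


def always_positive(text: str) -> int:
    """Baseline that ignores its input."""
    return 1


keyword_score = evaluate(TOY_SENTIMENT, keyword_model)
baseline_score = evaluate(TOY_SENTIMENT, always_positive)
print(TOY_SENTIMENT.name, keyword_score, baseline_score)
```

Because the examples and the metric are fixed, the two scores are directly comparable, which is the property that lets researchers compare ideas quickly.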

Benchmarking is an important driver for research and innovation. Experts such as David Patterson, author of Computer Architecture: A Quantitative Approach, believe that good benchmarks help researchers compare ideas quickly, which results in better innovation.

Growing Need For Better Benchmarks

Research and development in AI are happening at lightning speed. As a result, benchmarks are getting saturated quickly. In NLP, for instance, new models are released every month, and previously established benchmarks soon fall short as models begin to overfit to them.

However, the good news is that the open-source movement and increased collaboration within the research community have led to better AI/ML benchmarks.

A good benchmark serves several purposes.

  • For beginners, benchmarks help in getting to grips with new terms and data.
  • For experienced researchers, benchmarks offer a quick-to-collect baseline. Any disagreement between the benchmark and specific measurements of the model can help identify areas of improvement.
  • For users and solution providers, benchmarks help in estimating the developmental costs of infrastructure.

Representative benchmarks allow engineering efforts to be focussed on high-value and widely used targets. Benchmarks help optimise the system and ensure improved value and RoI for all the stakeholders: manufacturers, users, researchers, consultants, and analysts.

The attributes of good AI/ML benchmarks:

  • The use of relevant metrics is critical. A 2020 study of 3,000 research papers available on Papers with Code found that most of them used the same common metrics. ‘Accuracy’ was the most common, appearing in 38 percent of the benchmark data sets. The drawback is that results reported this way can be uninformative, unhelpful, and sometimes irrelevant.
  • A good benchmark suite consists of diverse and representative workloads. This helps in covering a large fraction of the application space.
  • The benchmarks chosen should keep pace with current problems. A fixed benchmark suite quickly becomes obsolete, so rapid iteration is needed for a suite to remain relevant.
  • A good benchmark suite should support repeatability regardless of where an experiment is conducted.
  • A benchmark test should be scalable. 
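
The point about relevant metrics can be illustrated with a small sketch of how accuracy alone may mislead. The data below is made up for illustration (an assumed 95/5 class imbalance), and the helper functions are hypothetical, written in plain Python rather than taken from any particular library.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class.

    Returns 0.0 when there are no true positives (a common convention,
    adopted here as an assumption).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up imbalanced data set: 95 negative examples, 5 positive examples.
y_true = [0] * 95 + [1] * 5

# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

majority_accuracy = accuracy(y_true, y_pred)
majority_f1 = f1_score(y_true, y_pred)
print(f"accuracy={majority_accuracy:.2f}, f1={majority_f1:.2f}")
```

Here the trivial model scores 95 percent accuracy while never identifying a single positive example, which the F1 score of zero makes visible. Which metric is relevant depends on the task, so this is one possible illustration rather than a recommendation of F1 over accuracy in general.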

AI Ethics & Benchmarks

The 2021 AI Index also noted that despite the growing conversation around AI ethics and related domains, the field significantly lacks benchmarks to measure or assess the relationship between technologies and their impact on society. Citing a National Institute of Standards and Technology study on bias in facial recognition performance, the report said that while creating more data and relevant benchmarks is a challenge, it remains an important area of focus. “Policymakers are keenly aware of ethical concerns pertaining to AI, but it is easier for them to manage what they can measure, so finding ways to translate qualitative arguments into quantitative data is an essential step in the process,” the report stated.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at
