
Rethinking The Way We Benchmark Machine Learning Models


“Unless you have confidence in the ruler’s reliability, if you use a ruler to measure a table, you may also be using the table to measure the ruler.”

Wittgenstein’s ruler

Do machine learning researchers solve something significant every time they beat a benchmark? If not, why do we have these benchmarks? Benchmarks do guide researchers and their research objectives. But if a benchmark is breached every couple of months, research objectives risk becoming more about chasing benchmarks than about solving bigger problems.

In order to address these challenges, researchers at Facebook AI have introduced Dynabench, a new platform for dynamic data collection and benchmarking. Dynabench can be used to collect human-in-the-loop data dynamically, against the current state-of-the-art, in a way that more accurately measures progress.

What’s Wrong With Current Benchmarks

Benchmarks are meant to challenge the ML community over long periods. But the rate at which AI is expanding can saturate existing benchmarks quickly: with a new NLP model being released almost every two months, benchmarks fall behind.

Static benchmarking also lures researchers into overfitting their models to the benchmark. “Researchers have built lucrative careers from cranking out percentage-point improvements to claim “SOTA” on established benchmarks,” stated the researchers at Facebook.

Added to this are the well-documented cases of inadvertent biases present in datasets. For example, in a Q&A experiment, the answer to a “how much” or “how many” question is usually “2”. There might also be unintended overlap between the train and test sets. Such data biases are almost impossible to avoid and may have serious, potentially harmful side-effects.

Benchmarks are static for historical reasons. Up until recently, we did not have the crowdsourcing platforms or the infrastructure needed to serve large-scale models for inference. Datasets were expensive to collect, took a long time to saturate, and models had a long way to go. Putting humans and models in the data collection loop together made little sense, since models were simply too brittle.

With recent advances, however, the Facebook researchers wrote, models are good enough to be put in the loop with humans to measure the problem we really care about: how well AI systems can work together with humans.

Introducing Dynabench

The basic idea is that data is collected dynamically. Humans are tasked with finding adversarial examples that fool the current state-of-the-art models.

So, what does Dynabench actually do?

  • It allows researchers to measure how good the current SOTA methods really are.
  • It yields data that may be used to further train even stronger SOTA models.
  • The process is repeated over multiple rounds.
  • Each time a round gets “solved” by the SOTA, those models are used to collect a new dataset where they fail (a toy sketch of this loop follows the list).
  • Datasets will be released periodically as new examples are collected.
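
To make this round-based process concrete, here is a minimal, purely illustrative Python sketch. The "model", the "human annotator", and the retraining step are toy stand-ins invented for this example; this is not the Dynabench implementation or API, it only mirrors the collect-failures, retrain, repeat loop described above.

# Toy sketch of dynamic, multi-round adversarial data collection.
# All names and behaviours here are hypothetical, not the Dynabench codebase.
import random

def model_is_correct(example, training_data):
    """Pretend SOTA model: always right on examples it was trained on,
    and right on roughly 70% of unseen examples."""
    return example in training_data or random.random() < 0.7

def human_finds_failure(training_data, candidate_pool):
    """A 'human annotator' keeps proposing examples until one fools the current model."""
    while True:
        candidate = random.choice(candidate_pool)
        if not model_is_correct(candidate, training_data):
            return candidate

def dynamic_benchmark(candidate_pool, num_rounds=3, examples_per_round=5):
    training_data = set()      # grows each round with the model's collected failures
    released_datasets = []     # one dataset released per round
    for _ in range(num_rounds):
        round_examples = [
            human_finds_failure(training_data, candidate_pool)
            for _ in range(examples_per_round)
        ]
        released_datasets.append(round_examples)
        # "Retrain" a stronger model: it now handles everything collected so far,
        # so the next round has to uncover new failure modes.
        training_data.update(round_examples)
    return released_datasets

if __name__ == "__main__":
    pool = [f"tricky example #{i}" for i in range(100)]
    for round_id, data in enumerate(dynamic_benchmark(pool), start=1):
        print(f"Round {round_id}: collected {len(data)} adversarial examples")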

The key idea behind Dynabench is to leverage human creativity to challenge the models. Machines are nowhere close to comprehending language the way humans do. In Dynabench, if a language model is asked to classify the sentiment of a review, the wit and hyperbole of natural language can fool it. The human annotators keep adding these adversarial examples until the model can no longer be fooled. In this way, unlike in traditional benchmarking, humans remain continuously in the loop as the models progress.

For each task in Dynabench, there will be multiple rounds of evaluation. According to the researchers, the models are served in the cloud via TorchServe. Crowdsourced annotators are connected to the platform via Mephisto, and humans interacting with the model receive almost instantaneous feedback on its response. They can employ tactics such as making the system focus on the wrong word, or using clever references to real-world knowledge that the machine does not have access to.
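
To illustrate the serving side, below is a minimal client sketch of how an annotation front end might get near-instant feedback from a model hosted behind TorchServe's standard inference REST endpoint. The host, port, model name ("sentiment"), and JSON response format are assumptions made for this example, not Dynabench's actual configuration.

# Hypothetical client-side sketch: query a TorchServe-hosted model for instant feedback.
# Endpoint host, port, and model name are assumptions, not Dynabench's real setup.
import requests

TORCHSERVE_URL = "http://localhost:8080/predictions/sentiment"  # hypothetical endpoint

def get_model_feedback(text):
    """Send the annotator's candidate example to the model and return its prediction."""
    response = requests.post(TORCHSERVE_URL, data=text.encode("utf-8"), timeout=5)
    response.raise_for_status()
    return response.json()  # assumes the model handler returns a JSON prediction

if __name__ == "__main__":
    candidate = "Oh sure, the plot was SO original that I fell asleep twice."
    prediction = get_model_feedback(candidate)
    # If the model labels this sarcastic pan as positive, the annotator has found
    # an adversarial example worth adding to the next round's dataset.
    print(prediction)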

That said, there are still risks such as catastrophic forgetting or cyclical “progress”, where improved models forget things that were relevant in an earlier round. “Research is required in trying to understand these shifts better, in characterising how it might impact learning, and in overcoming any adverse effects. Remember that Dynabench is a scientific experiment!” warned the researchers behind Dynabench.

Know more about Dynabench here.

PS: The story was written using a keyboard.