Rethinking The Way We Benchmark Machine Learning Models

“Unless you have confidence in the ruler’s reliability, if you use a ruler to measure a table, you may also be using the table to measure the ruler.”

Wittgenstein’s ruler

Do machine learning researchers solve something significant every time they beat a benchmark? If not, why do we have benchmarks at all? Benchmarks do guide researchers and their research objectives. But if a benchmark is beaten every couple of months, research may become more about chasing benchmarks than about solving bigger problems.

In order to address these challenges, researchers at Facebook AI have introduced Dynabench, a new platform for dynamic data collection and benchmarking. Dynabench can be used to collect human-in-the-loop data dynamically, against the current state-of-the-art, in a way that more accurately measures progress.

What’s Wrong With Current Benchmarks

Benchmarks are meant to challenge the ML community over long periods. But the pace of AI progress can saturate existing benchmarks quickly. With a new NLP model released almost every two months, benchmarks fall behind.



Static benchmarks also lure researchers into overfitting their models to the benchmark. “Researchers have built lucrative careers from cranking out percentage-point improvements to claim ‘SOTA’ on established benchmarks,” stated the researchers at Facebook.

Added to this are the well-documented cases of inadvertent biases in datasets. For example, in a Q&A experiment, the answer to a “how much” or “how many” question is usually “2”. There may also be unintended overlap between the train and test sets. Data biases are almost impossible to avoid and can have serious, potentially harmful side effects.
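The train/test overlap mentioned above is one of the easier problems to check for. The snippet below is a minimal sketch, not a method described by the Dynabench team: the function names and the toy questions are hypothetical, and it only catches duplicates that differ in case or whitespace. Paraphrased duplicates would need fuzzier matching.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def find_overlap(train, test):
    """Return the test examples whose normalized text also appears in train."""
    seen = {normalize(t) for t in train}
    return [t for t in test if normalize(t) in seen]


# Hypothetical toy data, for illustration only.
train_questions = ["How many wheels does a car have?", "Who wrote Hamlet?"]
test_questions = ["how many wheels does a  car have?", "What is the capital of France?"]

leaked = find_overlap(train_questions, test_questions)
print(f"{len(leaked)} of {len(test_questions)} test examples also appear in train")
```

A check like this says nothing about subtler biases, such as the skewed answer distribution in the “how many” example, which would need a look at label frequencies per question type.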


Benchmarks are static for historical reasons. Until recently, we had neither crowdsourcing platforms nor the capability to serve large-scale models for inference. Benchmark datasets were expensive to collect and took a long time to saturate, and models had a long way to go. Putting humans and models in the data collection loop together made little sense since models were simply too brittle.

With recent advances, however, the Facebook researchers wrote, models are now good enough to be put in the loop with humans to measure the problem we really care about: how well AI systems can work together with humans.

Introducing Dynabench

The basic idea is that we collect data dynamically. Humans are tasked with finding adversarial examples that fool current state-of-the-art models. 

So, what does Dynabench actually do?

  • It allows researchers to measure how good the current SOTA methods really are.
  • It yields data that may be used to further train even stronger SOTA models. 
  • The process is repeated over multiple rounds.
  • Each time a round gets “solved” by the SOTA, those models are used to collect a new dataset where they fail. 
  • Datasets will be released periodically as new examples are collected.
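The round structure in the list above can be sketched in a few lines. This is a toy illustration under loud assumptions, not Dynabench's implementation: the keyword classifier, the memorizing "retraining" step, and the annotator submissions are all hypothetical stand-ins for a real model, a real training run, and real crowdworkers.

```python
def make_model(memory):
    """Toy stand-in for training: memorize seen examples, else use a keyword rule."""
    lookup = dict(memory)

    def predict(text):
        if text in lookup:
            return lookup[text]
        return "positive" if "great" in text.lower() else "negative"

    return predict


def collect_round(model, candidates):
    """Keep only the (text, gold label) pairs the current model gets wrong."""
    return [(text, label) for text, label in candidates if model(text) != label]


# Hypothetical annotator submissions: (text, gold label).
candidates = [
    ("A great film.", "positive"),
    ("Great, another two hours I will never get back.", "negative"),
    ("I could not stop smiling.", "positive"),
]

training_data = []
model = make_model(training_data)
rounds = []
for round_id in range(1, 4):
    fooled = collect_round(model, candidates)  # examples that fool the model
    rounds.append(fooled)
    print(f"round {round_id}: {len(fooled)} model-fooling examples collected")
    if not fooled:
        break  # the round is "solved"; real annotators would now write new examples
    training_data.extend(fooled)               # the new data feeds the next model
    model = make_model(training_data)
```

In this sketch the candidate pool is fixed, so the loop ends once the model has absorbed it. On a live platform the annotators keep writing fresh examples against each new model, which is what keeps the benchmark from saturating.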

The key idea behind Dynabench is to leverage human creativity to challenge the models. Machines are nowhere close to comprehending language the way humans do. Suppose a language model is asked to classify the sentiment of a review: the wit and hyperbole of natural language can fool it. Human annotators keep adding such adversarial examples until the model can no longer be fooled. In this way, humans stay continuously in the loop as machines progress, unlike in traditional benchmarking.

For each task in Dynabench, there will be multiple rounds of evaluation. According to the researchers, the models are served in the cloud via TorchServe. Crowdsourced annotators are connected to the platform via Mephisto, and humans interacting with a model receive almost instantaneous feedback on its response. Annotators can employ tactics such as making the system focus on the wrong word or using clever references to real-world knowledge that the machine does not have access to.

That said, there are still risks such as catastrophic forgetting or cyclical “progress”, where improved models forget things that were relevant in an earlier round. “Research is required in trying to understand these shifts better, in characterising how it might impact learning, and in overcoming any adverse effects. Remember that Dynabench is a scientific experiment!” warned the researchers behind Dynabench.
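One simple way to watch for this kind of forgetting is to re-evaluate each new model on the data from every earlier round and flag any round where accuracy drops. The sketch below assumes exactly that; the two toy models, the round data, and the function names are hypothetical and do not come from the Dynabench researchers.

```python
def accuracy(model, examples):
    """Fraction of (text, gold label) pairs the model labels correctly."""
    correct = sum(1 for text, label in examples if model(text) == label)
    return correct / len(examples)


def regressions(old_model, new_model, rounds, tolerance=0.0):
    """Return {round name: (old acc, new acc)} where the new model is worse."""
    worse = {}
    for name, examples in rounds.items():
        old_acc = accuracy(old_model, examples)
        new_acc = accuracy(new_model, examples)
        if new_acc < old_acc - tolerance:
            worse[name] = (old_acc, new_acc)
    return worse


# Hypothetical data from two collection rounds.
rounds = {
    "round1": [("good movie", "positive"), ("bad movie", "negative")],
    "round2": [("not good at all", "negative"), ("not bad at all", "positive")],
}


def old_model(text):
    # Toy rule that handles round 1 but is fooled by negation.
    return "positive" if "good" in text else "negative"


def new_model(text):
    # Toy rule overfit to round 2: it has "forgotten" plain positive reviews.
    return "positive" if text.startswith("not bad") else "negative"


forgotten = regressions(old_model, new_model, rounds)
print(forgotten)
```

Here the newer model solves round 2 but loses ground on round 1, which is the cyclical pattern the researchers warn about.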

Learn more about Dynabench here.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
