Rethinking The Way We Benchmark Machine Learning Models

“Unless you have confidence in the ruler’s reliability, if you use a ruler to measure a table, you may also be using the table to measure the ruler.”

Wittgenstein’s ruler

Do machine learning researchers solve something huge every time they beat a benchmark? If not, why do we have these benchmarks? Benchmarks do guide researchers and their research objectives. But if a benchmark is beaten every couple of months, research objectives can become more about chasing benchmarks than solving bigger problems.

To address these challenges, researchers at Facebook AI have introduced Dynabench, a new platform for dynamic data collection and benchmarking. Dynabench collects human-in-the-loop data dynamically, against the current state-of-the-art models, in a way that more accurately measures progress.



What’s Wrong With Current Benchmarks

Benchmarks are meant to challenge the ML community over long periods. The pace at which AI advances, however, can make existing benchmarks saturate quickly. With a new NLP model being released almost every two months, benchmarks struggle to keep up.

Static benchmarks also lure researchers into overfitting their models to the benchmark. “Researchers have built lucrative careers from cranking out percentage-point improvements to claim ‘SOTA’ on established benchmarks,” stated the researchers at Facebook.

Added to this are the well-documented cases of inadvertent bias in datasets. For example, in a Q&A dataset, the answer to a “how much” or “how many” question is usually “2”, so a model can score well by guessing rather than by understanding the question. There might also be unintended overlap between the train and test sets. Such data biases are almost impossible to avoid and may have serious, potentially harmful side effects.
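Both problems can be checked for before trusting a benchmark score. The Python sketch below is purely illustrative and is not part of Dynabench: the function names and the toy question-answer pairs are made up for this example. It measures how far a model gets by always guessing the most common training answer, and what fraction of test questions also appear in the training set.

```python
from collections import Counter


def majority_answer_accuracy(train_answers, test_answers):
    """Accuracy obtained by always predicting the most common training answer."""
    most_common, _ = Counter(train_answers).most_common(1)[0]
    hits = sum(1 for answer in test_answers if answer == most_common)
    return most_common, hits / len(test_answers)


def normalise(text):
    """Lowercase and collapse whitespace so near-identical questions match."""
    return " ".join(text.lower().split())


def train_test_overlap(train_questions, test_questions):
    """Fraction of test questions that also occur in the training set."""
    seen = {normalise(q) for q in train_questions}
    duplicates = [q for q in test_questions if normalise(q) in seen]
    return len(duplicates) / len(test_questions)


# Toy data, invented for illustration only.
train = [
    ("How many wheels does a bike have?", "2"),
    ("How many eyes does a cat have?", "2"),
    ("How many sides does a triangle have?", "3"),
    ("How many ears does a dog have?", "2"),
]
test = [
    ("How many eyes does a cat have?", "2"),
    ("How many legs does a bird have?", "2"),
    ("How many days are in a week?", "7"),
    ("how many wheels does a bike have?", "2"),
]

guess, guess_accuracy = majority_answer_accuracy(
    [a for _, a in train], [a for _, a in test]
)
overlap = train_test_overlap([q for q, _ in train], [q for q, _ in test])

print(guess, guess_accuracy)  # 2 0.75
print(overlap)                # 0.5
```

On this toy data, a "model" that answers “2” to everything scores 75%, and half of the test questions are leaked from the training set. A real check would need fuzzier matching than exact string comparison.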

Benchmarks are static for historical reasons. Until recently, we had neither crowdsourcing platforms nor the capability to serve large-scale models for inference. Benchmarks were expensive to collect and took a long time to saturate, and models had a long way to go. Putting humans and models in the data collection loop together made little sense, since models were simply too brittle.

With recent advances, however, models are good enough to be put in the loop with humans, the Facebook researchers wrote. This makes it possible to measure the problem we really care about: how well AI systems can work together with humans.

Introducing Dynabench

The basic idea is that we collect data dynamically. Humans are tasked with finding adversarial examples that fool current state-of-the-art models. 

So, what does Dynabench actually do?

  • It allows researchers to measure how good the current SOTA methods really are.
  • It yields data that may be used to further train even stronger SOTA models.
  • The process is repeated over multiple rounds.
  • Each time a round gets “solved” by the SOTA, those models are used to collect a new dataset where they fail.
  • Datasets will be released periodically as new examples are collected.
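The round-based process in the list above can be sketched as a small simulation. This is a toy illustration under loud assumptions, not Dynabench's implementation: the "model" is a one-keyword sentiment rule, "retraining" is simply memorising the examples that fooled it, and all names and example sentences are hypothetical.

```python
def predict(memory, text):
    """Toy model: recall a memorised label, otherwise guess from one keyword."""
    if text in memory:
        return memory[text]
    return "neg" if "bad" in text.lower() else "pos"


def run_round(memory, attempts):
    """Keep only the annotator attempts that the current model gets wrong."""
    return [(text, label) for text, label in attempts
            if predict(memory, text) != label]


def dynamic_benchmark(rounds_of_attempts):
    """Run several rounds; each round's failures become the next training data."""
    memory = {}        # stands in for the model's learned parameters
    error_rates = []   # share of attempts that fooled the model, per round
    datasets = []      # the adversarial dataset collected in each round
    for attempts in rounds_of_attempts:
        fooled = run_round(memory, attempts)
        error_rates.append(len(fooled) / len(attempts))
        datasets.append(fooled)
        memory.update(dict(fooled))  # crude stand-in for retraining
    return error_rates, datasets


# Hypothetical annotator attempts as (text, true label) pairs.
round_1 = [
    ("Not bad at all", "pos"),
    ("I loved it", "pos"),
    ("Oh great, another sequel", "neg"),
    ("Bad acting throughout", "neg"),
]
round_2 = [
    ("Not bad at all", "pos"),
    ("Oh great, another sequel", "neg"),
    ("I want those two hours back", "neg"),
    ("A feast for the eyes", "pos"),
]

error_rates, datasets = dynamic_benchmark([round_1, round_2])
print(error_rates)  # [0.5, 0.25]
```

The falling error rate shows the intended dynamic: examples that fooled the model in round one no longer work in round two, so annotators have to find new weaknesses. A real system would retrain a neural model rather than memorise strings.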

The key idea behind Dynabench is to leverage human creativity to challenge the models. Machines are nowhere close to comprehending language the way we humans do. Suppose a language model on Dynabench is asked to classify the sentiment of a review: the wit and hyperbole of human language can fool it. Human annotators keep adding such adversarial examples until the model can no longer be fooled. In this way, humans stay continuously in the loop as machines progress, unlike in traditional benchmarking.

For each task in Dynabench, there will be multiple rounds of evaluation. According to the researchers, the models are served in the cloud via TorchServe. Crowdsourced annotators will be connected to the platform via Mephisto, and humans interacting with the model receive almost instantaneous feedback on the model’s response. They can employ tactics such as making the system focus on the wrong word, or using clever references to real-world knowledge that the machine does not have access to.

That said, there are still risks such as catastrophic forgetting or cyclical “progress”, where improved models forget things that were relevant in an earlier round. “Research is required in trying to understand these shifts better, in characterising how it might impact learning, and in overcoming any adverse effects. Remember that Dynabench is a scientific experiment!” warned the researchers behind Dynabench.
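One simple way to watch for this kind of forgetting is to re-score each new model on the datasets from all earlier rounds and flag any round where accuracy drops. The sketch below is an assumption about how such a check could look, not something the Dynabench researchers describe; the two toy models and the round data are invented.

```python
def accuracy(model, dataset):
    """Share of (text, label) pairs that the model labels correctly."""
    correct = sum(1 for text, label in dataset if model(text) == label)
    return correct / len(dataset)


def forgetting_report(old_model, new_model, rounds, tolerance=0.0):
    """Map each round where the new model regressed to (before, after) accuracy."""
    regressions = {}
    for name, dataset in rounds.items():
        before = accuracy(old_model, dataset)
        after = accuracy(new_model, dataset)
        if after < before - tolerance:
            regressions[name] = (before, after)
    return regressions


def old_model(text):
    """Hypothetical earlier model: knows that 'bad' signals negative sentiment."""
    return "neg" if "bad" in text.lower() else "pos"


def new_model(text):
    """Hypothetical later model: learned sarcastic 'great' but forgot 'bad'."""
    return "neg" if "great" in text.lower() else "pos"


rounds = {
    "round1": [("Bad acting", "neg"), ("I loved it", "pos")],
    "round2": [("Oh great, another sequel", "neg"), ("Great fun", "pos")],
}

report = forgetting_report(old_model, new_model, rounds)
print(report)  # {'round1': (1.0, 0.5)}
```

Here the newer model handles the sarcastic example from round two but loses what the older model knew in round one, which is the cyclical “progress” the researchers warn about.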



Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
