Rethinking The Way We Benchmark Machine Learning Models


“Unless you have confidence in the ruler’s reliability, if you use a ruler to measure a table, you may also be using the table to measure the ruler.”

Wittgenstein’s ruler

Do machine learning researchers solve something significant every time they top a benchmark? If not, why do we have these benchmarks? Benchmarks do guide researchers and their research objectives. But if a benchmark is beaten every couple of months, research can become more about chasing benchmarks than solving bigger problems.

In order to address these challenges, researchers at Facebook AI have introduced Dynabench, a new platform for dynamic data collection and benchmarking. Dynabench can be used to collect human-in-the-loop data dynamically, against the current state-of-the-art, in a way that more accurately measures progress.


What’s Wrong With Current Benchmarks

Benchmarks are meant to challenge the ML community over long periods. But the pace of progress in AI can saturate existing benchmarks quickly. With a new NLP model being released almost every two months, benchmarks soon fall behind.

Static benchmarks also lure researchers into overfitting their models to the benchmark. “Researchers have built lucrative careers from cranking out percentage-point improvements to claim ‘SOTA’ on established benchmarks,” stated the researchers at Facebook.

Added to this are the well-documented cases of inadvertent biases that may be present in datasets. For example, in a Q&A experiment, the answer to a “how much” or “how many” question is usually “2”, so a model can score well by guessing the common answer without understanding the question. There might also be unintended overlap between the train and test sets. Such data biases are almost impossible to avoid and may have very serious and potentially harmful side effects.
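Both problems can be checked for with very little code. The following is a minimal sketch, not anything taken from Dynabench: the function names (`find_overlap`, `prior_baseline`) and the tiny question sets are made up for illustration. It flags test questions that also appear in the training set after light normalisation, and measures how far a model could get by always guessing the most common answer to a “how many” question.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and collapse whitespace so near-identical strings match."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def find_overlap(train, test):
    """Return the test items whose normalised form also occurs in the training set."""
    seen = {normalize(item) for item in train}
    return [item for item in test if normalize(item) in seen]


def prior_baseline(examples, prefix):
    """Most common answer for questions starting with `prefix`, and the accuracy of always guessing it."""
    answers = [a for q, a in examples if q.lower().startswith(prefix)]
    if not answers:
        return None, 0.0
    guess, count = Counter(answers).most_common(1)[0]
    return guess, count / len(answers)


# Hypothetical toy data, invented for this example only.
train_questions = ["How many wheels does a bicycle have?", "What colour is the sky?"]
test_questions = ["how many wheels does a bicycle have", "How much does a stamp cost?"]
leaked = find_overlap(train_questions, test_questions)

qa_pairs = [
    ("How many eyes does a cat have?", "2"),
    ("How many wheels does a bicycle have?", "2"),
    ("How many legs does a dog have?", "4"),
    ("What colour is the sky?", "blue"),
]
guess, guess_accuracy = prior_baseline(qa_pairs, "how many")
```

If `guess_accuracy` is high, a strong benchmark score says little about whether a model actually reads the question.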

Benchmarks are static for historical reasons. Until recently, we did not have crowdsourcing platforms or the capability to serve large-scale models for inference. Benchmarks were expensive to collect, took a long time to saturate, and models had a long way to go. Putting humans and models in the data collection loop together made little sense, since models were simply too brittle.

With recent advances, however, the Facebook researchers wrote, models are good enough to be put in the loop with humans, to measure the problem we really care about: how well can AI systems work together with humans.

Introducing Dynabench

The basic idea is that we collect data dynamically. Humans are tasked with finding adversarial examples that fool current state-of-the-art models. 

So, what does Dynabench actually do?


  • It allows researchers to measure how good the current SOTA methods really are.
  • It yields data that may be used to further train even stronger SOTA models. 
  • The process is repeated over multiple rounds.
  • Each time a round gets “solved” by the SOTA, those models are used to collect a new dataset where they fail. 
  • Datasets will be released periodically as new examples are collected.
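The round-based process in the list above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not Dynabench's implementation: `collect_round`, `run_rounds`, the keyword classifier and the memorising `train_fn` are all hypothetical stand-ins, and the review texts are invented.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]      # (text, true label)
Model = Callable[[str], str]   # maps text to a predicted label


def collect_round(model: Model, candidates: List[Example]) -> List[Example]:
    """Keep only the human-written candidates that the current model gets wrong."""
    return [(text, label) for text, label in candidates if model(text) != label]


def run_rounds(model: Model, train_fn, candidate_batches):
    """Each round: collect fooling examples against the current model, then retrain on all of them."""
    collected: List[Example] = []
    error_rates: List[float] = []
    for batch in candidate_batches:
        fooled = collect_round(model, batch)
        error_rates.append(len(fooled) / len(batch) if batch else 0.0)
        collected.extend(fooled)
        model = train_fn(collected)  # the retrained model becomes the target of the next round
    return model, collected, error_rates


def keyword_model(text: str) -> str:
    """Toy stand-in for a SOTA model: calls a review negative if it contains 'bad'."""
    return "negative" if "bad" in text.lower() else "positive"


def train_fn(data: List[Example]) -> Model:
    """Toy 'training': memorise the collected examples, fall back to the keyword rule otherwise."""
    memory = dict(data)
    return lambda text: memory.get(text, keyword_model(text))


batches = [
    [("Not bad at all", "positive"), ("Great film", "positive"),
     ("I want my two hours back", "negative")],
    [("Not bad at all", "positive"), ("Oh great, another sequel", "negative")],
]
final_model, dataset, error_rates = run_rounds(keyword_model, train_fn, batches)
```

In this toy run the model is fooled by two of three examples in the first round and by one of two in the second. A real system would replace `train_fn` with actual retraining, where memorising the collected examples is exactly what one would want to avoid.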

The key idea behind Dynabench is to leverage human creativity to challenge the models. Machines are nowhere close to comprehending language the way humans do. Suppose a language model on Dynabench has to classify the sentiment of a review: the wit and hyperbole of natural language can fool it. Human annotators keep adding such adversarial examples until the model can no longer be fooled. In this way, humans stay continuously in the loop as models progress, unlike in traditional benchmarking.

For each task in Dynabench, there will be multiple rounds of evaluation. According to the researchers, the models are served in the cloud via TorchServe. Crowdsourced annotators connect to the platform via Mephisto, and humans interacting with a model receive almost instantaneous feedback on its response. Annotators can employ tactics such as making the system focus on the wrong word or using clever references to real-world knowledge that the machine does not have access to.

That said, there are still risks such as catastrophic forgetting or cyclical “progress”, where improved models forget things that were relevant in an earlier round. “Research is required in trying to understand these shifts better, in characterising how it might impact learning, and in overcoming any adverse effects. Remember that Dynabench is a scientific experiment!” warned the researchers behind Dynabench.
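One simple way to watch for this kind of forgetting is to re-score every new model on the data from all earlier rounds and compare it with its predecessor. The sketch below assumes that approach. `find_regressions`, the two toy models and the example rounds are hypothetical and are not part of Dynabench.

```python
def accuracy(model, examples):
    """Fraction of (text, label) pairs the model labels correctly."""
    if not examples:
        return 0.0
    return sum(model(text) == label for text, label in examples) / len(examples)


def find_regressions(old_model, new_model, rounds, tolerance=0.0):
    """List (round index, old accuracy, new accuracy) wherever the new model does worse."""
    regressed = []
    for index, examples in enumerate(rounds):
        old_acc = accuracy(old_model, examples)
        new_acc = accuracy(new_model, examples)
        if new_acc < old_acc - tolerance:
            regressed.append((index, old_acc, new_acc))
    return regressed


def old_model(text):
    """Toy earlier model: any review containing 'bad' is negative."""
    return "negative" if "bad" in text else "positive"


def new_model(text):
    """Toy later model that overcorrected after seeing 'not bad' and now always says positive."""
    return "positive"


# Hypothetical data from two rounds, invented for this example.
rounds = [
    [("good", "positive"), ("bad", "negative")],
    [("not bad", "positive")],
]
regressions = find_regressions(old_model, new_model, rounds)
```

Here the newer model solves the second round but drops from 100% to 50% on the first, which is the cyclical “progress” the researchers warn about.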

Learn more about Dynabench here.


Copyright Analytics India Magazine Pvt Ltd
