Last year, Facebook AI introduced Dynabench, a platform for dynamic data collection and benchmarking that uses humans and NLP models to create challenging test datasets. The humans are tasked with finding adversarial examples that fool current state-of-the-art models.
Facebook has recently updated Dynabench with Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic NLP model comparison. With Dynaboard, you can perform apples-to-apples comparisons dynamically, free of issues such as bugs in evaluation code, backward incompatibility, inconsistent filtering of test data, limited accessibility, and other reproducibility problems.
Facebook is looking to push the industry towards a more rigorous, real-world evaluation of NLP models. Dynaboard enables researchers to customise a new ‘Dynascore’ metric based on accuracy, memory, compute, robustness and fairness.
Why Dynaboard?
Every person who uses leaderboards has a different set of preferences and goals. Dynascore evaluates model performance in a nuanced, comprehensive way.
For instance, even a 10x more accurate NLP model may be useless to an embedded systems engineer if it’s untenably large and slow. At the same time, a very fast, accurate model shouldn’t be considered high-performing if it doesn’t work smoothly for everyone. “AI researchers need to be able to make informed decisions about the tradeoffs of using a particular model,” said Facebook.
So far, benchmarks such as MNIST, ImageNet, SQuAD, SNLI and GLUE have played a crucial role in driving progress in AI research. But benchmarks are being replaced rapidly. Every time a new benchmark is introduced, researchers chase its leaderboard instead of solving the underlying problem, which in a way hinders the progress of research.
In the last few years, benchmarks have been saturating rapidly, especially in NLP. For instance, if you look at the above visuals, it took the research community 18 years to achieve human-level performance on MNIST and about six years to surpass humans on ImageNet. In contrast, it took about a year to beat humans on the GLUE benchmark for language understanding.
“Our journey is just getting started,” said Facebook, stating since the launch of Dynabench, it has collected over 400,000 examples and has released two new, challenging datasets. “Now, we have adversarial benchmarks for all four of our initial official tasks within Dynabench, which initially focus on language understanding.”
As part of the initial experiment, Facebook has used Dynaboard to rank current SOTA NLP models, including BERT, RoBERTa, ALBERT, T5, and DeBERTa, on the four core Dynabench tasks.
How does it work?
Dynascore allows researchers to tailor an evaluation by placing greater or lesser emphasis on each metric in a collection of tests.
As models are evaluated, Dynabench tracks which examples fool them into making incorrect predictions across the core tasks of natural language inference, question answering, hate speech detection and sentiment analysis.
Facebook said these examples further improve the systems and become part of more challenging datasets that train new models, which can be benchmarked to create a virtuous cycle of research progress. The way it works is, crowdsourced annotators connect to Dynabench and receive feedback on a model’s response. If annotators disagree with the original label, the example is discarded from the test set.
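The collection-and-validation loop described above can be sketched in a few lines. This is an illustrative reconstruction, not Facebook's actual implementation: the function names, the agreement threshold, and the toy model are all assumptions.

```python
# Hypothetical sketch of a Dynabench-style adversarial collection loop.
# The agreement threshold and the toy model below are illustrative
# assumptions, not Facebook's real pipeline.

def collect_adversarial(model, candidate_text, annotator_label):
    """An example counts as adversarial when the model's prediction
    disagrees with the human annotator's label."""
    return model(candidate_text) != annotator_label

def validate_example(original_label, extra_annotations, min_agreement=2):
    """Keep an adversarial example only if enough additional annotators
    agree with the original label; otherwise it is discarded."""
    agreements = sum(1 for label in extra_annotations if label == original_label)
    return agreements >= min_agreement

# Example: a trivial 'model' that always predicts "positive" is fooled
# by any correctly-labelled negative example.
always_positive = lambda text: "positive"
fooled = collect_adversarial(always_positive, "this film was dreadful", "negative")
```

Validated examples would then be added to the growing test set, closing the virtuous cycle the article describes.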
Checking the impact on an NLP model’s performance by adjusting various metric weights in Dynaboard (Source: Facebook AI)
Tech behind Dynascore
With Dynaboard, Facebook combines multiple metrics into a single score to rank models, rather than relying on the traditional disparate, static metrics, because a static leaderboard’s ranking cannot approximate every researcher’s preferences.
“Our approach is to borrow from microeconomics theory to find the ‘exchange rate’ between metrics that can standardise units across metrics, after which a weighted average is taken to calculate the Dynascore,” said Facebook. As the user adjusts the weights to approximate their utility function better, the models would be dynamically re-ranked in real-time.
Furthermore, Facebook uses the concept of the marginal rate of substitution (MRS) to compute the rate at which these tradeoffs are made. In economics, MRS is the amount of one good a consumer is willing to give up in exchange for another good while retaining the same utility.
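The exchange-rate idea can be sketched as follows. This is a minimal illustration of a Dynascore-style weighted average: the exchange rates (how much accuracy a researcher would trade for one unit of another metric) and the weights are made-up values, not Dynaboard's real ones.

```python
# Illustrative Dynascore-style aggregation: convert each metric into a
# common unit ("accuracy points") via an exchange rate, then take a
# weighted average. All numbers here are hypothetical.

def dynascore(metrics, exchange_rates, weights):
    """Standardise metric units via exchange rates, then average with
    the user-chosen weights."""
    converted = {name: value * exchange_rates[name] for name, value in metrics.items()}
    total_weight = sum(weights.values())
    return sum(weights[name] * converted[name] for name in metrics) / total_weight

metrics = {"accuracy": 0.90, "robustness": 0.80, "fairness": 0.85,
           "throughput": 120.0, "memory_gb": 4.0}
# Hypothetical rates: e.g. 1 example/sec of throughput ~ 0.001 accuracy points;
# memory carries a negative rate because more of it is worse.
rates = {"accuracy": 1.0, "robustness": 1.0, "fairness": 1.0,
         "throughput": 0.001, "memory_gb": -0.02}
weights = {"accuracy": 4, "robustness": 1, "fairness": 1,
           "throughput": 1, "memory_gb": 1}
score = dynascore(metrics, rates, weights)
```

Re-ranking in real time then amounts to recomputing this average whenever the user moves a weight slider.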
Fairness as a metric
The AI community is still in the early stages of understanding the challenges of fairness and potential algorithmic bias. There is no single, widely agreed definition of fairness. “Similar to the measurement of robustness, as a first version, we perform perturbations of original datasets by changing, for instance, noun phrase gender (replacing ‘sister’ with ‘brother,’ or ‘he’ with ‘they’) and by substituting names with others that are statistically predictive of another race or ethnicity,” explained Facebook.
In the Dynaboard scoring, an NLP model is considered more ‘fair’ if its predictions don’t change after such a perturbation. “Although approaches like ours that replace words have become a common method in NLP for measuring fairness, this metric is far from perfect,” said Facebook.
Heuristically replacing ‘his’ with ‘hers’ or ‘her’ makes sense given English grammar but sometimes results in mistakes. For example, replacing ‘his’ with ‘her’ in a sentence such as ‘this cat is his’ yields ‘this cat is her,’ which is ungrammatical (the possessive should be ‘hers’) and no longer carries the same meaning.
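The word-swap check can be sketched as below. The swap table and the toy classifier are illustrative assumptions; the real Dynaboard perturbations are more extensive.

```python
# Minimal sketch of the fairness perturbation described above: swap
# gendered tokens word-by-word, then check whether the model's
# prediction changes. The swap table and toy model are assumptions.

GENDER_SWAPS = {"he": "they", "his": "her", "sister": "brother"}

def perturb(text):
    """Token-level replacement -- the heuristic, error-prone step."""
    return " ".join(GENDER_SWAPS.get(tok, tok) for tok in text.split())

def is_fair_on(model, text):
    """A model counts as 'fair' on this example if the perturbation
    does not change its prediction."""
    return model(text) == model(perturb(text))

# A word-count classifier is trivially invariant to word-for-word swaps.
word_count_model = lambda text: "long" if len(text.split()) > 4 else "short"

perturbed = perturb("this cat is his")  # reproduces the article's failure case
```

Note that `perturb("this cat is his")` produces the ungrammatical ‘this cat is her’ discussed above, which is exactly why the article calls this metric far from perfect.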
Facebook believes the AI community will build on these capabilities and progress on devising better metrics for specific contexts for evaluating relevant dimensions of ‘fairness’ in the future.
“We hope Dynabench will help the AI community build systems that make fewer mistakes, are less subject to potentially harmful biases, and are more useful and beneficial to people in the real world,” concluded Facebook.