
Facebook Launches An Evaluation-As-A-Service Framework For ML Models


Last year, Facebook AI introduced Dynabench, a platform for dynamic data collection and benchmarking that puts humans and NLP models in the loop to create challenging test datasets. The humans are tasked with finding adversarial examples that fool current state-of-the-art models.

Facebook has recently updated Dynabench with Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic NLP model comparison. With Dynaboard, researchers can perform apples-to-apples comparisons dynamically, free of the reproducibility problems that plague static evaluation: bugs in evaluation code, backward incompatibility, inconsistencies in filtering test data, and accessibility issues.

Facebook is looking to push the industry towards more rigorous, real-world evaluation of NLP models. Dynaboard enables researchers to customise a new ‘Dynascore’ metric based on accuracy, memory, compute, robustness and fairness. 

Why Dynaboard? 

Every person who uses leaderboards has a different set of preferences and goals. Dynascore evaluates model performance in a nuanced, comprehensive way that can reflect those preferences.

For instance, even a 10x more accurate NLP model may be useless to an embedded systems engineer if it’s untenably large and slow. At the same time, a very fast, accurate model shouldn’t be considered high-performing if it doesn’t work smoothly for everyone. “AI researchers need to be able to make informed decisions about the tradeoffs of using a particular model,” said Facebook. 

So far, benchmarks such as MNIST, ImageNet, SQuAD, SNLI and GLUE have played a crucial role in driving progress in AI research. But they are saturating rapidly. Every time a new benchmark is introduced, researchers end up chasing it instead of solving the underlying problem, which, in a way, hinders the progress of research.

Chart: time taken to reach human-level performance on key benchmarks (Source: Facebook AI) 

In the last few years, the benchmarks have been saturating rapidly, especially in NLP. For instance, if you look at the above visuals, it took the research community 18 years to achieve human-level performance on MNIST and about six years to surpass humans on ImageNet. In contrast, it took about a year to beat humans on the GLUE benchmark for language understanding. 

“Our journey is just getting started,” said Facebook, noting that since the launch of Dynabench, it has collected over 400,000 examples and released two new, challenging datasets. “Now, we have adversarial benchmarks for all four of our initial official tasks within Dynabench, which initially focus on language understanding.” 

As part of the initial experiment, Facebook has used Dynaboard to rank current SOTA NLP models, including BERT, RoBERTa, ALBERT, T5, and DeBERTa, on the four core Dynabench tasks. 

How does it work? 

Dynascore allows researchers to tailor an evaluation by placing greater or lesser emphasis on each test in a collection of tests. 

As models are evaluated, Dynabench tracks which examples fool them into making incorrect predictions across its core tasks: natural language inference, question answering, hate speech detection and sentiment analysis.

Facebook said these examples further improve the systems and become part of more challenging datasets that train new models, which can in turn be benchmarked, creating a virtuous cycle of research progress. In practice, crowdsourced annotators connect to Dynabench, submit examples and receive immediate feedback on a model’s response. If annotators disagree with the original label, the example is discarded from the test set.
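In outline, the loop looks something like the sketch below. The function names and interfaces are hypothetical stand-ins, not Dynabench’s actual API; it is only meant to convey the collect-verify-retrain cycle.

# Hypothetical sketch of Dynabench-style adversarial data collection.
# `get_example`, `predict` and `verify_label` are illustrative callables,
# not Dynabench's real interfaces.

def collect_adversarial_examples(get_example, predict, verify_label, n_attempts=1000):
    new_test_set = []
    for _ in range(n_attempts):
        # An annotator writes an example intended to fool the current model
        text, human_label = get_example()
        model_label = predict(text)        # annotator sees the model's response immediately
        if model_label != human_label:     # the model was fooled
            # Other annotators validate the label; disputed examples are discarded
            if verify_label(text, human_label):
                new_test_set.append((text, human_label))
    return new_test_set                    # feeds harder test data for the next round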

Adjusting various metric weights in Dynaboard to check the impact on an NLP model’s ranking (Source: Facebook AI) 

Tech behind Dynascore 

With Dynaboard, Facebook combines multiple metrics into a single score to rank models, rather than relying on the traditional approach of disparate, static metrics, since a static leaderboard’s fixed ranking cannot approximate every researcher’s preferences.

“Our approach is to borrow from microeconomics theory to find the ‘exchange rate’ between metrics that can standardise units across metrics, after which a weighted average is taken to calculate the Dynascore,” said Facebook. As users adjust the weights to better approximate their utility function, the models are dynamically re-ranked in real time. 
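As a rough illustration of the weighted-average idea, the sketch below converts each metric into common units with placeholder exchange rates and re-ranks models when the weights change. The model names, numbers and the conversion step are assumptions made for illustration; Dynaboard derives its actual exchange rates from the MRS calculation described next.

# Illustrative Dynascore-style aggregation; exchange rates and figures are placeholders.

def dynascore(metrics, weights, exchange_rates):
    # Convert each metric into common units, then take a weighted average
    converted = {name: value * exchange_rates[name] for name, value in metrics.items()}
    total_weight = sum(weights.values())
    return sum(weights[name] * converted[name] for name in converted) / total_weight

models = {
    "model_a": {"accuracy": 0.90, "throughput": 120.0, "fairness": 0.95},
    "model_b": {"accuracy": 0.88, "throughput": 400.0, "fairness": 0.97},
}
rates = {"accuracy": 1.0, "throughput": 0.001, "fairness": 1.0}   # placeholder exchange rates
weights = {"accuracy": 0.5, "throughput": 0.3, "fairness": 0.2}   # user-adjustable emphasis

# Re-ranking happens whenever the user changes the weights
ranking = sorted(models, key=lambda name: dynascore(models[name], weights, rates), reverse=True)
print(ranking)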

Furthermore, Facebook uses the marginal rate of substitution (MRS) to compute the rate at which these adjustments or tradeoffs are made. In economics, the MRS is the amount of one good a consumer is willing to give up for another good while deriving the same utility. 
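For reference, the textbook definition of the MRS between two goods x and y is the slope of an indifference curve, i.e. the ratio of their marginal utilities. Dynaboard applies the same idea to evaluation metrics (for instance, how much accuracy users would trade for lower memory use); the formula below is only the general economic definition, not Dynaboard’s exact estimation procedure.

\[
\mathrm{MRS}_{xy} \;=\; -\left.\frac{dy}{dx}\right|_{U=\text{const}} \;=\; \frac{MU_x}{MU_y}
\]

where MU_x and MU_y are the marginal utilities of the two goods (here, two evaluation metrics).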

Fairness as a metric  

The AI community is still in the early stages of understanding the challenges of fairness and potential algorithmic bias. There is no single, widely agreed definition of fairness. “Similar to the measurement of robustness, as a first version, we perform perturbations of original datasets by changing, for instance, noun phrase gender (replacing ‘sister’ with ‘brother,’ or ‘he’ with ‘they’) and by substituting names with others that are statistically predictive of another race or ethnicity,” explained Facebook. 

In the Dynaboard scoring, an NLP model is considered more ‘fair’ if its predictions don’t change after such a perturbation. “Although approaches like ours that replace words have become a common method in NLP for measuring fairness, this metric is far from perfect,” said Facebook. 
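A simplified version of this check might look like the sketch below, where a naive word-level substitution table and a generic predict callable stand in for Dynaboard’s more careful perturbations; the names and table are assumptions for illustration, not the real implementation.

# Hypothetical perturbation-based fairness check; the substitution table
# and `predict` callable are illustrative, not Dynaboard's implementation.

GENDER_SWAPS = {"sister": "brother", "brother": "sister",
                "he": "they", "she": "they", "his": "her"}

def perturb(text, swaps=GENDER_SWAPS):
    # Naive word-by-word replacement; see the caveat about 'his' vs 'hers' below
    return " ".join(swaps.get(token, token) for token in text.split())

def fairness_score(predict, examples):
    # Fraction of examples whose prediction is unchanged after perturbation
    unchanged = sum(predict(text) == predict(perturb(text)) for text in examples)
    return unchanged / len(examples)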

Heuristically replacing ‘his’ with ‘hers’ or ‘her’ makes sense given English grammar but sometimes results in mistakes. For example, replacing ‘his’ with ‘her’ in a sentence such as ‘this cat is his’ yields ‘this cat is her’, which doesn’t preserve the meaning; the correct substitution here would be ‘hers’. 

Facebook believes the AI community will build on these capabilities and devise better, context-specific metrics for evaluating the relevant dimensions of ‘fairness’ in the future. 

“We hope Dynabench will help the AI community build systems that make fewer mistakes, are less subject to potentially harmful biases, and are more useful and beneficial to people in the real world,” concluded Facebook. 

Amit Raja Naik

Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.