Google’s Latest Guidelines To Build Better NLU Benchmarks

Evaluation for many natural language understanding (NLU) tasks is broken.

Recently, researchers from Google Brain and New York University laid out four criteria to fix the issues plaguing natural language understanding (NLU) benchmarks.

For a few years now, research on natural language understanding has focused on improving benchmark datasets that feature roughly independent and identically distributed (IID) training, validation, and test sections, drawn from data collected or annotated through crowdsourcing.
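The IID setup the paper refers to simply means the train, validation, and test sections are random partitions of one pool of examples, so all three follow the same distribution. A minimal sketch of such a split (the function name and fractions are illustrative, not from the paper):

```python
import random

def iid_split(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a dataset and partition it so the train, validation,
    and test sections are (roughly) identically distributed."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # remainder becomes the test section
    return train, val, test

train, val, test = iid_split(list(range(1000)))
```

Because every section is drawn from the same shuffled pool, no section is systematically harder than another, which is exactly the property adversarial filtering gives up.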


The researchers stated: “Progress suffers in the absence of a trustworthy metric for benchmark-driven work: Newcomers and non-specialists are discouraged from trying to contribute, and specialists are given significant freedom to cherry-pick ad-hoc evaluation settings that mask a lack of progress.”

According to the researchers, unreliable and biased systems score much higher on standard benchmarks, leaving little room for researchers who want to develop better systems to demonstrate their improvements.

The concerns about standard benchmarks that motivated methods like adversarial filtering are justified. However, the paper argued, these concerns can and should be addressed directly, and it is possible and reasonable to do so within static, IID evaluation.

The criteria

Performance on popular benchmarks is exceptionally high. However, most of the time, researchers and experts can easily find issues with high-scoring models. This has become a serious concern among researchers. The paper proposed four key criteria that good benchmarks should satisfy. 

The criteria are intended to guide the construction of machines that can demonstrate a reliable and comprehensive understanding of everyday natural language text in the context of a specific well-posed task, language variety, and topic domain.

Among language understanding tasks, the researchers focused on those that use labelled data and are designed to test relatively general language understanding skills. The design of such benchmarks can be challenging.

The researchers said effective future benchmarks for NLU tasks should satisfy these four criteria.

1| Validity

This criterion is difficult to formalise fully, and no simple test currently exists to determine if a benchmark presents a valid measure of model ability. The researchers identified three minimal requirements for a benchmark to meet this criterion:

  • An evaluation dataset should reflect the full range of linguistic variation—including words and higher-level constructions—used in the relevant domain, context, and language variety.
  • An evaluation dataset should offer a plausible means of testing all of the language-related behaviours that the model is expected to show in the context of the task.
  • An evaluation dataset should be sufficiently free of annotation artefacts that a system cannot reach near-human performance by any means other than demonstrating the required language-related behaviours.

2| Reliable annotation

The labels for the test examples should be reliably correct. This can only be achieved by avoiding three failure cases:

  • Examples that are carelessly mislabelled
  • Examples with no clear correct label due to unclear or underspecified task guidelines
  • Examples with no clear correct label under the relevant metric due to legitimate disagreements in interpretation among annotators. 
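The paper does not prescribe a specific agreement metric, but a common way to quantify how reliably two annotators label the same examples is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch (the function and example labels are illustrative):

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the
    same examples. Assumes at least two label categories occur, so
    expected agreement is strictly below 1."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of examples labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: probability of agreeing by chance, given
    # each annotator's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )
    return (observed - expected) / (1 - expected)

kappa = cohens_kappa([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0])  # ≈ 0.667
```

A kappa well below 1 on a sample of the test set is a warning sign of the second and third failure cases above: guidelines that underspecify the task, or examples with legitimately disputed labels.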

3| Statistical power

The benchmark evaluation datasets should be large and discriminative enough to detect any qualitatively relevant performance difference between two models.
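To make this concrete, a rough back-of-the-envelope calculation shows how test-set size limits the smallest accuracy gap a benchmark can reliably detect. The sketch below treats each model's accuracy as an independent binomial proportion, which ignores that both models share one test set (a paired test would need fewer examples), so it is an illustrative upper bound rather than the paper's method:

```python
import math

def min_test_size(acc=0.9, min_diff=0.01, z=1.96):
    """Rough estimate of how many test examples are needed to
    distinguish two models whose true accuracies differ by
    `min_diff`, when both sit near `acc`, at roughly the z=1.96
    (95%) significance level."""
    # Variance of the difference of two independent proportions,
    # both near `acc`, is about 2 * acc * (1 - acc) / n.
    # We require min_diff to exceed z standard errors.
    var_per_example = 2 * acc * (1 - acc)
    return math.ceil((z / min_diff) ** 2 * var_per_example)

n = min_test_size(acc=0.9, min_diff=0.01)  # thousands of examples
```

Even under these generous independence assumptions, separating two models near 90% accuracy by a single point requires several thousand test examples, which is why the paper argues that small evaluation sets cannot certify genuine progress.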

4| Disincentives for biased models

The researchers said a benchmark should, in general, favour a model without socially relevant biases over an otherwise equivalent model with such biases. Most current benchmarks fail this test, as they are often built around naturally occurring or crowdsourced text. In such settings, a system can often improve its score by adopting heuristics that reproduce potentially harmful biases.

Rounding up

The researchers stated that benchmarking for natural language understanding (NLU) is broken. They argued that most current benchmarks fail at least one of these four criteria, and that adversarial data collection does not meaningfully address the causes of these failures.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
