For a few years now, research on natural language understanding has focused on improving performance on benchmark datasets, which feature roughly independent and identically distributed (IID) training, validation, and test sections drawn from data collected or annotated by crowdsourcing.
The researchers stated: “Progress suffers in the absence of a trustworthy metric for benchmark-driven work: Newcomers and non-specialists are discouraged from trying to contribute, and specialists are given significant freedom to cherry-pick ad-hoc evaluation settings that mask a lack of progress.”
According to the researchers, unreliable and biased systems score so highly on the standard benchmarks that little room is left for researchers who develop better systems to demonstrate their improvements.
The concerns about the standard benchmarks which motivated methods like adversarial filtering are justified. However, they can and should be addressed directly, and it is possible and reasonable to do so in the context of static and IID evaluation, the paper argued.
Performance on popular benchmarks is exceptionally high. However, most of the time, researchers and experts can easily find issues with high-scoring models. This has become a serious concern among researchers. The paper proposed four key criteria that good benchmarks should satisfy.
The criteria are meant to support building machines that demonstrate a reliable and comprehensive understanding of everyday natural language text in the context of a specific well-posed task, language variety, and topic domain.
Among the language understanding tasks, the researchers focused on those that use labelled data and are designed to test relatively general language understanding skills. The design of such benchmarks can be challenging.
The researchers said effective future benchmarks for NLU tasks should satisfy the following four criteria.
1| Validity
A benchmark should be a valid measure of model ability. This criterion is difficult to formalise fully, and no simple test currently exists to determine whether a benchmark meets it. The researchers identified three minimal requirements for a benchmark to meet this criterion:
- An evaluation dataset should reflect the full range of linguistic variation—including words and higher-level constructions—used in the relevant domain, context, and language variety.
- An evaluation dataset should have a plausible means by which it tests all of the language-related behaviours that the model is expected to show in the context of the task.
- An evaluation dataset should be sufficiently free of annotation artefacts that a system cannot reach near-human performance levels by any means other than demonstrating the required language-related behaviours.
2| Reliable annotation
The labels for the test examples should be reliably correct, which can only be achieved by avoiding three failure cases:
- Examples carelessly mislabelled
- Examples with no clear correct label due to unclear or underspecified task guidelines
- Examples with no clear correct label under the relevant metric due to legitimate disagreements in interpretation among annotators.
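One rough way to spot the last two failure cases is to collect several annotations per test example and flag the examples where no label wins a clear majority. The sketch below is only an illustration and is not taken from the paper: the function name `flag_unreliable`, the 0.8 agreement threshold, and the toy labels are all assumptions.

```python
from collections import Counter


def flag_unreliable(annotations, min_agreement=0.8):
    """Return IDs of examples whose most common label falls below the agreement threshold.

    `annotations` maps an example ID to the list of labels given by
    different annotators. The 0.8 default is an arbitrary, hypothetical cut-off.
    """
    flagged = []
    for example_id, labels in annotations.items():
        if not labels:
            # An example with no annotations cannot be trusted at all.
            flagged.append(example_id)
            continue
        # Share of annotators who chose the most common label.
        _, top_count = Counter(labels).most_common(1)[0]
        if top_count / len(labels) < min_agreement:
            flagged.append(example_id)
    return flagged


# Toy data (hypothetical): five annotators per example.
toy_annotations = {
    "ex1": ["entailment"] * 5,
    "ex2": ["entailment", "entailment", "neutral", "contradiction", "neutral"],
}
unreliable = flag_unreliable(toy_annotations)
```

Flagged examples would still need a manual look, since low agreement can come from careless labelling, unclear guidelines, or a genuine difference in interpretation.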
3| Statistical power
The benchmark evaluation datasets should be large and discriminative enough to detect any qualitatively relevant performance difference between two models.
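To get a feel for what "large enough" can mean, the sketch below approximates the power of a two-sided two-proportion z-test for an accuracy gap between two models. It is a back-of-the-envelope illustration, not the paper's method: it assumes each model is scored on its own independent sample of `n` examples (a paired test on a shared test set would behave differently), uses a normal approximation, and the function name `approx_power` and the example accuracies are made up.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(acc_a, acc_b, n, alpha=0.05):
    """Approximate power to detect the gap between two accuracies.

    Assumes a two-sided two-proportion z-test with n independent
    examples per model (a simplifying assumption for illustration).
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    pooled = (acc_a + acc_b) / 2
    # Standard error of the accuracy difference under the null (equal accuracies).
    se_null = sqrt(2 * pooled * (1 - pooled) / n)
    # Standard error under the alternative (the stated accuracies are true).
    se_alt = sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    # Probability that the observed gap clears the critical value
    # (the negligible opposite tail is ignored).
    return norm.cdf((abs(acc_a - acc_b) - z_crit * se_null) / se_alt)


# Hypothetical comparison: 90% vs 91% accuracy at two test-set sizes.
power_small = approx_power(0.90, 0.91, n=500)
power_large = approx_power(0.90, 0.91, n=20000)
```

Under these assumptions, a one-point gap near 90% accuracy is very unlikely to be detected with a few hundred test examples and needs tens of thousands before detection becomes likely.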
4| Disincentives for biased models
The researchers said a benchmark should, in general, favour a model without socially relevant biases over an otherwise equivalent model with such biases. Most current benchmarks fail this test because they are often built around naturally occurring or crowdsourced text, where a system can usually improve its performance by adopting heuristics that reproduce potentially harmful biases.
The researchers stated that benchmarking for natural language understanding (NLU) is broken. They argued that most current benchmarks fail to meet these four criteria, and that adversarial data collection does not meaningfully address the causes of these failures.