Recently, 55 researchers from more than 40 research labs and universities worldwide (including Intel, Microsoft, and Google Research) came together to introduce GEM, a benchmark for natural language generation (NLG).
GEM is designed as a living benchmark for evaluating NLG models. It allows models to be applied across a large set of corpora and enables evaluation strategies to be tested.
A subset of natural language processing (NLP), NLG automatically generates text from a non-linguistic or textual representation of information given as input. A natural language generation system needs to be robust to shifts in the data distribution and able to produce text in many different languages.
According to the researchers, it is often desirable that repeated interactions with a model produce diverse outputs, for instance to explain concepts in multiple ways or to make a conversational agent more engaging. However, these optimisation objectives can conflict with one another, and as a result, evaluations that focus on a single aspect may fail to recognise the drawbacks of a particular method.
Specifically, measuring progress in NLG depends on an evolving ecosystem of automated metrics, datasets, and human evaluation standards. Because of these moving targets, new natural language models are often still evaluated on divergent anglo-centric corpora with well-established but flawed metrics. This makes it challenging to identify the limitations of current models and opportunities for progress.
Tech Behind GEM
GEM (Generation, Evaluation, and Metrics) aims to enable research on a wide range of natural language generation challenges. GEM focuses on an in-depth evaluation of natural language model outputs across human and automatic evaluation to uncover shortcomings and opportunities for progress.
The researchers said, “As datasets, metrics, and models improve, the benchmark environment will improve as well, replacing “solved” tasks with more challenging ones, incorporating newly developed metrics, and addressing discovered flaws in the experimental setup.” They add that the adoption and robustness of model evaluations can only be increased by making all model outputs available under an open-source license.
GEM includes an initial set of eleven datasets that measure specific generation challenges, such as content selection and planning, surface realisation, paraphrasing, and simplification. The datasets include CommonGen, E2E clean, MLSum, ToTTo, and WebNLG, among others.
GEM datasets also differ in aspects such as communicative goals, the noisiness of data, resource availability and languages to evaluate the consistency of evaluation schemes. The size of the datasets ranges from 5k to 500k data points.
At present, GEM supports seven languages across all tasks. To assess the performance of the models, the researchers introduced challenging test sets that probe for specific modelling aspects.
Benefits of GEM
- GEM measures NLG progress across 13 datasets spanning many NLG tasks and languages.
- The benchmark provides an in-depth analysis of data and models presented via data statements and challenge sets.
- GEM develops standards for evaluation of generated text using both automated and human metrics.
- GEM can be used to develop reproducible and consistent human evaluation strategies for generated text.
According to the researchers, GEM provides a testbed for automated metrics and can be used to popularise newly developed ones. The benchmark aims to measure the progress in NLG without misrepresenting the complex interactions between the sometimes contradicting measures.
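To make the idea of an automated NLG metric concrete, below is a minimal, illustrative sketch of an n-gram overlap score in the style of BLEU's modified precision. This is a toy example for intuition only; it is not one of the metrics shipped with GEM, and the function names are our own.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_precision(candidate, reference, n=2):
    """Fraction of candidate n-grams also found in the reference,
    with counts clipped to the reference (as in BLEU's modified precision)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not cand:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

# Compare a hypothetical model output against a human-written reference:
# 3 of the 5 candidate bigrams appear in the reference, so the score is 0.6.
score = overlap_precision("the cat sat on the mat", "the cat is on the mat")
print(score)  # → 0.6
```

Metrics of this family are exactly the "well-established but flawed" measures the GEM authors mention: they reward surface overlap, so a fluent paraphrase with different wording scores poorly, which is why a testbed for newer metrics is valuable.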
On a concluding note, the researchers stated, “By providing a testbed to easily conduct experiments across many datasets and evaluate in a repeatable, consistent, and more interpretable way, we will be able to track progress toward the goals in NLG research much more clearly.
Moreover, we will be able to extend and shape GEM in the future to include more multilingual datasets, which will assist in their adoption across the broader research community.”
Read the paper here.