
Why Tech Giants Introduced Their Own Benchmark for NLP

Recently, 55 researchers from more than 40 research labs and universities worldwide (including Intel, Microsoft, and Google Research) came together to introduce GEM, a benchmark for natural language generation (NLG).

GEM is claimed to be a living benchmark for the evaluation of NLG models. The benchmark allows models to be applied to a large set of corpora and evaluation strategies to be tested against them.



A subset of natural language processing (NLP), NLG automatically generates text from a non-linguistic or textual representation of information given as input. A natural language generation system needs to be robust to shifts in the data distribution and should be able to produce text in many different languages.
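To make the data-to-text idea concrete, here is a minimal sketch of a template-based NLG step: a structured, non-linguistic record goes in and a natural-language sentence comes out. The record fields and the `realise` function are illustrative assumptions, not part of GEM itself.

```python
def realise(record: dict) -> str:
    """Map a structured (non-linguistic) record to a natural-language sentence."""
    template = "In {city}, it is {temp} degrees and {sky} today."
    return template.format(**record)

# A toy structured input, standing in for table rows or RDF triples
# used by real data-to-text datasets such as ToTTo or WebNLG.
record = {"city": "Bangalore", "temp": 24, "sky": "cloudy"}
print(realise(record))
```

Neural NLG models replace the fixed template with a learned generator, which is precisely why robustness to distribution shift and output diversity become evaluation concerns.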

According to the researchers, it is often desirable that repeated interactions with the model produce diverse outputs, for instance to explain concepts in multiple ways or to make a conversational agent more engaging. However, these optimisation objectives can often conflict, and as a result, evaluations that focus on a single aspect may fail to recognise the drawbacks of a particular method.

Specifically, measuring progress in NLG models depends on an evolving ecosystem of automated metrics, datasets, and human evaluation standards. Because of these moving targets, new NLG models are often still evaluated on divergent, anglocentric corpora with well-established but flawed metrics. This makes it challenging to identify the limitations of current models and opportunities for progress.

Tech Behind GEM

GEM (Generation, Evaluation, and Metrics) aims to enable research on a wide range of natural language generation challenges. GEM focuses on an in-depth evaluation of natural language model outputs across human and automatic evaluation to uncover shortcomings and opportunities for progress. 

The researchers said, “As datasets, metrics, and models improve, the benchmark environment will improve as well, replacing ‘solved’ tasks with more challenging ones, incorporating newly developed metrics, and addressing discovered flaws in the experimental setup.” According to the team, wider adoption and more robust model evaluations can only be achieved by making all model outputs available under an open-source license.

GEM includes an initial set of eleven datasets that measure specific generation challenges, such as content selection and planning, surface realisation, paraphrasing, simplification, etc. The datasets include CommonGEN, E2E clean, MLSum, ToTTo, WebNLG, among others. 

GEM datasets also differ in aspects such as communicative goals, the noisiness of data, resource availability and languages to evaluate the consistency of evaluation schemes. The size of the datasets ranges from 5k to 500k data points. 

At present, GEM supports seven languages across all tasks. To assess the performance of the models, the researchers introduced challenging test sets that probe for specific modelling aspects.

Benefits of GEM

  • GEM measures NLG progress across 13 datasets spanning many NLG tasks and languages.
  • The benchmark provides an in-depth analysis of data and models presented via data statements and challenge sets.
  • GEM develops standards for evaluation of generated text using both automated and human metrics.
  • GEM can be used to develop reproducible and consistent human evaluation strategies for generated text.

Wrapping Up

According to the researchers, GEM provides a testbed for automated metrics and can be used to popularise newly developed ones. The benchmark aims to measure the progress in NLG without misrepresenting the complex interactions between the sometimes contradicting measures. 
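As a concrete illustration of the kind of automated metric such a testbed evaluates, below is a minimal sketch of clipped unigram precision, the building block of BLEU-style metrics commonly criticised in NLG evaluation. This is a simplified stand-in, not a metric defined by the GEM paper.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference,
    with counts clipped to the reference (BLEU-1-style clipping)."""
    cand_tokens = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    matched = sum(
        min(count, ref_counts.get(token, 0))
        for token, count in Counter(cand_tokens).items()
    )
    return matched / len(cand_tokens) if cand_tokens else 0.0

# A candidate can score highly on token overlap while differing in meaning,
# which is one reason single automated metrics can mislead.
print(unigram_precision("the cat sat on the mat", "the cat is on the mat"))
```

A metric like this rewards surface overlap only, which is exactly the kind of flaw a benchmark comparing automated scores against human judgements can surface.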

On a concluding note, the researchers stated, “By providing a testbed to easily conduct experiments across many datasets and evaluate in a repeatable, consistent, and more interpretable way, we will be able to track progress toward the goals in NLG research much more clearly.

Moreover, we will be able to extend and shape GEM in the future to include more multilingual datasets, which will assist in their adoption across the broader research community.”

Read the paper here.


Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
