
Why Tech Giants Introduced Their Own Benchmark for NLP


Recently, 55 researchers from more than 40 research labs and academic institutions worldwide (including Intel, Microsoft and Google Research) came together to introduce GEM, a benchmark for natural language generation (NLG).

GEM is described as a living benchmark for the evaluation of NLG models. The benchmark allows models to be applied across a large set of corpora and evaluation strategies to be tested against them.

A subset of natural language processing (NLP), NLG automatically generates text from a non-linguistic or textual representation of information. A natural language generation system needs to be robust to shifts in the data distribution and able to produce text in many different languages.

According to the researchers, it is often desirable that repeated interactions with a model produce diverse outputs, for instance to explain concepts in multiple ways or to make a conversational agent more engaging. However, these optimisation objectives can conflict with one another, and evaluations that focus on a single aspect may fail to recognise the drawbacks of a particular method.
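To make the diversity point concrete, here is a minimal sketch of how sampling-based decoding yields different outputs on repeated calls, using the Hugging Face transformers API; the model choice and decoding settings here are illustrative assumptions, not something prescribed by GEM.

```python
# Minimal sketch: diverse outputs from one prompt via sampling-based
# decoding. Model ("gpt2") and settings are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("A graph is a data structure that", return_tensors="pt")

# Nucleus sampling draws a different continuation on each call,
# so repeated interactions yield diverse text.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    max_new_tokens=30,
    num_return_sequences=3,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```

Greedy decoding would return the same continuation every time; sampling trades some of that determinism for variety, which is exactly the kind of conflicting objective a single-metric evaluation can miss.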

Specifically, measuring progress in NLG depends on an evolving ecosystem of automated metrics, datasets and human evaluation standards. Because of these moving targets, new natural language models are often still evaluated on divergent anglo-centric corpora with well-established but flawed metrics. This makes it challenging to identify the limitations of current models and opportunities for progress.

Tech Behind GEM

GEM (Generation, Evaluation, and Metrics) aims to enable research on a wide range of natural language generation challenges. It focuses on an in-depth assessment of model outputs across both human and automatic evaluation, to uncover shortcomings and opportunities for progress.

The researchers said, “As datasets, metrics, and models improve, the benchmark environment will improve as well, replacing ‘solved’ tasks with more challenging ones, incorporating newly developed metrics, and addressing discovered flaws in the experimental setup.” They add that broader adoption and more robust model evaluations can only be achieved by making all model outputs available under an open-source license.

GEM includes an initial set of eleven datasets that measure specific generation challenges, such as content selection and planning, surface realisation, paraphrasing and simplification. The datasets include CommonGen, E2E clean, MLSum, ToTTo and WebNLG, among others.
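For readers who want to explore these corpora, below is a minimal sketch of loading one of them, assuming the GEM collection is accessible through the Hugging Face datasets hub under the "gem" name; the config and field names follow the CommonGen config and should be treated as assumptions.

```python
# Minimal sketch: loading a GEM dataset, assuming Hugging Face hosting
# under the "gem" name with configs such as "common_gen" or "web_nlg_en".
from datasets import load_dataset

common_gen = load_dataset("gem", "common_gen")

# Each example pairs a structured input with reference text;
# the field names below follow the CommonGen config.
example = common_gen["train"][0]
print(example["concepts"])  # a list of concept words
print(example["target"])    # a reference sentence using them
```

The same pattern, with a different config name, applies to the other corpora in the collection, which is what lets one model harness cover the whole benchmark.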

GEM datasets also differ in aspects such as communicative goals, noisiness of data, resource availability and language, which allows the consistency of evaluation schemes to be assessed across them. Dataset sizes range from 5k to 500k data points.

At present, GEM supports seven languages across all tasks. To assess the performance of the models, the researchers introduced challenging test sets that probe for specific modelling aspects.

Benefits of GEM

  • GEM measures NLG progress across 13 datasets spanning many NLG tasks and languages.
  • The benchmark provides an in-depth analysis of data and models presented via data statements and challenge sets.
  • GEM develops standards for evaluation of generated text using both automated and human metrics.
  • GEM can be used to develop reproducible and consistent human evaluation strategies for generated text.

Wrapping Up

According to the researchers, GEM provides a testbed for automated metrics and can be used to popularise newly developed ones. The benchmark aims to measure progress in NLG without misrepresenting the complex interactions between these sometimes conflicting measures.
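As an illustration of what such a testbed automates, here is a minimal sketch of scoring system outputs against references with one automated metric, corpus-level BLEU via the sacrebleu library; GEM bundles a broader and evolving metric suite, so this stands in for the general pattern rather than the benchmark's exact tooling.

```python
# Minimal sketch: scoring model outputs against references with an
# automated metric (corpus BLEU via sacrebleu). Illustrative only;
# GEM itself ships a wider suite of automated metrics.
import sacrebleu

hypotheses = [
    "The dog fetched the stick from the river.",
    "A chef is slicing vegetables in the kitchen.",
]
references = [
    "A dog fetches a stick out of the river.",
    "The chef slices vegetables in a kitchen.",
]

# sacrebleu expects one list of reference streams per reference set.
score = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {score.score:.2f}")
```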

On a concluding note, the researchers stated, “By providing a testbed to easily conduct experiments across many datasets and evaluate in a repeatable, consistent, and more interpretable way, we will be able to track progress toward the goals in NLG research much more clearly.

Moreover, we will be able to extend and shape GEM in the future to include more multilingual datasets, which will assist in their adoption across the broader research community.”

Read the paper here.
