
Can We Crowdsource Benchmarks For Evaluating NLP Models?


Recently, a team of researchers from the Allen Institute for AI, the University of Washington, and the Hebrew University of Jerusalem introduced GENIE, a new leaderboard for human-in-the-loop text generation benchmarking. According to its developers, GENIE is a benchmark for evaluating generative natural language processing (NLP) models, and it is claimed to enable model scoring by humans at scale.

In the last few years, we have seen an explosion of natural language processing (NLP) datasets aimed at testing capabilities such as machine comprehension, commonsense reasoning, and summarisation.

Leaderboards such as the GLUE benchmark have also proven successful in promoting progress on a wide array of datasets and tasks. According to the researchers, however, their adoption has so far been limited to task setups with reliable automatic evaluation, such as classification or span selection.

Behind GENIE

GENIE is a collection of leaderboards for text-generation tasks backed by human evaluation. It works by posting model predictions to a crowdsourcing platform, where human annotators evaluate them according to predefined, dataset-specific guidelines. In short, it is an extensible human-evaluation leaderboard that brings the ease of leaderboards to text-generation tasks.
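
To make the workflow concrete, here is a minimal sketch of how model predictions might be packaged before being posted for human evaluation. The JSON-lines layout, the field names, and the write_submission helper are illustrative assumptions, not GENIE's actual submission format.

  import json

  # Hypothetical prediction records: the field names below are assumptions
  # for illustration, not GENIE's actual submission schema.
  predictions = [
      {"task": "summarisation", "example_id": "xsum-0001",
       "prediction": "A summary produced by the model."},
      {"task": "question_answering", "example_id": "nq-0042",
       "prediction": "An answer produced by the model."},
  ]

  def write_submission(records, path):
      """Write one JSON object per line, a common format for leaderboard uploads."""
      with open(path, "w", encoding="utf-8") as f:
          for record in records:
              f.write(json.dumps(record, ensure_ascii=False) + "\n")

  write_submission(predictions, "genie_submission.jsonl")

Annotators would then see each prediction alongside its input and score it according to the dataset-specific guidelines.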

The benchmark also reports various automatic metrics and examines how well they correlate with the human assessment scores. The researchers integrated popular English datasets from four diverse tasks into GENIE: machine translation, question answering, summarisation, and commonsense reasoning.
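
As a rough illustration of what such a correlation check looks like in practice, the sketch below computes a Spearman rank correlation between per-model automatic metric scores and mean human ratings; the numbers are made-up placeholders, and the paper's actual analysis may use different metrics and statistics.

  from scipy.stats import spearmanr

  # Placeholder per-model scores: an automatic metric (e.g. a ROUGE-style score)
  # and the corresponding mean human rating for the same set of models.
  automatic_scores = [0.31, 0.42, 0.38, 0.47, 0.29]
  human_scores = [3.1, 3.9, 3.5, 4.2, 2.8]

  # Spearman's rho measures how well the metric preserves the human ranking.
  rho, p_value = spearmanr(automatic_scores, human_scores)
  print(f"Spearman correlation: {rho:.3f} (p = {p_value:.3f})")

A high correlation suggests the automatic metric can stand in for human judgement on that task; a low one signals that human evaluation is still needed.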

Challenges

During the development of GENIE, the researchers faced several issues:

  • Each submission entails a crowdsourcing fee, which might deter submissions from researchers with limited resources. The researchers tackle this difficulty by keeping the cost of a single submission around $100 and plan to pay for initial GENIE submissions from academic groups to further encourage participation.
  • One of the major concerns for a human-evaluation leaderboard is the reproducibility of human annotations over time and across annotator populations. To address this, the researchers introduced mechanisms into GENIE such as estimating annotator variance and spreading the annotations across multiple days.

Advantages

  • GENIE provides text generation model developers with the ease of the “leaderboard experience,” alleviating the evaluation burden while ensuring high-quality, standardised comparison against previous models. 
  • The new benchmark facilitates the study of human evaluation interfaces, addressing challenges such as annotator training, inter-annotator agreement, and reproducibility.
  • GENIE helps developers of automatic evaluation metrics by serving as a hub of model submissions and associated human scores, which they can leverage to train or test against.

Contributions

  • Introduced GENIE, a new benchmark for evaluating generative NLP models that enables model scoring by humans at scale.
  • Released a public leaderboard for generative NLP tasks and formalised methods for converting crowd worker responses into model performance estimates and confidence intervals (a generic illustration follows this list).
  • Conducted case studies on the human-in-the-loop leaderboard, examining the reproducibility of its results over time and its correlation with expert judgements as a function of various design choices (such as the number of collected annotations and the annotation scales).
  • Established baseline results using state-of-the-art generative models for several popular tasks and demonstrated that these tasks are far from being solved.
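
One generic way to turn crowd-worker ratings into a performance estimate with a confidence interval, as referenced in the second contribution above, is a simple bootstrap over the collected ratings. The sketch below assumes per-example ratings on a numeric scale; the paper formalises its own estimators, which may differ from this illustration.

  import random
  import statistics

  def bootstrap_mean_ci(ratings, n_resamples=2000, alpha=0.05, seed=0):
      """Estimate the mean rating and a (1 - alpha) bootstrap confidence interval."""
      rng = random.Random(seed)
      means = []
      for _ in range(n_resamples):
          resample = [rng.choice(ratings) for _ in ratings]
          means.append(statistics.mean(resample))
      means.sort()
      lower = means[int((alpha / 2) * n_resamples)]
      upper = means[int((1 - alpha / 2) * n_resamples) - 1]
      return statistics.mean(ratings), (lower, upper)

  # Placeholder crowd-worker ratings on a 1-5 scale for one model's outputs.
  ratings = [4, 3, 5, 4, 4, 2, 5, 3, 4, 4]
  estimate, (low, high) = bootstrap_mean_ci(ratings)
  print(f"Mean rating: {estimate:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")

Reporting an interval rather than a single score also makes it easier to judge whether two submissions are meaningfully different.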

Wrapping Up

The researchers stated, “To the best of our knowledge, GENIE is the first human-in-the-loop leaderboard to support controlled evaluation of a broad set of natural language tasks, and which is based on experiments aimed to ensure scoring reliability and design decisions for best-practice human evaluation templates.” Currently, the research focuses on the English language, mostly because it integrates easily with existing crowdsourcing platforms. In the coming years, the researchers hope to integrate datasets from other languages.

Read the paper here.


Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.