Can We Crowdsource Benchmarks For Evaluating NLP Models?

Researchers from the Allen Institute for AI, the University of Washington and the Hebrew University of Jerusalem recently introduced GENIE, a leaderboard for human-in-the-loop benchmarking of text generation. According to its developers, GENIE is a new benchmark for evaluating generative natural language processing (NLP) models, and it is claimed to enable model scoring by humans at scale.

The last few years have seen an explosion of NLP datasets aimed at testing various capabilities, including machine comprehension, commonsense reasoning and summarisation.

Leaderboards such as the GLUE benchmark have also proven successful in promoting progress on a wide array of datasets and tasks. According to the researchers, however, their adoption has so far been limited to task setups with reliable automatic evaluation, such as classification or span selection.

Behind GENIE

GENIE is a collection of leaderboards for text generation tasks, backed by human evaluation. It works by posting model predictions to a crowdsourcing platform, where human annotators evaluate them according to predefined, dataset-specific guidelines. The result is an extensible human-evaluation leaderboard that brings the ease of leaderboards to text generation tasks.

The new benchmark also reports various automatic metrics, so that their correlation with human assessment scores can be examined. The researchers integrated popular English-language datasets from four diverse tasks into GENIE: machine translation, question answering, summarisation and commonsense reasoning.
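To make the idea of checking an automatic metric against human judgement concrete, here is a minimal Python sketch that computes the Pearson correlation between two lists of scores. The article does not say which correlation measure GENIE uses, so Pearson is an assumption here, and the function name `pearson` and all numbers are made up purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-model scores: one automatic metric, one human rating.
metric_scores = [0.21, 0.35, 0.40, 0.52, 0.61]
human_scores = [0.30, 0.42, 0.38, 0.60, 0.71]

r = pearson(metric_scores, human_scores)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would suggest that the metric ranks models much as human annotators do, while a value near 0 would suggest it is a poor stand-in for human evaluation.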


During the development of GENIE, the researchers faced several issues:

  • Each submission entails a crowdsourcing fee, which might deter submissions from researchers with limited resources. The researchers address this by keeping the cost of a single submission at around $100, and they plan to pay for initial GENIE submissions from academic groups to further encourage participation.
  • A major concern for a human-evaluation leaderboard is the reproducibility of human annotations over time and across annotator populations. To address it, the researchers describe mechanisms built into GENIE, including estimating annotator variance and spreading the annotations across multiple days.
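As a rough illustration of what estimating annotator variance and comparing annotations across days could look like, the following Python sketch groups ratings by annotator and by day and compares the group averages. This is not GENIE's actual procedure, which the article does not detail; the record layout, the helper `group_means` and the ratings themselves are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def group_means(ratings, key):
    """Average score per group, where `key` names the grouping field."""
    buckets = defaultdict(list)
    for rating in ratings:
        buckets[rating[key]].append(rating["score"])
    return {group: mean(scores) for group, scores in buckets.items()}


# Hypothetical ratings on a 1-5 scale, collected over three days.
ratings = [
    {"annotator": "a1", "day": 1, "score": 4},
    {"annotator": "a2", "day": 1, "score": 3},
    {"annotator": "a1", "day": 2, "score": 5},
    {"annotator": "a3", "day": 2, "score": 4},
    {"annotator": "a2", "day": 3, "score": 2},
    {"annotator": "a3", "day": 3, "score": 4},
]

by_annotator = group_means(ratings, "annotator")
by_day = group_means(ratings, "day")

# Spread of the per-annotator averages: a crude proxy for annotator variance.
annotator_variance = pvariance(list(by_annotator.values()))

print("mean score per annotator:", by_annotator)
print("mean score per day:", by_day)
print("variance of annotator means:", round(annotator_variance, 3))
```

Large differences between the daily averages, or a high variance across annotators, would be a warning sign that a model's score depends on who rated it and when.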


The benefits of GENIE include the following:

  • GENIE provides text generation model developers with the ease of the “leaderboard experience,” alleviating the evaluation burden while ensuring high-quality, standardised comparison against previous models.
  • The new benchmark facilitates the study of human evaluation interfaces, addressing challenges such as annotator training, inter-annotator agreement, and reproducibility.
  • GENIE helps developers of automatic evaluation metrics by serving as a hub of model submissions and associated human scores, which they can leverage to train or test against.


In this work, the researchers:

  • Introduced GENIE, a new benchmark for evaluating generative NLP models that enables model scoring by humans at scale.
  • Released a public leaderboard for generative NLP tasks and formalised methods for converting crowd worker responses into model performance estimates and confidence intervals.
  • Conducted case studies on the human-in-the-loop leaderboard, examining the reproducibility of its results over time and their correlation with expert judgements as a function of design choices such as the number of collected annotations and the annotation scale.
  • Established human baseline results using state-of-the-art generative models for several popular tasks and demonstrated that these tasks are far from being solved.
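One generic way to turn crowd worker responses into a performance estimate with a confidence interval is the percentile bootstrap, sketched below in Python. The article does not specify which method GENIE formalises, so this is a swapped-in, textbook technique rather than the researchers' approach; the function `bootstrap_ci`, its parameters and the ratings are assumptions for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), lower, upper


# Hypothetical crowd ratings (1-5 scale) for one model's outputs.
crowd_scores = [4, 5, 3, 4, 4, 2, 5, 3, 4, 4, 3, 5]

estimate, lower, upper = bootstrap_ci(crowd_scores)
print(f"estimate = {estimate:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Collecting more annotations per model generally narrows the interval, which is one reason the number of annotations matters as a design choice.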

Wrapping Up

The researchers stated, “To the best of our knowledge, GENIE is the first human-in-the-loop leaderboard to support controlled evaluation of a broad set of natural language tasks, and which is based on experiments aimed to ensure scoring reliability and design decisions for best-practice human evaluation templates.” The research currently focuses on English, mostly because of easy integration with crowdsourcing platforms. The researchers hope to integrate datasets in other languages in the coming years.

Read the paper here.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
