Can We Crowdsource Benchmarks For Evaluating NLP Models?

  • Allen Institute for AI has introduced a new leaderboard for evaluating generative NLP models.

Recently, a team of researchers from the Allen Institute for AI, the University of Washington and the Hebrew University of Jerusalem introduced a new leaderboard for human-in-the-loop text generation benchmarking, known as GENIE. According to its developers, GENIE is a new benchmark for evaluating generative natural language processing (NLP) models, and it is claimed to enable model scoring by humans at scale.

The last few years have seen an explosion of natural language processing (NLP) datasets aimed at testing capabilities such as machine comprehension, commonsense reasoning and summarisation.


Also, leaderboards such as the GLUE benchmark have proven successful in promoting progress on a wide array of datasets and tasks. According to the researchers, however, their adoption has so far been limited to task setups with reliable automatic evaluation, such as classification or span selection.

Behind GENIE

GENIE is a collection of leaderboards for text-generation tasks backed by human evaluation of the generated text. GENIE works by posting model predictions to a crowdsourcing platform, where human annotators evaluate them according to predefined, dataset-specific guidelines. It is an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks.

The new benchmark also evaluates various automatic metrics to examine how well they correlate with human assessment scores. The researchers integrated popular datasets in English from four diverse tasks into GENIE: machine translation, question answering, summarisation, and commonsense reasoning.
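As a rough illustration of what such a correlation check involves (the scores below are invented, not taken from the paper), one can compute a Pearson correlation between an automatic metric's per-example scores and the corresponding averaged human ratings:

```python
from statistics import mean

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length score lists.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sd_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# Hypothetical per-example scores: an automatic metric versus
# averaged human ratings (1-5 scale) for the same model outputs.
metric_scores = [0.41, 0.55, 0.62, 0.30, 0.71]
human_scores = [3.2, 3.9, 4.1, 2.8, 4.5]

print(round(pearson(metric_scores, human_scores), 3))
```

A metric whose scores track human judgements closely yields a correlation near 1; a weakly correlated metric is a poor stand-in for human evaluation on that task.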


During the development of GENIE, the researchers faced several issues:

  • Each submission entails a crowdsourcing fee, which might deter submissions from researchers with limited resources. The researchers tackle this difficulty by keeping the cost of a single submission in the range of $100, and they plan to pay for initial GENIE submissions from academic groups to further encourage participation.
  • One of the major concerns for a human-evaluation leaderboard is the reproducibility of human annotations over time and across annotator populations. The researchers therefore describe mechanisms introduced into GENIE to address this, including estimating annotator variance and spreading annotations across multiple days.


According to the researchers, GENIE offers several benefits:
  • GENIE provides text generation model developers with the ease of the “leaderboard experience,” alleviating the evaluation burden while ensuring high-quality, standardised comparison against previous models. 
  • The new benchmark facilitates the study of human evaluation interfaces, addressing challenges such as annotator training, inter-annotator agreement, and reproducibility.
  • GENIE helps developers of automatic evaluation metrics by serving as a hub of model submissions and associated human scores, which they can leverage to train or test against.


The researchers summarised their key contributions as follows:
  • Introduced GENIE, a new benchmark for evaluating generative NLP models that enables model scoring by humans at scale.
  • Released a public leaderboard for generative NLP tasks and formalised methods for converting crowd worker responses into model performance estimates and confidence intervals.
  • Conducted case studies on the human-in-the-loop leaderboard in terms of its reproducibility of results across time and correlation with expert judgements as a function of various design choices (number of collected annotations, scales of annotations, among others).
  • Established baseline results using state-of-the-art generative models for several popular tasks and demonstrated that these tasks are far from solved.
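The paper formalises how crowd worker responses become model performance estimates with confidence intervals; a minimal sketch of one common approach to this general problem, a percentile bootstrap over per-example human ratings (the ratings here are invented, and this is not necessarily the exact procedure GENIE uses), might look like:

```python
import random

def bootstrap_ci(ratings, n_resamples=2000, alpha=0.05, seed=0):
    # Point estimate (mean rating) plus a percentile bootstrap
    # confidence interval: resample the ratings with replacement,
    # record each resample's mean, and take the central quantiles.
    rng = random.Random(seed)
    n = len(ratings)
    means = sorted(
        sum(rng.choice(ratings) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    point = sum(ratings) / n
    return point, (lo, hi)

# Hypothetical 1-5 crowd ratings for one model's outputs.
ratings = [4, 3, 5, 4, 2, 4, 5, 3, 4, 4]
point, (lo, hi) = bootstrap_ci(ratings)
print(f"mean={point:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Reporting an interval rather than a bare mean lets the leaderboard signal when two models' scores are statistically indistinguishable given the number of annotations collected.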

Wrapping Up

The researchers stated, “To the best of our knowledge, GENIE is the first human-in-the-loop leaderboard to support controlled evaluation of a broad set of natural language tasks, and which is based on experiments aimed to ensure scoring reliability and design decisions for best-practice human evaluation templates.” Currently, the research is focused on the English language, mostly due to easy integration with crowdsourcing platforms. In the coming years, the researchers are hoping to integrate datasets from other languages.

Read the paper here.
