Can We Crowdsource Benchmarks For Evaluating NLP Models?

A team of researchers from the Allen Institute for AI, the University of Washington and the Hebrew University of Jerusalem recently introduced GENIE, a leaderboard for human-in-the-loop benchmarking of text generation. According to its developers, GENIE is a benchmark for evaluating generative natural language processing (NLP) models that enables model scoring by humans at scale.

The last few years have seen an explosion of NLP datasets aimed at testing various capabilities, including machine comprehension, commonsense reasoning and summarisation.

Leaderboards such as the GLUE benchmark have also proven successful in promoting progress on a wide array of datasets and tasks. According to the researchers, however, their adoption has so far been limited to task setups with reliable automatic evaluation, such as classification or span selection.


Behind GENIE

GENIE is a collection of leaderboards for text generation tasks, backed by human evaluation. It works by posting model predictions to a crowdsourcing platform, where human annotators evaluate them according to predefined, dataset-specific guidelines. The result is an extensible human evaluation leaderboard that brings the ease of leaderboards to text generation tasks.

The benchmark also reports various automatic metrics so that their correlation with the human assessment scores can be examined. The researchers integrated popular English-language datasets from four diverse tasks into GENIE: machine translation, question answering, summarisation and commonsense reasoning.
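To make the idea of checking an automatic metric against human judgement concrete, here is a minimal sketch that computes the Pearson correlation between per-system metric scores and per-system human scores. The function name and the numbers are hypothetical, chosen only for illustration; they are not GENIE results, and the article does not say which correlation statistic GENIE itself reports.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-system scores (illustrative only, not from GENIE).
human_scores = [0.62, 0.71, 0.55, 0.80]   # mean human rating per system
metric_scores = [0.31, 0.35, 0.27, 0.41]  # automatic metric per system

r = pearson(metric_scores, human_scores)
print(f"Pearson r between metric and human scores: {r:.3f}")
```

A value close to 1 would suggest the metric ranks systems much as the annotators do; a value near 0 would suggest it carries little information about human preference.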


During the development of GENIE, the researchers faced several issues:

  • Each submission entails a crowdsourcing fee, which might deter submissions from researchers with limited resources. The researchers address this by keeping the cost of a single submission in the range of $100, and they plan to pay for initial GENIE submissions from academic groups to further encourage participation.
  • A major concern for a human-evaluation leaderboard is the reproducibility of human annotations over time and across various annotator populations. To address this, the researchers introduced mechanisms into GENIE that include estimating annotator variance and spreading the annotations across multiple days.


According to the researchers, the leaderboard offers several benefits:

  • GENIE provides text generation model developers with the ease of the “leaderboard experience,” alleviating the evaluation burden while ensuring high-quality, standardised comparison against previous models.
  • The new benchmark facilitates the study of human evaluation interfaces, addressing challenges such as annotator training, inter-annotator agreement, and reproducibility.
  • GENIE helps developers of automatic evaluation metrics by serving as a hub of model submissions and associated human scores, which they can leverage to train or test against.


The contributions listed by the researchers are:

  • Introduced GENIE, a new benchmark for evaluating generative NLP models that enables model scoring by humans at scale.
  • Released a public leaderboard for generative NLP tasks and formalised methods for converting crowd worker responses into model performance estimates and confidence intervals.
  • Conducted case studies on the human-in-the-loop leaderboard in terms of its reproducibility of results across time and correlation with expert judgements as a function of various design choices (number of collected annotations, scales of annotations, among others).
  • Established human baseline results using state-of-the-art generative models for several popular tasks and demonstrated that these tasks are far from solved.
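The step of turning crowd worker responses into a performance estimate with a confidence interval can be sketched as follows. This is a generic percentile bootstrap over ratings, assuming each rating has already been mapped to the range 0 to 1; it is an illustrative assumption, not necessarily the procedure the GENIE authors formalised, and the function name, parameters and data are hypothetical.

```python
import random
import statistics


def bootstrap_ci(ratings, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap over ratings."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(ratings)
    # Resample the ratings with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(ratings, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(ratings), lower, upper


# Hypothetical annotator ratings for one model, already scaled to [0, 1].
ratings = [0.0, 0.25, 0.5, 0.75, 1.0] * 8

mean, lower, upper = bootstrap_ci(ratings)
print(f"estimated score: {mean:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Collecting more annotations per model narrows the interval, which is one reason the number of collected annotations appears among the design choices the researchers studied.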

Wrapping Up

The researchers stated, “To the best of our knowledge, GENIE is the first human-in-the-loop leaderboard to support controlled evaluation of a broad set of natural language tasks, and which is based on experiments aimed to ensure scoring reliability and design decisions for best-practice human evaluation templates.” Currently, the research is focused on the English language, mostly due to easy integration with crowdsourcing platforms. In the coming years, the researchers are hoping to integrate datasets from other languages.

Read the paper here.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
