Guide to Robustness Gym: Unifying the NLP Evaluation Landscape

Once an AI/ML model is built, researchers spend considerable time deciding which criteria the model should be evaluated on, and evaluation methods tend to be problem-specific. Recently, Stanford University, along with Salesforce Research and UNC-Chapel Hill, proposed a system for evaluating NLP pipelines called Robustness Gym. The framework was first described in the research paper Robustness Gym: Unifying the NLP Evaluation Landscape, submitted to arXiv on January 13, 2021, by Karan Goel, Nazneen Rajani, Jesse Vig, Samson Tan, Jason Wu, Stephan Zheng, Caiming Xiong, Mohit Bansal and Christopher Ré.

Robustness Gym is a simple Python toolkit for evaluating NLP systems systematically. It works across multiple evaluation idioms and helps probe issues such as data errors, distribution shift and bias. The framework focuses on the following problems.


  • The paradox of choice: Given a particular problem and its specification, which evaluations should be run (for example bias, generalization, etc.)?
  • Idiomatic lock-in: Once the evaluation is chosen, idiomatic lock-in refers to the choice of tool used to execute it. The four distinct evaluation idioms found in existing toolkits are subpopulations, transformations, adversarial attacks and evaluation sets.
  • Workflow fragmentation: This refers to the difficulty of keeping track of progress, i.e., saving all the data and generating reports.

Robustness Gym addresses these challenges with a Contemplate -> Create -> Consolidate evaluation loop, where:

  • Contemplate helps in choosing which evaluation to run (the paradox of choice) by giving guidance on the decision variables.
  • Create slices the data into different collections using the evaluation idioms.
  • Consolidate arranges all the slices (from Create) into a TestBench and creates reports.

Conventionally, the evaluation procedure involves three steps:

  1. Loading the data.
  2. Generating predictions using the trained model.
  3. Computing the metrics.
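
These three conventional steps can be sketched in a few lines of plain Python. This is only a minimal illustration: the toy examples, the `predict` stand-in model and the `accuracy` helper are hypothetical names made up for this sketch and are not part of Robustness Gym.

```python
# Hypothetical toy data and model, for illustration only.

def load_data():
    """Step 1: load the data (here, a hard-coded toy yes/no QA set)."""
    return [
        {"question": "is the sky blue", "label": 1},
        {"question": "is fire cold", "label": 0},
        {"question": "is water wet", "label": 1},
        {"question": "is the sun a planet", "label": 0},
    ]


def predict(example):
    """Step 2: stand-in for a trained model; it always answers 'yes' (1)."""
    return 1


def accuracy(predictions, labels):
    """Step 3: compute a metric (fraction of correct predictions)."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


examples = load_data()
predictions = [predict(ex) for ex in examples]
labels = [ex["label"] for ex in examples]
score = accuracy(predictions, labels)
print(f"accuracy = {score:.2f}")
```

A single aggregate number like this does not show where the model fails, which is the kind of question a finer-grained evaluation workflow tries to answer.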

But in Robustness Gym, this same procedure is divided into six steps (see section 3). The complete workflow diagram of Robustness Gym is shown below. I recommend going through this article before proceeding further.

  1. Requirements

Python >= 3.8, < 4.0
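
A quick way to confirm that the running interpreter satisfies this requirement is sketched below. `python_version_supported` is a hypothetical helper written for this example; it is not part of Robustness Gym.

```python
import sys


def python_version_supported(version_info=sys.version_info):
    """Return True if the interpreter version is >= 3.8 and < 4.0."""
    major_minor = (version_info[0], version_info[1])
    return (3, 8) <= major_minor < (4, 0)


print(python_version_supported())
```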

  2. Installation

Install the Robustness Gym toolkit via pip; the installation may take some time.

Install the latest release of robustnessgym, which is version 0.0.3 at the time of writing. First create a conda environment and activate it, then install Robustness Gym and its dependencies, and finally add the environment to your Jupyter notebook. Type these commands one by one in your terminal.

conda create --name robustnessgym python=3.8 -y

source activate robustnessgym

pip install robustnessgym==0.0.3

python -m spacy download en_core_web_sm

conda install -c anaconda ipykernel

python -m ipykernel install --user --name=robustnessgym

  3. Robustness Gym Workflow

As discussed above, in contrast to the traditional evaluation steps, Robustness Gym follows six steps, which we discuss in detail below.

3.1 Load the data

Robustness Gym supports Hugging Face datasets, which makes loading data easy. Here is an example that loads the BoolQ question-answering dataset.

 import robustnessgym as rg
 # Load the boolq data
 dataset = rg.Dataset.load_dataset('boolq')
 # Load the first 10 training examples
 dataset = rg.Dataset.load_dataset('boolq', split='train[:10]') 

3.2 Compute and cache side information

In this step, we pre-process the data and compute information about each example that can be used later for analysis. A CachedOperation is similar to calling .map() on your dataset, except that it also lets you retrieve any information you cached earlier. An example is shown below.

 # Get a dataset
 from robustnessgym import Dataset
 dataset = Dataset.load_dataset('boolq')["train"]
 # Run the Spacy pipeline
 from robustnessgym import Spacy
 spacy = Spacy()
 # .. on the 'question' column of the dataset
 dataset = spacy(batch_or_dataset=dataset,
                 columns=['question'])
 # Run the Stanza pipeline
 from robustnessgym import Stanza
 stanza = Stanza()
 # .. on both the question and passage columns of a batch
 dataset = stanza(batch_or_dataset=dataset[:32],
                  columns=['question', 'passage'])
 # .. use any of the other built-in operations in Robustness Gym!
 # Or, create your own CachedOperation
 from robustnessgym import CachedOperation, Identifier
 from robustnessgym.core.decorators import singlecolumn
 # Write a silly function that operates on a single column of a batch
 @singlecolumn
 def silly_fn(batch, columns):
     """Capitalize text in the specified column of the batch."""
     column_name = columns[0]
     #assert type(batch[column_name]) == str, "Must apply to text column."
     return [text.capitalize() for text in batch[column_name]]
 # Wrap the silly function in a CachedOperation
 # (the identifier name 'SillyOp' is an arbitrary label)
 silly_op = CachedOperation(apply_fn=silly_fn,
                            identifier=Identifier('SillyOp'))
 # Apply it to a dataset
 dataset = silly_op(batch_or_dataset=dataset,
                    columns=['question'])

Retrieve the cached information:

 from robustnessgym import Spacy, Stanza, CachedOperation, Dataset
 # Take a batch of data
 batch = dataset
 # Retrieve the (cached) results of the Spacy CachedOperation
 spacy_information = Spacy.retrieve(batch, columns=['question'])
 # Retrieve the tokens returned by the Spacy CachedOperation
 tokens = Spacy.retrieve(batch, columns=['question'], proc_fns=Spacy.tokens)
 # Retrieve the entities found by the Stanza CachedOperation
 entities = Stanza.retrieve(batch, columns=['passage'], proc_fns=Stanza.entities)
 # Retrieve the capitalized output of the silly_op
 capitalizations = CachedOperation.retrieve(batch,
                                            columns=['question'],
                                            identifier=silly_op.identifier)
 # Retrieve it directly using the silly_op
 capitalizations = silly_op.retrieve(batch, columns=['question'])
 # Retrieve the capitalized output and lower-case it during retrieval
 capitalizations = silly_op.retrieve(
     batch,
     columns=['question'],
     proc_fns=lambda decoded_batch: [x.lower() for x in decoded_batch]
 )
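
The compute-once, retrieve-later idea can also be illustrated without the library. The sketch below is not Robustness Gym's implementation: `ToyCachedOperation`, its `cache::` key naming and the dict-of-lists batch are assumptions invented here purely to show the pattern.

```python
class ToyCachedOperation:
    """Hypothetical stand-in: compute per-example info once and store it in the batch."""

    def __init__(self, name, apply_fn):
        self.name = name
        self.apply_fn = apply_fn
        self.calls = 0  # counts how often apply_fn actually runs

    def _key(self, column):
        return f"cache::{self.name}::{column}"

    def __call__(self, batch, columns):
        """Return a copy of the batch with cached outputs added for each column."""
        out = dict(batch)
        for column in columns:
            key = self._key(column)
            if key not in out:  # only compute when nothing is cached yet
                out[key] = self.apply_fn(batch[column])
                self.calls += 1
        return out

    def retrieve(self, batch, columns, proc_fn=None):
        """Read cached outputs back, optionally post-processing them."""
        result = {}
        for column in columns:
            values = batch[self._key(column)]
            result[column] = proc_fn(values) if proc_fn else values
        return result


batch = {"question": ["is the sky blue", "is fire cold"]}
capitalize_op = ToyCachedOperation(
    "capitalize", lambda texts: [t.capitalize() for t in texts]
)

batch = capitalize_op(batch, columns=["question"])
batch = capitalize_op(batch, columns=["question"])  # second call hits the cache

caps = capitalize_op.retrieve(batch, columns=["question"])
lowered = capitalize_op.retrieve(
    batch, columns=["question"], proc_fn=lambda values: [v.lower() for v in values]
)
print(caps["question"])
print(lowered["question"])
```

The second call to `capitalize_op` does no work because the result is already stored alongside the data, which is the property that makes later slicing steps cheap.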

3.3 Build slices

Slices of data are built using the cached information. A slice is simply a collection of examples for evaluation, together with a method for retrieving cached information. Robustness Gym uses the SliceBuilder class for this. Currently, Robustness Gym supports four types of slices.

  1. Evaluation Sets: slice constructed from a pre-existing dataset.
 from robustnessgym import Dataset, Slice
 # Evaluation Sets: direct construction of a slice
 boolq_slice = Slice(Dataset.load_dataset('boolq')["train"])
  2. Subpopulations: slice constructed by filtering a larger dataset.
 from robustnessgym import Spacy, ScoreSubpopulation, Identifier, Dataset
 from robustnessgym.core.decorators import prerequisites
 dataset = Dataset.load_dataset('boolq', split='validation')
 # `datasets` has made some updates, temporary workaround to set the dataset identifier that we'll fix in v0.0.4
 dataset._identifier = dataset.identifier.without('version')
 spacy = Spacy()
 dataset = spacy(dataset, ['question'])
 def length(batch, columns):
     """Length using cached Spacy tokenization."""
     column_name = columns[0]
     # Take advantage of previously cached Spacy information
     tokens = Spacy.retrieve(batch, columns, proc_fns=Spacy.tokens)[column_name]
     return [len(tokens_) for tokens_ in tokens]
 # Create a subpopulation that buckets examples based on length
 # `prerequisites` is a temporary workaround to specify that `length` requires Spacy to be cached
 # this will not be required in v0.0.4
 length_subpopulation = prerequisites(Spacy)(ScoreSubpopulation)(
     identifiers=[Identifier('0-10'), Identifier('10-20')],
     intervals=[(0, 10), (10, 20)],
     score_fn=length,
 )
 # v0.0.3 no longer modifies `dataset`
 sls, mat = length_subpopulation(dataset, columns=['question'])
  3. Transformations: slice constructed by transforming a dataset.
  4. Attacks: slice constructed by attacking a dataset adversarially.
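
To make the difference between filtering and transforming concrete, here is a library-free sketch. All names (`length_subpopulation`, `uppercase_transformation`, the toy examples) are hypothetical and do not mirror the SliceBuilder API. Intervals are treated as half-open `[low, high)` and tokens are split on whitespace, both of which are assumptions of this sketch.

```python
def length_subpopulation(examples, column, low, high):
    """Subpopulation: keep only examples whose whitespace-token count is in [low, high)."""
    return [ex for ex in examples if low <= len(ex[column].split()) < high]


def uppercase_transformation(examples, column):
    """Transformation: return perturbed copies of every example; labels are unchanged."""
    return [{**ex, column: ex[column].upper()} for ex in examples]


examples = [
    {"question": "is the sky blue", "label": 1},
    {"question": "is fire cold", "label": 0},
    {"question": "does the moon orbit the earth once every month or so", "label": 1},
]

short_slice = length_subpopulation(examples, "question", 0, 5)
long_slice = length_subpopulation(examples, "question", 5, 20)
upper_slice = uppercase_transformation(examples, "question")

print(len(short_slice), len(long_slice), len(upper_slice))
```

A subpopulation selects a subset of the existing examples, while a transformation produces a modified copy of each one. An adversarial attack typically resembles a transformation, except that the perturbation is chosen with the model's behaviour in mind.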

3.4 Evaluate slices

In this step, we simply compute traditional metrics on each slice to carry out the evaluation.
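
As a rough illustration of per-slice evaluation, the sketch below computes accuracy separately for each named slice. It does not use the Robustness Gym API: `evaluate_slices`, `toy_predict` and the example slices are hypothetical names and data chosen for this example.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(int(p == y) for p, y in zip(predictions, labels)) / len(labels)


def evaluate_slices(slices, predict_fn):
    """Compute accuracy separately for each named slice."""
    report = {}
    for name, examples in slices.items():
        predictions = [predict_fn(ex) for ex in examples]
        labels = [ex["label"] for ex in examples]
        report[name] = accuracy(predictions, labels)
    return report


def toy_predict(example):
    """Hypothetical model: answers 'yes' (1) only for questions of up to three words."""
    return 1 if len(example["question"].split()) <= 3 else 0


slices = {
    "short": [
        {"question": "is water wet", "label": 1},
        {"question": "is fire cold", "label": 0},
    ],
    "long": [
        {"question": "is the sun a planet", "label": 0},
        {"question": "is the moon made of cheese", "label": 0},
    ],
}

report = evaluate_slices(slices, toy_predict)
for name, value in report.items():
    print(f"{name}: accuracy = {value:.2f}")
```

Reporting the metric per slice rather than over the whole dataset makes it visible when a model does well on one group of examples and poorly on another.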

3.5 Report and share findings

The dashboard feature will soon be publicly available.

3.6 Iterate


In this article, we discussed the NLP evaluation toolkit Robustness Gym and walked through the steps of its workflow in comparison to the traditional approach. The library is still under development, and more pipelines and functionality are expected to be integrated. The currently available version is 0.0.3, so some things may not work as expected.

Official code, docs & tutorial are available at:


Aishwarya Verma
A data science enthusiast and a post-graduate in Big Data Analytics. Creative and organized with an analytical bent of mind.
