Datasets: A Community Library for NLP by Hugging Face



NLP research and its applications have advanced tremendously in recent years. Today we have powerful pretrained models such as BERT that make it possible to build a robust NLP model quickly. When preparing data for such models, we often spend a lot of time gathering appropriate datasets from repositories such as Kaggle and the UCI Machine Learning Repository. So is there a way to access a variety of such data in one place? The answer is yes. A few months back, Hugging Face introduced a community library called Datasets, which provides more than 600 publicly available datasets in a standard format, covering 467 languages. In this post, we discuss this library and see in practice how we can leverage it. The major points to be discussed are listed below.

Table of Contents

  1. Need of this Community Library
  2. Library Design
  3. Python Implementation

Let’s start the discussion by understanding the need for this framework.



Need of this Community Library

The size, variety, and number of publicly available NLP (Natural Language Processing) datasets have grown rapidly as researchers propose new goals, larger models, and unique benchmarks. For assessment and benchmarking, curated datasets are used; supervised datasets are used for training and fine-tuning models, and massive unsupervised datasets are required for pretraining and language modelling. Each dataset type has a different scale, granularity, and structure, in addition to the annotation approach.

In the past, new dataset paradigms have been critical in propelling NLP forward. Today’s NLP systems consist of a pipeline that includes a wide range of datasets with varying dimensions and levels of annotation. Several datasets are used for pretraining, fine-tuning, and benchmarking. As a result, the number of datasets available to the NLP community has skyrocketed. As the number of datasets grows, significant issues such as interface standardization, versioning, and documentation arise.

One should be able to work with a variety of datasets without having to use multiple interfaces. Furthermore, a group of people working on the same dataset should be able to confirm that they are all using the same version. Finally, the interface should not have to change with the scale of the data, whether it is a small dataset or a large internet-scale corpus.

This is where Datasets comes into play. Datasets is a modern community library created to support the NLP community. It aims to standardize end-user interfaces, versioning, and documentation while also providing a lightweight front-end that can handle small datasets as well as large internet corpora.

The library was built with a distributed, community-driven approach to dataset addition and usage documentation in mind. After a year of work, the library now has over 650 unique datasets and over 250 contributors, and it has supported many original cross-dataset research initiatives and shared tasks.

Datasets is a community library dedicated to addressing data management and access issues while also promoting community culture and norms. The project has hundreds of contributors from all over the world, and each dataset is tagged and documented. Each dataset is expected to be in a standard tabular format that can be versioned and cited; datasets are compute- and memory-efficient by default, and they work well with tokenization and featurization.

Library Design

Users can access a dataset simply by referring to its global name. Each dataset has its own feature schema and metadata. Users need not load a whole dataset: for nearly all datasets, Datasets provides separate splits (commonly train, validation, and test) that can be loaded individually and accessed by indexing. Additionally, various pre-processing steps can be applied directly to the corpus.

Datasets divides its workflow into four simple steps, as follows.

Dataset Retrieval and Building

The underlying raw datasets are not hosted by Datasets; instead, it uses a distributed approach to access hosted data from the original authors. Each dataset has a builder module contributed by the community. The builder module is in charge of converting unstructured data, such as text or CSV, into a standardized dataset interface.
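To make the role of a builder more concrete, here is a minimal sketch that uses only the Python standard library. It is not the library's builder API; the names SCHEMA and build_table are hypothetical and only illustrate the idea of turning raw CSV text into columns that follow a declared schema.

```python
import csv
import io

# Hypothetical feature schema: column name -> Python type used to cast raw strings.
SCHEMA = {"text": str, "label": int}


def build_table(raw_csv, schema):
    """Parse raw CSV text into a dict of typed columns that follows `schema`."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    columns = {name: [] for name in schema}
    for row in reader:
        for name, cast in schema.items():
            # Every raw value is a string; cast it to the declared type.
            columns[name].append(cast(row[name]))
    return columns


raw = "text,label\nGreat film,1\nToo long,0\n"
table = build_table(raw, SCHEMA)
print(table)
```

A real builder also handles downloading, splits, and metadata; the sketch covers only the conversion step described above.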

DataPoint Representation

Internally, each built dataset is represented as a table with typed columns. The Datasets type system offers a variety of common and NLP-targeted types. Aside from atomic values (ints, floats, strings, and binary blobs) and JSON-like dicts and lists, the library also includes named categorical class labels, sequences, paired translations, and higher-dimensional arrays for images, videos, or waveforms.

Memory Access

Datasets is built on Apache Arrow, a cross-language columnar data framework. Arrow includes a local caching system that allows datasets to be backed by a memory-mapped on-disk cache for quick lookup. This architecture allows large datasets to be used on machines with limited device memory. Arrow also allows for copy-free handoffs to popular machine learning tools such as NumPy, Pandas, Torch, and TensorFlow.
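The idea behind memory mapping can be sketched with the standard library alone. This is not how Arrow lays out its data; it is a simplified, assumed format (one fixed-width integer per row) that shows how a single row can be looked up from disk without reading the whole file into memory. The helper names write_column and read_row are hypothetical.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<q")  # one little-endian 64-bit integer per row


def write_column(path, values):
    """Write the values to disk as fixed-width binary records."""
    with open(path, "wb") as f:
        for value in values:
            f.write(RECORD.pack(value))


def read_row(path, index):
    """Read one row through a memory map; only the touched pages are loaded."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            (value,) = RECORD.unpack_from(mm, index * RECORD.size)
            return value


column_path = os.path.join(tempfile.mkdtemp(), "column.bin")
write_column(column_path, range(100_000))
row_value = read_row(column_path, 54_321)
print(row_value)
```

Because records have a fixed width, the byte offset of any row is computed directly from its index, which is what makes the lookup cheap.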

User Processing

The library provides access to typed data with minimal preprocessing when downloaded. It includes sorting, shuffling, splitting, and filtering functions for manipulating datasets. For complex manipulations, it has a powerful map function that applies arbitrary Python functions to create new in-memory tables. map can be run in batched, multi-process mode to apply processing in parallel to large datasets. Data processed by the same function is also cached automatically between sessions.

The Complete Flow of the Query

When you request a dataset, it is downloaded from its original host. This triggers the execution of dataset-specific builder code, which converts the text into a typed tabular format that conforms to the feature schema and caches the table. The user is given a memory-mapped table. The user can run any vectorized code and cache the results to perform additional data processing, such as tokenization.

Python Implementation

In this section, we see in practice how we can leverage Datasets to build NLP-related applications. We will first see how to preview and load a dataset, then pre-process it and make it compatible with a modelling framework. Let’s start by installing and importing the dependencies.

! pip install datasets
! pip install transformers
# import only the helpers used in this post
from datasets import load_dataset, load_dataset_builder

It is often useful to quickly obtain all relevant information about a dataset before taking the time to download it. The datasets.load_dataset_builder() function allows you to inspect a dataset’s attributes without having to download the data itself.

dataset_builder = load_dataset_builder('imdb')
# get feature information
print(dataset_builder.info.features)

# get split information
print(dataset_builder.info.splits)

Once you’ve found the dataset you want, load it with load_dataset() in a single line. You can see the entire schema by simply printing the returned variable. You can also export the data to a CSV file with the dataset’s to_csv() method.

data = load_dataset('imdb',split='train')

So far, we’ve seen how to load a dataset from the Hugging Face Hub and access the data it contains. Now we’ll tokenize our data and prepare it for a framework like TensorFlow. By default, all dataset columns are returned as Python objects; setting the format of the dataset casts the columns to types that are compatible with TensorFlow.

To begin, let’s take a look at tokenization. Tokenization is the process of splitting text into smaller units, such as words or word pieces, known as tokens. Tokens are then converted into numbers, which the model uses as input. To ensure that the text is split consistently, we must use the tokenizer associated with the model. Because we’re using the BERT model in this example, we load the BERT tokenizer.
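The two steps described above, splitting text into tokens and mapping tokens to numbers, can be shown with a toy tokenizer. This sketch splits on whitespace only and is not BERT’s WordPiece algorithm; build_vocab and encode are hypothetical helpers, and the special tokens merely imitate the ones BERT uses.

```python
def build_vocab(texts, specials=("[PAD]", "[UNK]", "[CLS]", "[SEP]")):
    """Assign an integer id to every special token and every distinct word."""
    vocab = {token: idx for idx, token in enumerate(specials)}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Split on whitespace, add start/end markers, and map tokens to ids."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    # Words that are not in the vocabulary fall back to the [UNK] id.
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


vocab = build_vocab(["the film was great", "the plot was thin"])
ids = encode("the film was awful", vocab)
print(ids)  # [2, 4, 5, 6, 1, 3]
```

The word "awful" is not in the vocabulary, so it is mapped to the [UNK] id. This is also why the tokenizer must match the model: a different vocabulary would assign different numbers to the same text.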

import tensorflow as tf
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
# tokenize every review in batches; truncate and pad to the model's maximum length
encoded_data = data.map(lambda e: tokenizer(e['text'], truncation=True, padding='max_length'), batched=True)

TensorFlow and PyTorch are two widely used frameworks for model building. We’ll continue with the TensorFlow example. To wrap the dataset in a tf.data.Dataset, we can use to_tf_dataset(). The resulting tf.data.Dataset object can be iterated over to produce batches of data, which can then be passed directly to methods such as model.fit(). to_tf_dataset() takes a number of arguments, such as:

  • columns: specifies which columns should be formatted (includes the inputs and labels).
  • shuffle: specifies whether the dataset should be shuffled.
  • batch_size: specifies the batch size.
  • collate_fn: specifies a data collator that will batch and pad each processed example. If you’re using a DataCollator, make sure you set return_tensors="tf" when you initialize it so that it returns TensorFlow tensors.
# making compatible dataset for TensorFlow
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
train_dataset = encoded_data.to_tf_dataset(
   columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'],
   shuffle=True,
   batch_size=16,  # assumed value for illustration
   collate_fn=data_collator,
)
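To show what the batch_size and collate_fn arguments are responsible for, here is a plain-Python sketch of batching with padding. It does not use DataCollatorWithPadding or TensorFlow; pad_batch and iter_batches are hypothetical helpers that pad every example in a batch to the length of the longest one and build the matching attention mask.

```python
def pad_batch(examples, pad_id=0):
    """Pad all examples to the longest sequence in this batch."""
    longest = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": []}
    for ex in examples:
        n_real = len(ex["input_ids"])
        n_pad = longest - n_real
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch


def iter_batches(examples, batch_size):
    """Yield padded batches of at most `batch_size` examples."""
    for start in range(0, len(examples), batch_size):
        yield pad_batch(examples[start:start + batch_size])


examples = [
    {"input_ids": [2, 4, 5, 3]},
    {"input_ids": [2, 6, 3]},
    {"input_ids": [2, 7, 8, 9, 3]},
]
batches = list(iter_batches(examples, batch_size=2))
print(batches[0])
```

Padding per batch rather than to a global maximum keeps batches of short texts small, which is the main reason to pass a collator instead of padding everything in advance.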

We have now created a dataset that is ready to use in a training loop for TensorFlow models.


Final Words

The Datasets core library is intended to be simple to use, fast, and to employ the same interface for datasets of varying sizes. Having over 600 datasets in a single location is a gift for any developer or novice. We attempted to understand how the library is organized in this post and demonstrated how we can use it for various NLP applications.



Vijaysinh Lendave
Vijaysinh is an enthusiast in machine learning and deep learning. He is skilled in ML algorithms, data manipulation, handling and visualization, model building.
