Hands-On Guide to Hugging Face PerceiverIO for Text Classification

A perceiver is a transformer that can handle non-textual data like images, sounds, and video, as well as spatial data.

Nowadays, most deep learning models are highly optimized for a specific type of dataset. Computer vision and audio analysis can not use architectures that are good at processing textual data. This level of specialization naturally influences the development of models that are highly specialized in one task and unable to adapt to other tasks. So, in contrast to the General Purpose model, we will talk about PerceiverIO, which is designed to address a wide range of tasks with a single architecture. The following are the main points to be discussed in this article.

Table of Contents

  1. What is Perceiver IO?
  2. Architecture of PerceiverIO
  3. Implementing Perceiver IO for Text Classification

Let’s start the discussion by understanding the PerceiverIO.

What is PerceiverIO?

A perceiver is a transformer that can handle non-textual data like images, sounds, and video, as well as spatial data. Other significant systems that came before Perceiver, such as BERT and GPT-3, are based on transformers. It uses an asymmetric attention technique to condense inputs into a latent bottleneck, allowing it to learn from a great amount of disparate data. On classification challenges, Perceiver matches or outperforms specialized models.

The perceiver is free of modality-specific components. It lacks components dedicated to handling photos, text, or audio, for example. It can also handle several associated input streams of varying sorts. It takes advantage of a small number of latent units to create an attention bottleneck through which inputs must pass. One advantage is that it eliminates the quadratic scaling issue that plagued early transformers. For each modality, specialized feature extractors were employed previously.

Perceiver IO can query the model’s latent space in a variety of ways to generate outputs of any size and semantics. It excels at activities that need structured output spaces, such as natural language and visual comprehension and multitasking. Perceiver IO matches a Transformer-based BERT baseline without the need for input tokenization on the GLUE language benchmark and achieves state-of-the-art performance on Sintel optical flow estimation.

The latent array is attended to using a specific output query associated with that particular output to produce outputs. To predict optical flow on a single pixel, for example, a query would use the pixel’s XY coordinates along with an optical flow task embedding to generate a single flow vector. It’s a spin-off of the encoder/decoder architecture seen in other projects.

Architecture of PerceiverIO

The Perceiver IO model is based on the Perceiver architecture, which achieves cross-domain generality by assuming a simple 2D byte array as input: a set of elements (which could be pixels or patches in vision, characters or words in a language or some form of learned or unlearned embedding), each described by a feature vector. The model then uses Transformer-style attention to encode information about the input array using a smaller number of latent feature vectors, followed by iterative processing and a final aggregation down to a category label.

HuggingFace Transformers’ PerceiverModel class serves as the foundation for all Perceiver variants. To initialize a PerceiverModel, three further instances can be specified – a preprocessor, a decoder, and a postprocessor.

A preprocessor is optionally used to preprocess the inputs (which might be any modality or a mix of modalities). The preprocessed inputs are then utilized to execute a cross-attention operation utilizing the latent variables of the Perceiver encoder. 


Perceiver IO is a domain-agnostic process that maps arbitrary input arrays to arbitrary output arrays. The majority of the computation takes place in a latent space that is typically smaller than the inputs and outputs, making the process computationally tractable even when the inputs and outputs are very large.

In this technique (Referring to the above architecture), the latent variables create queries (Q), whilst the preprocessed inputs generate keys and values (KV). Following this, the Perceiver encoder updates the latent embeddings with a (repeatable) block of self-attention layers. Finally, the encoder will create a shape tensor (batch size, num latents, d latents) containing the latents’ most recently concealed states. Then there’s an optional decoder, which may be used to turn the final concealed states of the latent into something more helpful, like classification logits. This is performed by a cross-attention operation in which trainable embeddings create queries (Q) and latent generate keys and values (KV).

PerceiverIO for Text Classification

In this section, we will see how perceiver can be used to do the text classification. Now let’s install the Transformer and datasets module of Huggingface.

! pip install -q git+https://github.com/huggingface/transformers.git
! pip install -q datasets

Next, we will prepare the data from the module. The dataset is about IMDB movie reviews and we are using a chunk of it. Later after loading the dataset, we will make it handy when doing the inference.  

from datasets import load_dataset
# load the dataset
train_ds, test_ds = load_dataset("imdb", split=['train[:100]+train[-100:]', 'test[:5]+test[-5:]'])

# making the dataset handy
labels = train_ds.features['label'].names
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}


In this step, we will preprocess the dataset for tokenization. For that, we are using PerceiverTokenizer on both train and test datasets.

# Tikenization
from transformers import PerceiverTokenizer
tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
train_ds = train_ds.map(lambda examples: tokenizer(examples['text'], padding="max_length", truncation=True),
test_ds = test_ds.map(lambda examples: tokenizer(examples['text'], padding="max_length", truncation=True),

We are going to use PyTorch for further modelling and for that we need to set the format of our data compatible with the PyTorch.

# campatible with torch
from torch.utils.data import DataLoader
train_ds.set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
test_ds.set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
train_dataloader = DataLoader(train_ds, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_ds, batch_size=4)

Next, we will define and train the model.

from transformers import PerceiverForSequenceClassification
import torch
from transformers import AdamW
from tqdm.notebook import tqdm
from sklearn.metrics import accuracy_score
# Define model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PerceiverForSequenceClassification.from_pretrained("deepmind/language-perceiver",
# Train the model
optimizer = AdamW(model.parameters(), lr=5e-5)
for epoch in range(20):  # loop over the dataset multiple times
    print("Epoch:", epoch)
    for batch in tqdm(train_dataloader):
         # get the inputs; 
         inputs = batch["input_ids"].to(device)
         attention_mask = batch["attention_mask"].to(device)
         labels = batch["label"].to(device)
         # zero the parameter gradients
         # forward + backward + optimize
         outputs = model(inputs=inputs, attention_mask=attention_mask, labels=labels)
         loss = outputs.loss
         # evaluate
         predictions = outputs.logits.argmax(-1).cpu().detach().numpy()
         accuracy = accuracy_score(y_true=batch["label"].numpy(), y_pred=predictions)
         print(f"Loss: {loss.item()}, Accuracy: {accuracy}")

Now, let’s do the inference with the model.

text = "I loved this epic movie, the multiverse concept is mind-blowing and a bit confusing."
input_ids = tokenizer(text, return_tensors="pt").input_ids
# Forward pass
outputs = model(inputs=input_ids.to(device))
logits = outputs.logits 
predicted_class_idx = logits.argmax(-1).item()
print("Predicted:", model.config.id2label[predicted_class_idx])


Final Words

Perceiver IO is an architecture that can handle general-purpose inputs and outputs while scaling linearly in both input and output sizes. As we have seen in practice, this architecture produces good results in a wide range of settings. However, we have only seen it for text data, it can also be used for audio, video, and image data as well making it a promising candidate for general-purpose neural network architecture.


Download our Mobile App

Subscribe to our newsletter

Join our editors every weekday evening as they steer you through the most significant news of the day.
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.

Our Recent Stories

Our Upcoming Events

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox