Hugging Face Releases Perceiver IO, A Next Generation Transformer

Perceiver IO, newly added to Hugging Face’s Transformers library, works across modalities such as text, images and audio.

Hugging Face has added Perceiver IO, the first Transformer-based neural network that works on all kinds of modalities, including text, images, audio, video, point clouds and even combinations of these.

The Transformer architecture has a well-known limitation: its self-attention mechanism scales quadratically in both compute and memory with the input size. In every layer, all inputs are used to produce queries and keys, for which a pairwise dot product is computed. Hence, it is impractical to apply self-attention directly to high-dimensional data without some form of preprocessing. The Perceiver solves this by employing the self-attention mechanism on a fixed set of latent variables rather than on the inputs; the inputs are only used for cross-attention with the latents. This has the advantage that the bulk of the compute happens in a latent space, where it is cheap. The resulting architecture has no quadratic dependence on the input size: the encoder’s cross-attention depends only linearly on the input size, while the latent self-attention is independent of it.
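The scaling argument can be made concrete by counting attention scores per layer. A minimal sketch, with illustrative sizes rather than the actual Perceiver hyperparameters:

```python
# Count the pairwise attention scores computed per layer.
# Sizes below are illustrative, not the real model's defaults.

def standard_self_attention_scores(num_inputs: int) -> int:
    # Every input attends to every input: quadratic in the input size.
    return num_inputs * num_inputs

def perceiver_scores(num_inputs: int, num_latents: int) -> int:
    # Cross-attention: each latent attends to every input (linear in inputs),
    # then latent self-attention is independent of the input size.
    return num_latents * num_inputs + num_latents * num_latents

m, n = 50_000, 256  # e.g. 50k input elements, 256 latents
standard = standard_self_attention_scores(m)  # 2_500_000_000
perceiver = perceiver_scores(m, n)            # 12_865_536
print(standard // perceiver)                  # roughly 194x fewer scores
```

Doubling the input size doubles `perceiver_scores` but quadruples `standard_self_attention_scores`, which is why the latent bottleneck makes long inputs tractable.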

To initialize a PerceiverModel, developers can additionally provide three components to the model – a preprocessor, a decoder, and a postprocessor.

The inputs are first optionally preprocessed using the preprocessor. The preprocessed inputs then perform a cross-attention operation with the latent variables of the Perceiver encoder. After that, the decoder can be used to decode the final hidden states of the latents into something more useful, such as classification logits. Finally, the postprocessor can be used to turn the decoder outputs into task-specific features.
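The flow above can be sketched by tracing tensor shapes through the four stages. The function and dimension names below are illustrative placeholders, not the Transformers API; the point is that the encoder’s output shape is fixed by the latents, not by the input:

```python
# Trace (length, channels) shapes through the Perceiver pipeline.
# All names and sizes here are illustrative placeholders.

NUM_LATENTS, D_LATENTS, NUM_LABELS = 256, 1280, 10

def preprocess(raw_len: int, d_model: int = 768):
    # Preprocessor: embed the raw inputs (e.g. text bytes) into vectors.
    return (raw_len, d_model)

def encode(inputs_shape, num_latents=NUM_LATENTS, d_latents=D_LATENTS):
    # Encoder: the latents cross-attend to the inputs, then self-attend
    # among themselves; the output keeps the latent shape regardless of
    # how long the input was.
    return (num_latents, d_latents)

def decode(latents_shape, num_labels=NUM_LABELS):
    # Decoder: query the final latents to produce task outputs (logits).
    return (num_labels,)

def postprocess(decoded_shape):
    # Postprocessor: optional; the identity for plain classification.
    return decoded_shape

print(postprocess(decode(encode(preprocess(2048)))))  # (10,)

# The latent bottleneck: the encoded shape is the same for any input length.
assert encode(preprocess(2048)) == encode(preprocess(100_000))
```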

Implementing Perceiver:

  • Perceiver for text: PerceiverTextPreprocessor, passed as the model’s preprocessor, takes care of embedding the inputs and adding absolute position embeddings. As decoder, one provides PerceiverClassificationDecoder to the model; no postprocessor is required here.
  • Perceiver for images: Having set up the Perceiver for text classification, it is straightforward to apply it to image classification. The only change is a different preprocessor, which embeds the image inputs.
  • Perceiver for multimodal autoencoding: The goal of multimodal autoencoding is to learn a model that can accurately reconstruct multimodal inputs in the presence of a bottleneck induced by an architecture. 
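The text-classification setup can be wired together by hand. A minimal sketch based on the Transformers Perceiver API at the time of release (verify the signatures against your installed version); the tiny config values are illustrative, not those of the released deepmind/language-perceiver checkpoint:

```python
# Sketch: a hand-wired Perceiver text classifier. The small config values
# are illustrative only; the released checkpoints use much larger ones.
import torch
from transformers import PerceiverConfig, PerceiverModel
from transformers.models.perceiver.modeling_perceiver import (
    PerceiverClassificationDecoder,
    PerceiverTextPreprocessor,
)

config = PerceiverConfig(
    d_model=32, d_latents=32, num_latents=16,
    num_self_attends_per_block=2, num_labels=2,
)

# Preprocessor: embeds input byte ids and adds absolute position embeddings.
preprocessor = PerceiverTextPreprocessor(config)

# Decoder: a trainable query cross-attends to the final latents and
# produces classification logits.
decoder = PerceiverClassificationDecoder(
    config,
    num_channels=config.d_latents,
    trainable_position_encoding_kwargs=dict(
        num_channels=config.d_latents, index_dims=1
    ),
    use_query_residual=True,
)

model = PerceiverModel(config, input_preprocessor=preprocessor, decoder=decoder)

# Dummy byte-level input ids (PerceiverTokenizer works on raw UTF-8 bytes).
input_ids = torch.randint(0, config.vocab_size, (1, 64))
logits = model(inputs=input_ids).logits  # shape (batch, num_labels)
```

For the image case, only the preprocessor changes; in practice one would typically start from a pretrained checkpoint via `from_pretrained` rather than a random initialization as above.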

The advantage of the Perceiver is that the compute and memory requirements of the self-attention mechanism don’t depend on the size of the inputs and outputs, as the bulk of the compute happens in a latent space (a not-too-large set of vectors). The model is available in Hugging Face Transformers.

Meeta Ramnani
