Hugging Face Releases Perceiver IO, A Next Generation Transformer

Perceiver IO, newly added to Hugging Face’s Transformers library, works across modalities such as text, images and audio.

Hugging Face has added Perceiver IO, the first Transformer-based neural network that works on all kinds of modalities, including text, images, audio, video, point clouds and even combinations of these.

The Transformer architecture has a well-known limitation: its self-attention mechanism scales quadratically with input size in both compute and memory. In every layer, all inputs are used to produce queries and keys, for which a pairwise dot product is computed. Hence, it is impossible to apply self-attention to high-dimensional data without some form of preprocessing. The Perceiver solves this by employing the self-attention mechanism on a small, fixed set of latent variables rather than on the inputs; the inputs are used only for a cross-attention with the latents. This has the advantage that the bulk of compute happens in a latent space, where compute is cheap. The resulting architecture has no quadratic dependence on the input size: the encoder’s cross-attention depends only linearly on the input size, while the latent self-attention is independent of it.
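
As a back-of-the-envelope illustration of that scaling argument, the snippet below counts the pairwise attention scores computed per layer under each scheme. The latent count of 256 is an arbitrary example chosen here, not a value from the article:

```python
# Standard self-attention computes M x M pairwise scores for M inputs.
def self_attention_scores(m):
    return m * m

# The Perceiver's cross-attention computes only M x N scores against a fixed
# set of N latents; the latent self-attention adds N x N, independent of M.
def perceiver_scores(m, n=256):
    return m * n

for m in (1_000, 10_000, 100_000):
    ratio = self_attention_scores(m) / perceiver_scores(m)
    print(f"M={m:>7}: self-attention computes {ratio:.0f}x more scores")
```

Doubling the input size doubles the Perceiver’s cross-attention cost but quadruples the cost of standard self-attention.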


When initializing a PerceiverModel, developers can optionally provide three additional components to the model – a preprocessor, a decoder, and a postprocessor.

The inputs are first optionally preprocessed by the preprocessor. The preprocessed inputs then perform a cross-attention operation with the latent variables of the Perceiver encoder. After that, the decoder can be used to decode the final hidden states of the latents into something more useful, such as classification logits. Finally, the postprocessor can be used to turn the decoder outputs into the required output features.
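
The flow above can be sketched with plain NumPy, using single-head attention and omitting the learned projection matrices for brevity. All shapes and names here are illustrative, not the library’s actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values):
    # Scaled dot-product attention; self-attention is attention with itself.
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ keys_values

M, N, D = 10_000, 256, 64            # input positions, latents, channel dim
inputs = rng.normal(size=(M, D))     # "preprocessed" inputs (e.g. embeddings)
latents = rng.normal(size=(N, D))    # learned latent array

# Encoder: latents cross-attend to inputs -> score matrix is N x M (linear in M)
hidden = attention(latents, inputs)  # (N, D)

# Latent transformer: self-attention on latents only -> N x N, independent of M
hidden = attention(hidden, hidden)   # (N, D)

# Decoder: an output query cross-attends to the final latents; for
# classification a single query suffices, later projected to logits.
query = rng.normal(size=(1, D))
decoded = attention(query, hidden)   # (1, D)
print(decoded.shape)
```

Note that no intermediate array ever has shape M x M: the input size only appears in the N x M cross-attention, which is where the linear dependence comes from.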

Implementing Perceiver:

  • Perceiver for text: PerceiverTextPreprocessor serves as the preprocessor, taking care of embedding the inputs and adding absolute position embeddings. As decoder, one provides PerceiverClassificationDecoder to the model; no postprocessor is required here.
  • Perceiver for images: Having used the Perceiver for text classification, it is straightforward to apply it to image classification as well. The only change is providing a different preprocessor to the model, one that embeds the image inputs.
  • Perceiver for multimodal autoencoding: The goal of multimodal autoencoding is to learn a model that can accurately reconstruct multimodal inputs in the presence of a bottleneck induced by an architecture. 
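
As a rough sketch of how these pieces fit together in code, the following assembles a text-classification Perceiver from a small, randomly initialized config. The dimensions are illustrative only; real checkpoints such as deepmind/language-perceiver use far larger ones:

```python
import torch
from transformers import PerceiverConfig, PerceiverModel
from transformers.models.perceiver.modeling_perceiver import (
    PerceiverClassificationDecoder,
    PerceiverTextPreprocessor,
)

# Tiny, randomly initialized config for illustration.
config = PerceiverConfig(
    num_latents=16, d_latents=32, d_model=32,
    num_blocks=1, num_self_attends_per_block=1,
    num_self_attention_heads=1, num_cross_attention_heads=1,
    num_labels=2,
)

# Preprocessor embeds the input ids and adds absolute position embeddings;
# the decoder cross-attends to the final latents with one classification query.
model = PerceiverModel(
    config,
    input_preprocessor=PerceiverTextPreprocessor(config),
    decoder=PerceiverClassificationDecoder(
        config,
        num_channels=config.d_latents,
        trainable_position_encoding_kwargs=dict(
            num_channels=config.d_latents, index_dims=1
        ),
        use_query_residual=True,
    ),
)

# The Perceiver operates on raw bytes, so inputs are byte ids.
input_ids = torch.randint(0, config.vocab_size, (1, 20))
outputs = model(inputs=input_ids)
print(outputs.logits.shape)  # (batch_size, num_labels)
```

Swapping the text preprocessor for an image preprocessor (and adjusting the decoder) is, per the list above, all that changes for image classification.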

The advantage of the Perceiver is that the compute and memory requirements of its self-attention mechanism do not depend on the size of the inputs and outputs, as the bulk of compute happens in a latent space (a not-too-large set of vectors). The model is available in Hugging Face Transformers.

Meeta Ramnani
