Hugging Face has added Perceiver IO, the first Transformer-based neural network that works on all kinds of modalities, including text, images, audio, video, point clouds, and even combinations of these.
The Transformer architecture has a limitation: its self-attention mechanism scales poorly in both compute and memory. In every layer, all inputs are used to produce queries and keys, for which a pairwise dot product is computed, so cost grows quadratically with the input size. Hence, it is impractical to apply self-attention to high-dimensional data such as raw images or audio without some form of preprocessing. The Perceiver solves this by employing self-attention on a fixed set of latent variables rather than on the inputs; the inputs are used only for cross-attention with the latents. This has the advantage that the bulk of compute happens in the latent space, where attention is cheap because the number of latents is small. The resulting architecture has no quadratic dependence on the input size: the encoder's cross-attention depends only linearly on the input size, while latent self-attention is independent of it.
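To make the scaling argument concrete, here is a minimal PyTorch sketch of Perceiver-style cross-attention (illustrative only, not the Hugging Face implementation; all dimensions are assumptions). The attention matrix has shape (num_latents, input_length), so cost grows linearly in the input length rather than quadratically:

```python
import torch
import torch.nn as nn

# Illustrative Perceiver-style cross-attention (not the Hugging Face implementation).
# N latents attend to M inputs: the attention matrix is N x M, so compute and
# memory grow linearly in the input length M instead of quadratically (M x M).
batch, seq_len, d_model = 1, 10_000, 256      # e.g. 10k input elements
num_latents, d_latents = 256, 1280

inputs = torch.randn(batch, seq_len, d_model)
latents = nn.Parameter(torch.randn(num_latents, d_latents))

to_q = nn.Linear(d_latents, d_latents)
to_k = nn.Linear(d_model, d_latents)
to_v = nn.Linear(d_model, d_latents)

q = to_q(latents).expand(batch, -1, -1)       # (batch, N, d_latents)
k, v = to_k(inputs), to_v(inputs)             # (batch, M, d_latents)

attn = torch.softmax(q @ k.transpose(1, 2) / d_latents ** 0.5, dim=-1)  # (batch, N, M)
latent_states = attn @ v                      # (batch, N, d_latents)
# All subsequent self-attention layers operate on these N latents only,
# so their cost is independent of the input size M.
```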
To initialize a PerceiverModel, developers can provide three optional components to the model: a preprocessor, a decoder, and a postprocessor.
The inputs are first optionally preprocessed by the preprocessor. The latent variables of the Perceiver encoder then perform cross-attention with the preprocessed inputs. After that, the decoder can be used to decode the final hidden states of the latents into something more useful, such as classification logits. Finally, the postprocessor can be used to turn the decoder outputs into the desired features.
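As a concrete illustration, the following sketch wires a text preprocessor and a classification decoder into a randomly initialized PerceiverModel, using the building blocks in transformers.models.perceiver.modeling_perceiver (the binary-classification setup is an assumption):

```python
import torch
from transformers import PerceiverConfig, PerceiverTokenizer, PerceiverModel
from transformers.models.perceiver.modeling_perceiver import (
    PerceiverTextPreprocessor,
    PerceiverClassificationDecoder,
)

config = PerceiverConfig(num_labels=2)  # assumption: binary classification

# Preprocessor: embeds the raw UTF-8 byte IDs and adds absolute position embeddings.
preprocessor = PerceiverTextPreprocessor(config)

# Decoder: cross-attends a trainable classification query to the final latents
# and projects the result to `num_labels` logits.
decoder = PerceiverClassificationDecoder(
    config,
    num_channels=config.d_latents,
    trainable_position_encoding_kwargs=dict(num_channels=config.d_latents, index_dims=1),
    use_query_residual=True,
)

model = PerceiverModel(config, input_preprocessor=preprocessor, decoder=decoder)

# The tokenizer turns text into raw byte IDs; no postprocessor is needed here.
tokenizer = PerceiverTokenizer()
inputs = tokenizer("hello world", return_tensors="pt").input_ids

with torch.no_grad():
    outputs = model(inputs=inputs)
logits = outputs.logits  # shape: (batch_size, num_labels)
```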
Implementing Perceiver:
- Perceiver for text: PerceiverTextPreprocessor serves as the preprocessor, taking care of embedding the input byte IDs and adding absolute position embeddings. PerceiverClassificationDecoder serves as the decoder; no postprocessor is required here. This is exactly the setup shown in the sketch above.
- Perceiver for images: having applied the Perceiver to text classification, it is straightforward to apply it to image classification. The only difference is providing a different preprocessor to the model, one that embeds the image inputs (see the sketch after this list).
- Perceiver for multimodal autoencoding: the goal of multimodal autoencoding is to learn a model that can accurately reconstruct multimodal inputs (for example, the video, audio, and class label of a video clip) in the presence of a bottleneck induced by the architecture.
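For images, the library also ships high-level classes with pretrained checkpoints. A minimal sketch using the deepmind/vision-perceiver-learned checkpoint from the Hugging Face Hub (the test image URL is just an example):

```python
import requests
import torch
from PIL import Image
from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationLearned

feature_extractor = PerceiverFeatureExtractor.from_pretrained("deepmind/vision-perceiver-learned")
model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")

# Example image (two cats), commonly used in Transformers documentation.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The feature extractor resizes and normalizes the image; the model's internal
# preprocessor then embeds the pixel values.
inputs = feature_extractor(image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = model(inputs=inputs)
predicted_class = outputs.logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class])
```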
The advantage of the Perceiver is that the compute and memory requirements of the self-attention mechanism don't depend on the size of the inputs and outputs, as the bulk of compute happens in the latent space (a modestly sized set of vectors). The model is available in Hugging Face Transformers.