DeepMind has open-sourced its general-purpose deep learning model, Perceiver IO. The model can handle many different kinds of inputs and outputs and is designed to serve as a ‘drop-in’ replacement for Transformers.
The original Perceiver model supported many kinds of inputs but was limited to simple outputs such as classification labels. Its successor, Perceiver IO, can handle arbitrary outputs in addition to arbitrary inputs, making it a more general version of the original architecture: a single network that can integrate and transform arbitrary information for arbitrary tasks.
Perceiver IO’s research paper states, “Perceiver IO overcomes the limitation without sacrificing the original’s appealing properties by learning to flexibly query the model’s latent space to produce outputs of arbitrary size and semantics.”
Because it can produce varied outputs from varied inputs, the model suits many applications. For instance, it can be applied to real-world domains such as language and vision, to multimodal understanding, and to challenging games like StarCraft II.
Perceiver IO can predict classification labels and produce language, optical flow, and multimodal video with audio, all using the same building blocks as the original model. Its computational cost scales linearly with the size of the inputs and outputs, so it handles large inputs and outputs better than standard Transformers, whose attention cost grows quadratically. This is possible because the bulk of the processing occurs in a fixed-size latent space. The same property allows Perceiver IO to perform BERT-style masked language modelling directly on bytes rather than on tokenised inputs.
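As a rough illustration of what "directly using bytes" means for masked language modelling, the sketch below turns a string into UTF-8 byte values and hides a random subset of them. It is not DeepMind's preprocessing code: the function name mask_bytes, the MASK_ID value of 256, and the 15% masking rate are assumptions chosen for the example, and the usual BERT refinement of sometimes keeping or randomising a masked position is left out.

```python
import random

# Hypothetical id for the mask symbol: any value outside the 0-255 byte range works.
MASK_ID = 256

def mask_bytes(text, mask_rate=0.15, seed=0):
    """Return (masked_ids, targets) for masked modelling on raw UTF-8 bytes."""
    rng = random.Random(seed)
    ids = list(text.encode("utf-8"))  # one integer in 0..255 per byte, no tokeniser
    masked, targets = [], []
    for byte in ids:
        if rng.random() < mask_rate:
            masked.append(MASK_ID)    # hide this byte from the model
            targets.append(byte)      # the model would be trained to recover it
        else:
            masked.append(byte)
            targets.append(None)      # unmasked positions are not scored
    return masked, targets

masked, targets = mask_bytes("Perceiver IO reads raw bytes.")
print(len(masked))  # same length as the UTF-8 encoding of the text
```

Working on bytes makes sequences several times longer than tokenised text, which is why the linear scaling described above matters for this use case.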
The Hurdle for Transformers
Built on the Transformer architecture, Perceiver uses ‘attention’ to map inputs to outputs. In a standard Transformer, attention compares every element of the input with every other element and weights each one according to its relationships and the task. However, while attention is widely used, the cost of these pairwise comparisons grows quadratically with the number of elements, so it becomes expensive for common forms of data such as images and videos, which can contain millions of elements.
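A quick back-of-the-envelope count shows why this becomes a problem. The snippet below counts the pairwise comparisons in a single full self-attention layer; the input sizes (512 text tokens, a 224x224 image with one element per pixel) are illustrative assumptions, not figures from DeepMind.

```python
def self_attention_pairs(num_elements):
    """Number of query-key comparisons in one full self-attention layer."""
    return num_elements * num_elements

tokens = 512          # assumed length of a short text input
pixels = 224 * 224    # assumed image size, one element per pixel (50176)

print(self_attention_pairs(tokens))  # 262144
print(self_attention_pairs(pixels))  # 2517630976
```

Going from text to a modest image multiplies the number of elements by about 100 and the number of comparisons by about 10,000.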
Perceiver’s Architecture
Perceiver IO overcomes this issue by “scaling the Transformer’s attention operation to substantial inputs without introducing domain-specific assumptions”. The architecture uses cross-attention to project a high-dimensional input array into a lower-dimensional latent space. The latents can then be processed at a cost that is independent of the input’s size. Lastly, the latent representation is converted to outputs by applying a query array with the same number of elements as the desired output data. Deep models become practical in this setting because the cost of each latent processing layer does not grow as the input grows.
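The cost argument can be sketched with a simple count of attention comparisons. This is only an estimate under stated assumptions: it ignores channel widths, MLPs, and multiple heads, and the sizes used (512 latents, 8 latent layers, 50176 inputs and outputs) are hypothetical values picked for the example rather than settings from the paper.

```python
def perceiver_io_pairs(num_inputs, num_outputs, num_latents, num_layers):
    """Rough count of attention comparisons in a latent-bottleneck model."""
    encode = num_inputs * num_latents         # latents attend to the inputs
    process = num_layers * num_latents ** 2   # self-attention among latents only
    decode = num_outputs * num_latents        # output queries attend to the latents
    return encode + process + decode

small = perceiver_io_pairs(50176, 50176, num_latents=512, num_layers=8)
large = perceiver_io_pairs(2 * 50176, 2 * 50176, num_latents=512, num_layers=8)

print(small)          # 53477376
print(large - small)  # 51380224
```

Doubling the inputs and outputs only doubles the encode and decode terms, while the latent processing term stays fixed. By the same counting, a single full self-attention layer over 50176 elements would need 50176 * 50176 = 2517630976 comparisons.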
Credits: DeepMind Blog – The PerceiverIO Architecture
The three steps of the Perceiver IO pipeline:
- Inputs are encoded to a latent space
- The latent representation is refined via many layers of processing
- The latent space is decoded to produce outputs
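The three steps above can be sketched in a few lines of plain Python. This is a toy illustration, not DeepMind's implementation: the attend helper is bare scaled dot-product attention with no learned projections, layer norms, MLPs, or multiple heads, all arrays share one feature width so no projection is needed, and the names perceiver_io_sketch and random_array are made up for the example.

```python
import math
import random

def attend(queries, keys_values):
    """Scaled dot-product attention without learned projections (a simplification)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim)
                  for kv in keys_values]
        top = max(scores)                             # subtract max for stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values)) / total
                    for d in range(dim)])
    return out

def random_array(rows, dim, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]

def perceiver_io_sketch(inputs, latents, output_queries, num_layers=2):
    z = attend(latents, inputs)        # 1. encode: latents query the inputs
    for _ in range(num_layers):
        z = attend(z, z)               # 2. process: self-attention in latent space
    return attend(output_queries, z)   # 3. decode: one output row per query

rng = random.Random(0)
dim = 4
inputs = random_array(50, dim, rng)    # 50 input elements
latents = random_array(8, dim, rng)    # 8 latents (learned in a real model)
queries = random_array(3, dim, rng)    # 3 output queries
outputs = perceiver_io_sketch(inputs, latents, queries)
print(len(outputs), len(outputs[0]))   # 3 4
```

The number of output rows is set by the query array alone, and the processing loop only ever touches the 8 latents, regardless of how many inputs were encoded.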
Features
This design gives Perceiver IO an unprecedented level of generality and versatility compared with the original model, which could only produce a single simple output, such as a class label, for each input. In addition, Perceiver IO is competitive with domain-specific models on benchmarks based on images, 3D point clouds, and audio and images together, making it a good fit for researchers.
In addition to using attention for encoding, Perceiver IO also uses it to decode from the latent array. This makes the network more flexible and lets it scale to larger and more diverse inputs and outputs while dealing with many types of data at once.
This feature makes Perceiver IO a general-purpose model that can handle applications such as understanding the meaning of a text from each of its characters, playing games, tracking the movement of all points in an image, and processing the sound, images, and labels that make up a video. All of this is possible with a single architecture that is simpler than the alternatives.
DeepMind’s experiments concluded that Perceiver IO works across a wide range of benchmark domains, including language, vision, multimodal data, and games. It provides an off-the-shelf way to handle many kinds of data.
To help researchers and the machine learning community at large, DeepMind has now open-sourced the code.