With the growing number of large language models and the rise of multi-modal training, DeepMind has released Gato, a multi-modal, multi-task, multi-embodiment generalist policy. This single generalist agent was trained on data from a wide variety of tasks and modalities, so that one network with a single set of weights can do everything from playing Atari, captioning images, and chatting, to stacking blocks with a robot arm and navigating simulated 3D environments. DeepMind has also released a paper titled ‘A Generalist Agent,’ which describes the training process and the model’s capabilities.
Source: DeepMind research paper
As with large language models, the training data is serialised into a flat sequence of tokens, batched, and processed by a transformer neural network. Gato’s basic tenet was to train on the widest possible range of data, spanning modalities such as images, text, button presses, joint torques, and other discrete and continuous observations and actions, with each item tokenised according to its modality.
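To make the idea concrete, here is a minimal sketch (not DeepMind’s code) of how mixed-modality data might be serialised into one flat token sequence. The bin count, token-range offset, and toy vocabulary are all assumptions for illustration; the actual paper uses its own tokenisation schemes per modality.

```python
import numpy as np

N_BINS = 1024          # assumed number of bins for continuous values
TEXT_OFFSET = 2000     # hypothetical offset keeping token ranges disjoint

def tokenize_continuous(values, low=-1.0, high=1.0):
    """Discretise continuous values (e.g. joint torques) into integer bins."""
    clipped = np.clip(values, low, high)
    bins = ((clipped - low) / (high - low) * (N_BINS - 1)).astype(int)
    return bins.tolist()

def tokenize_text(text, vocab):
    """Toy word-level text tokenizer, offset into its own token range."""
    return [TEXT_OFFSET + vocab[word] for word in text.split()]

# Serialise a text instruction and a continuous observation into one sequence.
vocab = {"stack": 0, "the": 1, "red": 2, "block": 3}
sequence = tokenize_text("stack the red block", vocab) \
         + tokenize_continuous(np.array([0.1, -0.5, 0.9]))
print(sequence)  # → [2000, 2001, 2002, 2003, 562, 255, 971]
```

Keeping each modality in a disjoint token range lets a single transformer consume the flat sequence without ambiguity about what each token represents.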
During deployment, a prompt is tokenised to form the initial sequence; the environment then sends the first observation, which is tokenised and appended to the sequence. The model samples the action vector autoregressively, one token at a time. Once all of the action’s tokens have been sampled, the action is decoded and sent to the environment, which returns a new observation, and the loop repeats.
Source: DeepMind research paper
The model demonstrates that transformer sequence models can serve as effective multi-task policies, including for real-world vision and robotics tasks. Gato is a first step toward agents that learn new tasks via prompting instead of being trained from scratch.