DeepMind’s Gato is the Swiss Army Knife of AI models

Gato is a multi-modal, multi-task, multi-embodiment generalist policy.

The arrival of deep neural networks has been a watershed moment in artificial intelligence history. We have made huge strides in natural language understanding and object recognition in a short period. However, most models excel at only one of these tasks, not both.

Enter Gato. 


DeepMind has leveraged the advances in large-scale language modelling to build a single generalist agent beyond the scope of text outputs. Gato is a multi-modal, multi-task, multi-embodiment generalist policy: The same network with the same weights can play Atari, caption images, chat and stack blocks with a real robot arm.

How does Gato work?

To train Gato, the researchers collected data from different tasks and modalities. The data was serialized into a flat sequence of tokens, batched, and processed by a transformer neural network. “While any general sequence model can work for next token prediction, we chose a transformer for simplicity and scalability,” the researchers stated in the paper. They used a 1.2 billion parameter decoder-only transformer with 24 layers and an embedding size of 2048.
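Part of this serialization step is turning continuous data, such as robot joint readings and actions, into discrete tokens the transformer can predict. A minimal sketch of that idea, using mu-law-style companding into 1024 bins; the function names and exact constants here are illustrative, not the paper's implementation:

```python
import numpy as np

def mu_law_encode(x, mu=100.0, m=256.0):
    # Compand continuous values so small magnitudes get finer resolution.
    return np.sign(x) * np.log(np.abs(x) * mu + 1.0) / np.log(m * mu + 1.0)

def tokenize_continuous(x, num_bins=1024):
    # Clip the companded values to [-1, 1], then map uniformly to integer bins.
    y = np.clip(mu_law_encode(np.asarray(x, dtype=np.float64)), -1.0, 1.0)
    bins = np.floor((y + 1.0) / 2.0 * num_bins).astype(int)
    return np.minimum(bins, num_bins - 1)  # keep y == 1.0 inside the last bin
```

For example, `tokenize_continuous([-1.0, 0.0, 1.0])` yields increasing token ids, with 0.0 landing in the middle bin (512), so the model sees continuous values as just another token vocabulary.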

Gato is trained on many datasets with information about agent experience in simulated and real-world environments. Natural language and image datasets were also used. 

A prompt is tokenized during the deployment phase to form the initial sequence. Following this, the environment yields the first observation, which is tokenized and appended to the sequence. Next, Gato samples the action vector autoregressively, one token at a time, and once all tokens have been sampled, Gato decodes the action and sends it to the environment. The environment then yields a new observation, and the process is repeated in a loop. “The model always sees all previous observations and actions within its context window of 1024 tokens,” the researchers said. 
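That deployment loop can be sketched as follows. The model and environment here are stand-in stubs (the real model is the trained transformer and the real environment is, e.g., an Atari emulator or robot arm); the constants and class names are hypothetical:

```python
from collections import deque

CONTEXT_LEN = 1024  # Gato's context window, per the paper
ACTION_LEN = 3      # tokens per action vector (illustrative)

class StubModel:
    """Stand-in for the trained transformer: deterministic dummy sampling."""
    def sample_next_token(self, context):
        return (len(context) * 7) % 1024

class StubEnv:
    """Stand-in environment yielding fake observation tokens."""
    def __init__(self, steps=3):
        self.steps = steps
    def reset(self):
        return [1, 2, 3]  # initial observation, already tokenized
    def step(self, action_tokens):
        self.steps -= 1
        return [4, 5, 6], self.steps <= 0  # next observation, done flag

def run_episode(model, env, prompt_tokens):
    # The prompt is tokenized to form the initial sequence; the deque drops
    # old tokens so the model only ever sees the last CONTEXT_LEN tokens.
    context = deque(prompt_tokens, maxlen=CONTEXT_LEN)
    obs, done = env.reset(), False
    actions = []
    while not done:
        # Observation tokens are appended to the sequence.
        context.extend(obs)
        # Sample the action vector autoregressively, one token at a time.
        action = []
        for _ in range(ACTION_LEN):
            tok = model.sample_next_token(list(context))
            action.append(tok)
            context.append(tok)
        actions.append(action)
        # Decode and send the action; the environment yields a new observation.
        obs, done = env.step(action)
    return actions

episode = run_episode(StubModel(), StubEnv(steps=3), prompt_tokens=[10, 11])
```

The key point is that sampled action tokens are fed back into the same context as the observations, so one sequence model drives the whole interaction.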

How does Gato stack up against other models?

The success stories of GPT-3, Gopher and Flamingo inspired the DeepMind researchers to push the limits of generalist language models and generalist visual language models.

Early this year, Google introduced Pathways Language Model (PaLM), building on the Pathways system announced before. The 540-billion parameter, dense decoder-only Transformer model was trained with the Pathways system, which made it possible to train a single model across multiple TPU v4 Pods efficiently. With Pathways, Google Research’s end game is to build a single model that could generalize across domains and tasks while being highly efficient. PaLM achieved state-of-the-art few-shot performance across hundreds of language understanding and generation tasks, in many cases by significant margins.

In January, Meta AI released data2vec, the first high-performance self-supervised algorithm that works across multiple modalities. data2vec outperformed the previous best single-purpose algorithms for computer vision and speech and was competitive on NLP tasks. The algorithm marks a shift towards holistic self-supervised learning, bringing us closer to building machines that can make sense of the world. 

DeepMind’s Gopher is a 280-billion-parameter NLP model based on the Transformer architecture and trained on MassiveText, a 10.5TB dataset. Gopher surpassed the previous state-of-the-art on 100 evaluation tasks. The model was also tested on NLP benchmarks, including Massive Multitask Language Understanding (MMLU) and BIG-bench, and its performance was compared to other baseline models. Gopher showed steady improvement on knowledge-intensive tasks but not so much on reasoning-heavy tasks.

In the same league as Gopher, Google’s Generalist Language Model (GLaM) is a trillion-weight mixture-of-experts model, with different submodels specialized for different inputs. GLaM achieved competitive performance on multiple few-shot learning tasks, performing on par on seven tasks while using 5x less computation during inference. The tasks included open-domain question answering, commonsense reasoning, in-context reading comprehension, the SuperGLUE tasks and natural language inference.
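The mixture-of-experts idea behind GLaM is that a router activates only a small subset of expert submodels per input, so compute per token stays modest even though total parameter count is enormous. A minimal sketch of top-2 gating over toy one-layer experts; all names, shapes and the `tanh` experts are illustrative, not GLaM's actual architecture:

```python
import numpy as np

def top2_gate(x, gate_weights):
    # Router: score every expert, keep the two highest, renormalize with softmax.
    logits = x @ gate_weights                     # shape: (num_experts,)
    top2 = np.argsort(logits)[-2:]                # indices of the top-2 experts
    w = np.exp(logits[top2] - logits[top2].max())
    return top2, w / w.sum()

def moe_layer(x, expert_weights, gate_weights):
    # Only the two selected experts run; the rest of the parameters stay idle,
    # which is how a trillion-weight model keeps inference cheap.
    experts, weights = top2_gate(x, gate_weights)
    out = np.zeros_like(x)
    for e, w in zip(experts, weights):
        out += w * np.tanh(x @ expert_weights[e])  # toy expert: one dense layer
    return out
```

Each input activates a different pair of experts, which is what lets the submodels specialize for different kinds of inputs.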

Avi Gopani
Avi Gopani is a technology journalist who analyses industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories curated with a focus on the evolving technologies of artificial intelligence and data analytics.
