Facebook Inches Closer To Building General-Purpose Intelligence Agents With UniT

Recently, researchers from Facebook AI Research (FAIR) introduced a new Transformer model, known as Unified Transformer (UniT), with the ability to learn tasks across multiple domains simultaneously. According to the researchers, the model takes images and text as inputs and jointly trains on tasks ranging from visual perception and language understanding to combined vision-and-language reasoning.

In the past few years, Transformer models have proved their worth in a wide range of domains, such as natural language, images, video and audio. Language models like BERT, GPT, XLNet and ALBERT have made huge strides.


Transformers trained on huge datasets can learn strong representations for a broad range of downstream language tasks. According to the researchers, despite the achievements of Transformer models in specific domains, there have been few efforts to connect multiple tasks across domains.

The Facebook AI researchers started with the question: “Could a Transformer trained for natural language inference on textual input also accomplish object detection on images, or could a Transformer-based image classifier also perform textual entailment?”


Behind UniT

Figure: The UniT model architecture (from the paper).

UniT is built on the Transformer encoder-decoder architecture: a separate encoder for each input modality encodes that input as a sequence of hidden states, a decoder attends over the encoded modalities, and simple task-specific output heads produce the final predictions for each given task.
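The routing described above (per-modality encoders, a shared decoder, per-task heads) can be sketched in a few lines. This is a minimal, framework-free illustration of the control flow only; all class, task, and modality names here are illustrative stand-ins, not identifiers from the paper's released code.

```python
# Minimal sketch of the UniT control flow: encode each modality,
# decode over the concatenated hidden states, apply a task head.
class UniTSketch:
    def __init__(self, encoders, decoder, heads):
        self.encoders = encoders  # one encoder per input modality
        self.decoder = decoder    # shared decoder over encoded modalities
        self.heads = heads        # one output head per task

    def forward(self, inputs, task):
        # 1. Encode each input modality as a sequence of hidden states
        hidden = [self.encoders[m](x) for m, x in inputs.items()]
        # 2. Decode over the concatenation of all encoded modalities
        decoded = self.decoder(sum(hidden, []))
        # 3. The task-specific head produces the final prediction
        return self.heads[task](decoded)

# Toy stand-ins just to show the routing: encoders tag tokens by
# modality, the decoder is an identity function, the head reports size.
encoders = {"image": lambda x: [f"img:{t}" for t in x],
            "text":  lambda x: [f"txt:{t}" for t in x]}
decoder = lambda seq: seq
heads = {"vqa": lambda seq: f"vqa answer from {len(seq)} tokens"}

model = UniTSketch(encoders, decoder, heads)
out = model.forward({"image": ["patch1", "patch2"],
                     "text": ["what", "is"]}, task="vqa")
print(out)  # vqa answer from 4 tokens
```

In a real implementation the encoders, decoder and heads would be Transformer modules sharing one set of trained parameters, but the dispatch logic is the same.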

The researchers stated: “Compared to the previous work on multi-task learning with Transformers, we train Unified Transformer and achieve comparable performance to the well-established prior work on a much larger variety of tasks; that is not only joint vision-and-language tasks such as visual question-answering (VQA), but also vision-only as well as language-only tasks.”


  • The researchers proposed a unified Transformer encoder-decoder architecture called UniT that connects and learns multiple tasks and domains in a single model.
  • The model jointly learned the tasks in the visual and textual domains and their intersections, such as visual question answering, object detection, visual entailment, and natural language understanding tasks in the GLUE benchmark, including QNLI, MNLI, QQP and SST-2. The researchers also showed these diverse tasks could be learned simultaneously and converge properly under the training scheme.
  • Multimodal tasks such as visual question-answering (VQA) and visual entailment benefit from multi-task training with uni-modal tasks.
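A common way to realize the joint training scheme the bullets describe is to sample one task per optimization step and draw a batch from that task's dataset, updating a single shared model. The sketch below illustrates this round-robin-by-sampling idea; the task names and the sampling strategy are assumptions for illustration and may differ from the paper's exact scheme.

```python
import itertools
import random

def multitask_train(tasks, num_steps, seed=0):
    """Sample one task per step and draw a batch from its dataset.

    tasks: dict mapping task name -> iterator of batches.
    Returns the schedule of tasks visited (for inspection here).
    """
    rng = random.Random(seed)
    schedule = []
    for step in range(num_steps):
        task = rng.choice(list(tasks))   # pick which task trains this step
        batch = next(tasks[task])        # draw a batch for that task
        # In a real loop: loss = model(batch, task); backprop; update
        # the single set of shared parameters.
        schedule.append(task)
    return schedule

# Toy datasets: infinite iterators standing in for real data loaders.
tasks = {"vqa": itertools.count(),
         "detection": itertools.count(),
         "mnli": itertools.count()}
schedule = multitask_train(tasks, num_steps=6)
print(schedule)
```

Because every step updates the same shared parameters, uni-modal batches (e.g. detection or MNLI) can regularize and improve the multimodal tasks, which is the benefit the researchers report for VQA and visual entailment.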

Wrapping Up

The Facebook AI researchers demonstrated that the Transformer framework can be applied across various domains to handle multiple tasks within a single unified encoder-decoder model. The UniT model simultaneously addressed seven tasks across eight different datasets and achieved strong performance on each task with a single set of shared parameters.

According to the researchers, the UniT model has a domain-agnostic transformer architecture, which makes the model a huge step towards building general-purpose intelligence agents capable of handling a wide range of applications in different domains, including visual perception, language understanding, and reasoning over multiple modalities.

Read the full paper here.


Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
