Recently, researchers from Facebook AI Research (FAIR) introduced a new Transformer model, known as Unified Transformer (UniT), that can learn tasks across multiple domains simultaneously. According to the researchers, the model takes images and text as inputs and is jointly trained on tasks ranging from visual perception and language understanding to joint vision-and-language reasoning.
In the past few years, Transformer models have proved their worth in a wide range of domains, such as natural language, images, video and audio. Language models like BERT, GPT, XLNet and ALBERT have made huge strides.
Transformers trained on huge datasets can learn strong representations for a broad range of downstream language tasks. According to the researchers, despite the achievements of Transformer models within specific domains, there have not been many efforts to connect multiple tasks across domains.
The Facebook AI researchers started with the question: “Could a Transformer trained for natural language inference on textual input also accomplish object detection on images, or could a Transformer-based image classifier also perform textual entailment?”
Behind UniT
UniT model
UniT is built on the Transformer encoder-decoder architecture: a separate encoder for each input modality encodes that modality as a sequence of hidden states, a shared decoder attends over the encoded input modalities, and simple task-specific output heads produce the final predictions for each given task.
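To make that architecture concrete, here is a minimal PyTorch-style sketch of the idea: one encoder per modality, a shared decoder over the concatenated encodings, and a per-task head. The module choices, dimensions and task names below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a UniT-style model (illustrative only, not the authors' code).
# Assumes PyTorch; encoder internals use standard Transformer layers, and
# dimensions, query counts and task names are invented for illustration.
import torch
import torch.nn as nn

class UniTSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=100, tasks=("detection", "vqa")):
        super().__init__()
        # One encoder per input modality (image features, text tokens).
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        # A single decoder attends over the concatenated encoded modalities.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        # Learned query embeddings and a simple output head per task.
        self.queries = nn.Embedding(num_queries, d_model)
        self.heads = nn.ModuleDict({t: nn.Linear(d_model, 10) for t in tasks})

    def forward(self, image_feats, text_feats, task):
        # Encode each available modality as a sequence of hidden states.
        encoded = []
        if image_feats is not None:
            encoded.append(self.image_encoder(image_feats))
        if text_feats is not None:
            encoded.append(self.text_encoder(text_feats))
        memory = torch.cat(encoded, dim=1)           # (batch, seq, d_model)
        batch = memory.size(0)
        tgt = self.queries.weight.unsqueeze(0).expand(batch, -1, -1)
        hidden = self.decoder(tgt, memory)           # decode over all modalities
        return self.heads[task](hidden)              # task-specific predictions

# Example: a vision-and-language batch routed through the "vqa" head.
model = UniTSketch()
img = torch.randn(2, 49, 256)   # e.g. flattened image feature map
txt = torch.randn(2, 16, 256)   # e.g. embedded text tokens
out = model(img, txt, task="vqa")
print(out.shape)                # torch.Size([2, 100, 10])
```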
The researchers stated: “Compared to the previous work on multi-task learning with Transformers, we train Unified Transformer and achieve comparable performance to the well-established prior work on a much larger variety of tasks; that is not only joint vision-and-language tasks such as visual question-answering (VQA), but also vision-only as well as language-only tasks.”
Contributions
- The researchers proposed a unified Transformer encoder-decoder architecture called UniT that connects and learns multiple tasks and domains in a single model.
- The model jointly learned the tasks in the visual and textual domains and their intersections, such as visual question answering, object detection, visual entailment, and natural language understanding tasks in the GLUE benchmark, including QNLI, MNLI, QQP and SST-2. The researchers also showed these diverse tasks could be learned simultaneously and converge properly under the training scheme (a rough sketch of such a loop follows this list).
- Multimodal tasks such as visual question-answering (VQA) and visual entailment benefit from multi-task training with uni-modal tasks.
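The joint training scheme referenced above can be pictured as sampling one task per iteration and updating the same shared parameters for all of them. The loop below is a rough sketch under that assumption; the loader, loss-function and batch-format names are hypothetical, not taken from the paper.

```python
# Illustrative multi-task training loop (assumption: one task is sampled per
# iteration from its own data loader; names and batch format are hypothetical).
import random
import torch

def train_multitask(model, loaders, loss_fns, optimizer, steps=1000):
    iters = {name: iter(dl) for name, dl in loaders.items()}
    for step in range(steps):
        # Pick one task for this iteration; every task updates the same
        # shared model parameters.
        task = random.choice(list(loaders))
        try:
            batch = next(iters[task])
        except StopIteration:
            # Restart this task's loader once it is exhausted.
            iters[task] = iter(loaders[task])
            batch = next(iters[task])
        image_feats, text_feats, targets = batch
        preds = model(image_feats, text_feats, task=task)
        loss = loss_fns[task](preds, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```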
Wrapping Up
The Facebook AI researchers demonstrated that the Transformer framework can be applied across various domains to handle multiple tasks within a single unified encoder-decoder model. The UniT model simultaneously addressed seven tasks across eight different datasets and achieved strong performance on each task with a single set of shared parameters.
According to the researchers, the UniT model has a domain-agnostic transformer architecture, which makes the model a huge step towards building general-purpose intelligence agents capable of handling a wide range of applications in different domains, including visual perception, language understanding, and reasoning over multiple modalities.
Read the full paper here.