Inside Multimodal Neural Network Architecture That Has The Power To “Learn It All”

Multimodal machine learning is a multi-disciplinary research field that addresses some of the original goals of artificial intelligence by integrating and modelling multiple communicative modalities, including linguistic, acoustic and visual messages. In practice, it means building models that can process information from multiple sources. For example, a research paper on multimodal learning co-authored by Andrew Ng notes that audio and visual data for speech recognition can be correlated at a “mid-level”, as phonemes and visemes (lip pose and motions), but that these correlations are difficult to capture in raw pixels or audio waveforms.

Deep learning techniques have been successfully applied to unsupervised feature learning for single modalities (such as text, images or audio). Over the years, a main challenge for researchers has been “how to represent and summarize multi-modal data in a way that exploits the complementarity and redundancy of multiple modalities”.

Researchers from Google and the University of Toronto tackle a simple question: can we create a unified deep learning model to solve tasks across multiple domains? This translates to building a model which has the power to “learn it all”.

Getting deep learning to work in each modality, such as images, speech or text, requires tweaking the architecture as well as extensive research into the particular problem. AI researchers have now come up with a model that can be trained to perform multiple tasks. Earlier research from Google and the University of Toronto produced a single model that could deliver good results on numerous problems spanning multiple domains.

In fact, Google’s Multilingual Neural Machine Translation System, used in Google Translate, is a first step towards the convergence of vision, audio and language understanding into a single network, notes the Google Research blog.

Small Modality-Specific Sub-Networks

The proposed architecture contains convolutional layers, an attention mechanism, and sparsely-gated layers. Each of these computational blocks is crucial for a subset of the tasks the model is trained on. The researchers observed that even when a block was not crucial for a task, adding it never hurt performance, and in most cases it improved performance on all tasks.

The MultiModel architecture was trained on the following eight datasets:

  • WSJ speech corpus
  • ImageNet dataset
  • COCO image captioning dataset
  • WSJ parsing dataset
  • WMT English-German translation corpus
  • The reverse of the above: German-English translation
  • WMT English-French translation corpus
  • The reverse of the above: German-French translation

The sub-networks convert inputs into a joint representation space, which allows training on input data of widely different sizes and dimensions, such as images, sound waves and text. The researchers call these sub-networks ‘modality nets’; they define the transformations between the external domains and the unified representation. The modality nets are designed to be computationally minimal, so that most of the computation is performed within the domain-agnostic body of the model. The researchers stress that two design decisions were important:

  • The unified representation is of variable size
  • Different tasks from the same domain share modality nets, so the model avoids creating a sub-network for every task
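The idea of modality nets feeding one shared representation can be illustrated with a minimal NumPy sketch. It is not the researchers' implementation: the depth of 16, the vocabulary size, the patch-and-project image net and all function names are assumptions made for illustration. The point is only that text and images end up as sequences with the same feature depth but different lengths.

```python
import numpy as np

DEPTH = 16  # feature depth of the shared representation (assumed value)


def text_modality_net(token_ids, embedding):
    """Map a token-id sequence of shape (length,) to (length, DEPTH)."""
    return embedding[token_ids]


def image_modality_net(image, patch, projection):
    """Map an image of shape (H, W, C) to (num_patches, DEPTH).

    Stand-in for a learned image net: cut the image into non-overlapping
    patch x patch tiles and apply one linear projection to each tile.
    """
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    tiles = (
        image[: rows * patch, : cols * patch]
        .reshape(rows, patch, cols, patch, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(rows * cols, patch * patch * c)
    )
    return tiles @ projection


rng = np.random.default_rng(0)
embedding = rng.normal(size=(100, DEPTH))         # hypothetical 100-token vocabulary
projection = rng.normal(size=(4 * 4 * 3, DEPTH))  # weights for 4x4 RGB tiles

text_repr = text_modality_net(np.array([5, 17, 42]), embedding)
image_repr = image_modality_net(rng.normal(size=(8, 12, 3)), 4, projection)

print(text_repr.shape, image_repr.shape)
```

Here the three-token sentence becomes a 3 x 16 array and the 8 x 12 image becomes a 6 x 16 array, so a shared body can consume either one. In this sketch, every text task would reuse `text_modality_net` rather than getting its own sub-network.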

Let’s Look At Multimodal Architecture And Result

The MultiModel consists of a few small modality nets, an encoder, an I/O mixer, and an autoregressive decoder. It also uses three key computational blocks to get good performance across different problems:

  • Convolutions allow the model to detect local patterns and generalize across space
  • Attention layers allow the model to focus on specific elements to improve performance
  • Sparsely-gated mixture-of-experts layers give the model capacity without excessive computation cost
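To see why a sparsely-gated mixture-of-experts layer adds capacity cheaply, consider the simplified sketch below. It assumes plain top-k gating with linear experts and omits the noise and load-balancing terms used in published variants. The function names and shapes are hypothetical and do not come from the researchers' code.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def sparse_moe(x, gate_w, expert_ws, k=2):
    """Route each token to its k highest-scoring experts.

    x:         (tokens, d) input features
    gate_w:    (d, n_experts) gating weights
    expert_ws: (n_experts, d, d) one linear expert per slice
    Returns the mixed output (tokens, d) and the chosen expert ids (tokens, k).
    """
    logits = x @ gate_w                          # (tokens, n_experts)
    top_k = np.argsort(logits, axis=-1)[:, -k:]  # ids of the k largest logits
    out = np.zeros_like(x, dtype=float)
    for t in range(x.shape[0]):
        # Renormalise the gate over the selected experts only.
        weights = softmax(logits[t, top_k[t]])
        for w, e in zip(weights, top_k[t]):
            out[t] += w * (x[t] @ expert_ws[e])  # only k experts are evaluated
    return out, top_k


rng = np.random.default_rng(0)
tokens, d, n_experts = 5, 8, 16
x = rng.normal(size=(tokens, d))
gate_w = rng.normal(size=(d, n_experts))
expert_ws = rng.normal(size=(n_experts, d, d))

y, chosen = sparse_moe(x, gate_w, expert_ws, k=2)
print(y.shape, chosen.shape)
```

The layer holds 16 experts' worth of parameters, yet each token pays for only two expert evaluations. That trade-off is what the capacity-without-cost claim refers to.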

The architecture described above was implemented using TensorFlow and trained in various configurations. All training runs used the same set of hyperparameters and the Adam optimiser. The researchers tested the model by asking the following questions:

  • How far is the MultiModel trained on eight tasks simultaneously from state-of-the-art results?
  • How does training on eight tasks simultaneously compare to training on each task separately?
  • How do the different computational blocks discussed above influence different tasks?

For the first question, the researchers compared the performance of the eight-problem MultiModel with state-of-the-art results. For the second, they compared the jointly trained MultiModel with the same model trained separately on a single task. For the third, they checked how training without the mixture-of-experts layers or without the attention mechanism influences performance on different problems.
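The joint-versus-separate comparison can be pictured as two training schedules applied to the same model. The sketch below assumes simple round-robin alternation between tasks; the article does not say how batches from the eight tasks were interleaved, and the task identifiers are hypothetical labels for the datasets listed earlier.

```python
from itertools import cycle, islice

# Hypothetical identifiers for the eight training tasks.
TASKS = [
    "wsj_speech", "imagenet", "coco_captioning", "wsj_parsing",
    "wmt_en_de", "wmt_en_de_reverse", "wmt_en_fr", "wmt_en_fr_reverse",
]


def joint_schedule(tasks, steps):
    """Task to draw a batch from at each step when training jointly."""
    return list(islice(cycle(tasks), steps))


def single_task_schedule(task, steps):
    """Baseline: every step draws a batch from the same task."""
    return [task] * steps


print(joint_schedule(TASKS, 10))
print(single_task_schedule("wsj_parsing", 3))
```

Under the joint schedule, every step updates the shared parameters regardless of which task supplied the batch. This sharing is how a small task such as parsing can benefit from the larger ones.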

The results achieved by the architecture were similar to those that task-specific models achieve without heavy tuning. The jointly trained model performed very close to individually trained models on tasks where large amounts of data were available. But this is where it gets interesting: it performed significantly better on tasks where less data is available, such as parsing. The third question examines what happens when some blocks or components are excluded. Again, the results are encouraging: even on the ImageNet task, the presence of these blocks does not decrease performance, and may even slightly improve it.


According to researchers Lukasz Kaiser, Senior Research Scientist at Google Brain, and Aidan N Gomez from the University of Toronto, the research successfully demonstrated that a single deep learning model can jointly learn a number of large-scale tasks from multiple domains. “The key to success comes from designing a multi-modal architecture in which as many parameters as possible are shared and from using computational blocks from different domains together. We believe that this treads a path towards interesting future work on more general deep learning architectures, especially since our model shows transfer learning from tasks with a large amount of available data to ones where the data is limited,” they shared.

Abhijeet Katte
As a thorough data geek, most of Abhijeet's day is spent in building and writing about intelligent systems. He also has deep interests in philosophy, economics and literature.
