New AI Framework Can Translate Multimodal Inputs, Including Video, To Natural Language

An AI model that can understand real-world situations and communicate with humans in natural language is the holy grail of artificial intelligence. Building such agents requires multimodal systems that can understand audio-visual input.

With the introduction of large-scale benchmarks for interpreting audiovisual information and translating it into natural language, progress towards generally capable AI is accelerating.

Recently, researchers from Facebook, Columbia University, and Georgia Tech developed VX2TEXT, a framework that converts semantic signals from video, text, or audio into natural language that humans can understand.

VX2TEXT is a framework for generating text from multimodal inputs such as videos, speech, and audio. Humans can effortlessly combine context from two different modalities, such as visual and text; machines, however, struggle to do so.

The new approach enables language models to directly interpret multimodal data and opens up the possibility of multimodal fusion: combining signals from different modalities (text, audio, video) to improve classification.

In other words, multimodal learning incorporates all available input modalities into the learning process.

What Is VX2TEXT?

To perform well on benchmarks, an AI model must be able to:

  • Extract features from each modality
  • Combine different inputs (audio, video, or text) to address the query at hand
  • Present the result in a text format comprehensible to humans

The proposed VX2TEXT model accomplishes these objectives in a unified, end-to-end trainable framework built on modality-specific classifiers.

The approach takes the textual labels of the top classes predicted by classifiers pretrained on existing datasets and translates them into word embeddings using a pre-trained language model.
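The label-to-embedding step above can be sketched in a few lines. This is an illustrative toy, not the authors' code: the modality predictions and the `embed` lookup are dummy stand-ins for real pretrained classifiers and a real language model's token-embedding table.

```python
# Hypothetical top-k textual labels from pretrained modality classifiers.
modality_predictions = {
    "video": ["cooking", "kitchen"],
    "audio": ["speech", "sizzling"],
}

def embed(word):
    """Toy stand-in for a pretrained LM's word-embedding lookup.

    Maps a word to a fixed-size vector (here, scaled character codes
    padded to dimension 4) so the sketch stays self-contained.
    """
    codes = [ord(c) / 100.0 for c in word[:4]]
    return codes + [0.0] * (4 - len(codes))

# Translate every predicted label into an embedding, keeping the
# modality grouping so a language encoder could fuse them as one sequence.
label_embeddings = {
    modality: [embed(label) for label in labels]
    for modality, labels in modality_predictions.items()
}

for modality, vectors in label_embeddings.items():
    print(modality, len(vectors), "label embeddings of dim", len(vectors[0]))
```

The key idea is that once every modality's signal is expressed as embeddings in the language model's own space, no special cross-modal module is needed to combine them.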

This matters because VX2TEXT carries out multimodal fusion through a language encoder such as Google's T5, a neural network that casts language problems into a text-to-text format. Unlike other models, VX2TEXT achieves multimodal fusion without cross-modal network modules, which simplifies the design and improves performance.
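The text-to-text framing mentioned above can be illustrated with plain strings: every task becomes an input string and an output string, so one model serves all tasks. The prefixes below are illustrative assumptions, not the exact prompts used by VX2TEXT or T5.

```python
def to_text_to_text(task, context, query=None):
    """Format a multimodal task as a single text-to-text input string."""
    if task == "question_answering":
        return f"question: {query} context: {context}"
    if task == "captioning":
        return f"caption: {context}"
    raise ValueError(f"unknown task: {task}")

# Hypothetical fused context: class labels recovered from video and audio.
qa_input = to_text_to_text(
    "question_answering",
    context="video: person chopping onions audio: sizzling",
    query="What is the person cooking?",
)
print(qa_input)
```

Because both tasks share one input/output format, the same encoder-decoder can answer questions and produce captions without task-specific heads.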

To present results in text comprehensible to humans, the model employs a generative text decoder that transforms the fused multimodal features into text. While other multimodal models rely on encoder-only architectures, VX2TEXT takes an open-ended sentence-generation approach: alongside the encoder, it uses a text decoder that tackles ‘video+x to text’ problems such as question generation, question answering, and captioning. All of these tasks are handled by the same architecture, eliminating the need for specialised network heads for each task.
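Open-ended generation means producing the output one token at a time, rather than picking from a fixed answer set as encoder-only models do. A minimal sketch of that decoding loop, with a toy lookup table standing in for a trained decoder:

```python
# Toy next-token table standing in for a trained text decoder.
NEXT_TOKEN = {
    "<bos>": "a",
    "a": "person",
    "person": "is",
    "is": "cooking",
    "cooking": "<eos>",
}

def greedy_decode(start="<bos>", max_len=10):
    """Generate tokens autoregressively until <eos> or max_len."""
    tokens = []
    current = start
    for _ in range(max_len):
        current = NEXT_TOKEN.get(current, "<eos>")
        if current == "<eos>":
            break
        tokens.append(current)
    return " ".join(tokens)

print(greedy_decode())  # "a person is cooking"
```

A real decoder would score the whole vocabulary at each step and condition on all tokens generated so far, but the loop structure is the same.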

Credit: VX2TEXT


The model was evaluated by studying the effect of individual modalities on video-based text-generation performance, training and testing it with different combinations of inputs.
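Such an ablation simply enumerates the modality subsets and scores each one. The sketch below shows the enumeration; `evaluate` is a hypothetical placeholder returning a toy score, not a real benchmark metric.

```python
from itertools import combinations

MODALITIES = ["video", "audio", "text"]

def evaluate(inputs):
    """Placeholder metric: more modalities -> higher toy score."""
    return 10.0 * len(inputs)

# Score every non-empty combination of input modalities.
results = {}
for k in range(1, len(MODALITIES) + 1):
    for combo in combinations(MODALITIES, k):
        results[combo] = evaluate(combo)

for combo, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print("+".join(combo), "->", score)
```

In the paper's setting, each combination would require training and testing the model on that input subset; comparing the scores reveals how much each modality contributes.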

The team analysed the performance of the VX2TEXT model on three benchmarks: AVSD, TVQA, and TVC.

  • On Audio-Visual Scene Aware Dialog (AVSD), VX2TEXT outperformed other models, both with and without text captions as input.
  • Even without additional multimodal pretext training data, the new approach improved on previous state-of-the-art models by 1.4 percent on the TVQA (question-answering) benchmark.
  • On the captioning task of the TVC benchmark, VX2TEXT outperformed state-of-the-art multimodal machine translation (MMT) systems.

Read the full paper here.


Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world, with a special interest in analysing its long-term impact on individuals and societies.
