An agent that can understand real-world situations and communicate with humans in natural language is a holy grail of artificial intelligence. Building such agents requires multimodal systems that can interpret audio-visual input.
With the introduction of large-scale benchmarks for interpreting audiovisual information and translating it into natural language, progress towards such general-purpose AI is accelerating.
Recently, researchers from Facebook, Columbia University, and Georgia Tech developed VX2TEXT, which converts semantic signals from video, text, or audio into a common semantic language that humans can understand.
VX2TEXT is a framework for generating text from multimodal inputs such as video, speech and audio. Humans can readily combine context from different modalities, such as vision and text; machines struggle to do the same.
The new approach enables language models to directly interpret multimodal data and opens up the possibility of multimodal fusion: combining signals from different modalities (text, audio, video) to improve classification.
Multimodal learning incorporates all of these inputs into the learning process.
What Is VX2TEXT?
To perform well on benchmarks, an AI model must be able to:
- Extract features from each modality
- Combine the different inputs (audio, video, or text) to address the query at hand
- Present the result in a text format comprehensible to humans
The proposed VX2TEXT model accomplishes these objectives in a unified, end-to-end trainable framework built on modality-specific classifiers.
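The three steps above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' code: the stub classifiers, their label/score outputs, and the prompt format are all made-up stand-ins for the pretrained networks and the T5-style text encoder the paper uses.

```python
# Toy sketch of a VX2TEXT-style pipeline. Every function below is a
# hypothetical stand-in, not the actual VX2TEXT implementation.

def video_classifier(video):
    # Stand-in for a pretrained action-recognition network:
    # returns (class label, confidence score) pairs for the clip.
    return [("cooking", 0.91), ("chopping", 0.74)]

def audio_classifier(audio):
    # Stand-in for a pretrained audio-event classifier.
    return [("sizzling", 0.88), ("speech", 0.45)]

def top_labels(predictions, top_k=2):
    # Step 1: keep only the textual labels of the top-k predicted classes.
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    return [label for label, _ in ranked[:top_k]]

def vx2text_prompt(video, audio, question):
    # Step 2: fuse modalities by mapping everything into text space --
    # the non-text modalities simply become extra tokens next to the query.
    video_labels = top_labels(video_classifier(video))
    audio_labels = top_labels(audio_classifier(audio))
    fused = " ".join(["question:", question,
                      "video:", *video_labels,
                      "audio:", *audio_labels])
    # Step 3: a generative encoder-decoder (T5 in the paper) would now
    # turn this fused input into an answer; here we just return the prompt.
    return fused
```

In the real system the fused sequence is fed to a language encoder and decoded into free-form text, so one network serves every downstream task.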
The approach takes the textual labels of the top classes predicted by classifiers pretrained on existing datasets and translates them into word embeddings using a pretrained language model.
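Concretely, that step amounts to ranking the classifier's scores, keeping the label strings, and looking each one up in an embedding table. In this hedged sketch the class scores are invented and a random table stands in for the language model's learned embeddings:

```python
import numpy as np

# Made-up classifier output: class label -> confidence score.
class_scores = {"guitar": 0.08, "singing": 0.81, "applause": 0.63, "rain": 0.02}

# Keep the textual labels of the top-2 predicted classes.
top = sorted(class_scores, key=class_scores.get, reverse=True)[:2]
# -> ["singing", "applause"]

# A pretrained language model would map each label to a dense vector;
# a random lookup table stands in for those learned embeddings here.
rng = np.random.default_rng(0)
embedding_table = {label: rng.standard_normal(4) for label in class_scores}
label_embeddings = np.stack([embedding_table[label] for label in top])
```

Because the labels are ordinary words, the resulting vectors live in the same space as the rest of the model's text input, which is what makes the later fusion step trivial.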
What this model achieves is significant because it performs multimodal fusion through language encoders such as Google's T5, a neural network that recasts language problems in a text-to-text format. Unlike other models, the VX2TEXT approach achieves multimodal fusion without cross-modal network modules, which simplifies the design and improves performance.
To present the result in text comprehensible to humans, the model employs a generative text decoder that transforms the fused multimodal features into text. While many multimodal models rely on encoder-only architectures, VX2TEXT takes an open-ended sentence-generation approach: alongside the encoder, it uses a text decoder that tackles 'video+x to text' problems such as question generation, question answering and captioning. All of these tasks are handled by the same architecture, eliminating the need for specialised network heads per task.
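The trick that lets one architecture cover every task is T5-style task phrasing: each problem is rewritten as a text-to-text instance, so only the input string changes, never the network. A minimal sketch, assuming illustrative prefixes (the exact strings below are not taken from the paper):

```python
# Phrase each "video+x to text" task as text-to-text, T5-style.
# The task names and prefix strings are illustrative assumptions.

def format_task(task, context_labels, query=""):
    context = " ".join(context_labels)
    if task == "answer":
        return f"answer question: {query} context: {context}"
    if task == "caption":
        return f"caption video: {context}"
    if task == "ask":
        return f"generate question: {context}"
    raise ValueError(f"unknown task: {task}")
```

A single encoder-decoder trained on such strings can then be pointed at captioning, answering, or question generation simply by switching the prefix.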
The model was evaluated by studying the effect of individual modalities on video-based text generation performance. This evaluation was done by training and testing the model with different combinations of inputs.
The team analysed the performance of the VX2TEXT model on three benchmarks: AVSD, TVQA, and TVC.
- With Audio-Visual Scene Aware Dialog (AVSD), VX2TEXT was observed to perform better than other models, both with and without text captions as input.
- Even without additional multimodal pretext training data, the new approach improved on previous state-of-the-art models by 1.4 per cent on the TVQA (question-answering) benchmark.
- On the captioning task of the TVC benchmark, VX2TEXT outperformed state-of-the-art multimodal machine translation (MMT) systems.
Read the full paper here.