Meta AI’s MuAViC Sets New Benchmark for Highly Accurate Speech Translation

MuAViC can deliver superior speech translation in challenging, noisy environments.

Meta AI has unveiled MuAViC (Multilingual Audio-Visual Corpus), a new benchmark that incorporates audio-visual learning to achieve highly accurate speech translation.

Building on its previous AI models such as AV-HuBERT and RAVen, which use visual information to improve English speech recognition, Meta AI has used MuAViC to train its AV-HuBERT model to deliver superior speech translation in challenging, noisy environments.

The model handles noise gracefully, relying more heavily on the visual modality when the audio modality is distorted. The models were tested in noisy and noise-free environments against a top-performing model on speech recognition and X-En (non-English to English) speech translation tasks.
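
To make the noisy-environment testing concrete, a common setup is to mix noise into clean audio at a fixed signal-to-noise ratio (SNR) and score the model on both the clean and noisy versions. The sketch below illustrates this generic approach; `model.transcribe`, `clean_audio`, and `babble_noise` are hypothetical placeholders, not MuAViC or AV-HuBERT APIs.

```python
# Minimal sketch of noise-robust evaluation: mix noise into clean audio at a
# target SNR, then compare model output on clean vs. noisy input.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested signal-to-noise ratio."""
    noise = np.resize(noise, speech.shape)           # loop/trim noise to length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12        # avoid division by zero
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Hypothetical usage: evaluation gets harder as the SNR drops.
# for snr in (10, 5, 0, -5):
#     noisy = mix_at_snr(clean_audio, babble_noise, snr)
#     print(snr, model.transcribe(noisy, video_frames))
```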


Read the full paper here

Training Process


The shortage of adequate training data previously hindered the exploration of audio-visual understanding for speech translation. Compared to audio data alone, gathering and processing audio-video data requires more resources.

MuAViC is the most extensive multilingual benchmark for audio-visual speech recognition to date, comprising about 1,200 hours of transcribed data across nine languages.

For English speech, the team repurposed the audio-visual data from the LRS3 dataset and aligned it with a machine translation corpus using a text-matching algorithm. They matched the examples with the corresponding target sentences in the machine translation corpus to generate translation labels, employing exact text matching for examples in the development and test sets. For training set examples without matches, they used a machine translation model to obtain pseudo-translation labels.
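
As a rough sketch of that alignment step (not Meta's actual pipeline), the Python below exact-matches normalized transcripts against a machine translation corpus to inherit translation labels, and falls back to an MT model for unmatched training examples. The `mt_corpus` mapping and `translate` function are illustrative stand-ins.

```python
# Hedged sketch of text-matching alignment: exact matches inherit real
# translation labels; the remainder get pseudo-labels from an MT model.
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't block a match."""
    return " ".join(text.lower().split())

def label_translations(transcripts, mt_corpus, translate):
    """transcripts: utterance id -> transcript; mt_corpus: source sentence -> target."""
    index = {normalize(src): tgt for src, tgt in mt_corpus.items()}
    labels = {}
    for utt_id, transcript in transcripts.items():
        match = index.get(normalize(transcript))
        # Exact matches yield real translation labels; unmatched training
        # examples get pseudo-translation labels from the MT model.
        labels[utt_id] = match if match is not None else translate(transcript)
    return labels
```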

For speeches in non-English languages, Meta used the audio-only data, transcriptions, and text translations from a speech translation dataset. The team obtained the video tracks of the original recordings and aligned the processed video data with the audio data to create audio-visual data. Although all of the audio data is transcribed, only a subset of it has human translations; to create pseudo-translation labels for the rest, the team used the same machine translation model as before.
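
As an illustration of how a video track can be paired with an audio utterance (a generic approach, not necessarily Meta's pipeline), one can cut the original video at the timestamps of the corresponding audio segment, for example with ffmpeg. Paths and timestamps here are examples.

```python
# Illustrative sketch: extract the video frames that correspond to one audio
# utterance, so each clip pairs an utterance of audio with its video track.
# Assumes ffmpeg is installed and on the PATH.
import subprocess

def cut_video_segment(video_path: str, start: float, end: float, out_path: str) -> None:
    """Extract the [start, end] seconds of `video_path` into `out_path`."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
         "-an",                 # drop the embedded audio; it is paired separately
         out_path],
        check=True,
    )

# cut_video_segment("talk.mp4", 12.40, 17.85, "utt_0001.mp4")
```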

The team utilized Meta's AV-HuBERT architecture to produce speech recognition and speech translation models that process both audio and video data end-to-end. Given paired audio and video inputs, the model combines their representations into a single space that can be used for either task. Even if one modality is absent, AV-HuBERT can still process the available data, albeit with reduced effectiveness.
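
A minimal PyTorch sketch of that fusion idea, heavily simplified from the released AV-HuBERT architecture, might look like the following. The dimensions and layer counts are illustrative only, and a missing modality is replaced with zeros so the encoder still runs, mirroring how modality dropout lets the model degrade gracefully.

```python
# Simplified audio-visual fusion encoder (illustrative, not AV-HuBERT itself):
# project each modality, concatenate into one shared representation, and
# tolerate a missing modality by substituting zeros.
import torch
import torch.nn as nn

class AVFusionEncoder(nn.Module):
    def __init__(self, audio_dim=104, video_dim=512, hidden=768):
        super().__init__()
        self.hidden = hidden
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, audio=None, video=None):
        # Zero out a missing modality so the encoder still produces usable
        # (if weaker) representations from whatever input is available.
        ref = audio if audio is not None else video
        B, T = ref.shape[:2]
        a = self.audio_proj(audio) if audio is not None else torch.zeros(B, T, self.hidden, device=ref.device)
        v = self.video_proj(video) if video is not None else torch.zeros(B, T, self.hidden, device=ref.device)
        return self.backbone(torch.cat([a, v], dim=-1))   # shared audio-visual space

# enc = AVFusionEncoder()
# feats = enc(audio=torch.randn(2, 50, 104), video=torch.randn(2, 50, 512))
# audio_only = enc(audio=torch.randn(2, 50, 104))          # video modality absent
```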

Last week, Meta released LLaMA, a set of foundation language models ranging from 7B to 65B parameters. The models, along with their weights, were leaked and are now available to download through torrents. Christopher King, a GitHub user, submitted a pull request to the LLaMA GitHub page that included a torrent link to the open model. LLaMA-13B surpasses OpenAI's GPT-3 (175B) while being more than ten times smaller, and LLaMA-65B is comparable to DeepMind's Chinchilla-70B and Google's PaLM-540B.


