Meta develops a novel speech-to-speech translation system without intermediate text generation

Instead of spectrograms, the researchers used discretized speech units obtained from the clustering of self-supervised speech representations.

Meta AI has introduced a direct speech-to-speech translation (S2ST) approach that does not rely on text generation as an intermediate step, enabling faster inference and supporting translation between unwritten languages. According to Meta, the method outperforms previous approaches and is the first direct S2ST system trained on real-world, open-source audio data, rather than synthetic audio, for multiple language pairs.

How does it work?

Recent speech-to-speech modeling work takes the same approach as traditional text-to-speech synthesis. These models directly translate source speech into target speech spectrograms, which represent the spectrum of frequencies as multidimensional continuous-value vectors. Training translation models with speech spectrograms as the target can be difficult, however, because the models must learn several different aspects of the relationship between two languages at once: how the languages align with one another, for example, and how their acoustic and linguistic characteristics compare.
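To make "continuous-value vectors" concrete, the following is a minimal sketch of how a spectrogram can be computed, assuming a naive discrete Fourier transform over fixed-size frames. The function names, the 8 kHz sample rate, and the frame and hop sizes are illustrative choices, not details from Meta's system, and production pipelines would normally add a window function and a mel filterbank.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for one frame (frequency bins 0..N/2)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, frame_size=64, hop=32):
    """Slice the signal into overlapping frames and take each frame's spectrum."""
    return [
        magnitude_spectrum(signal[start:start + frame_size])
        for start in range(0, len(signal) - frame_size + 1, hop)
    ]


# Hypothetical input: a pure 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]

spec = spectrogram(tone)
# Each row is one time frame; each entry is a real-valued energy for a frequency bin.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin)
```

Every frame here is a vector of real numbers, so a model that predicts spectrograms has to regress many continuous values per time step.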


Instead of spectrograms, the researchers used discretized speech units obtained by clustering self-supervised speech representations. Compared with spectrograms, discrete units can disentangle linguistic content from prosodic speech information and take advantage of existing natural language processing modeling techniques. Using discretized speech units, the researchers report three notable advancements: the S2ST system outperforms previous direct S2ST systems; it is the first direct S2ST system trained on real S2ST data for multiple language pairs; and it leverages pre-training with unlabeled speech data.
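The idea of turning frame-level speech representations into discrete units can be sketched with plain k-means clustering. This is an illustration under stated assumptions, not Meta's pipeline: random Gaussian vectors stand in for the self-supervised features, the number of units is arbitrary, all function names are hypothetical, and the final step of collapsing consecutive repeats is an assumed post-processing choice.

```python
import random
from itertools import groupby


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(frame, centroids):
    """Index of the centroid closest to one feature frame."""
    return min(range(len(centroids)), key=lambda k: squared_distance(frame, centroids[k]))


def fit_kmeans(frames, num_units, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns the learned centroids."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, num_units)]
    for _ in range(iterations):
        buckets = [[] for _ in range(num_units)]
        for frame in frames:
            buckets[nearest_centroid(frame, centroids)].append(frame)
        for k, bucket in enumerate(buckets):
            if bucket:  # keep the previous centroid if a cluster is empty
                centroids[k] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def to_units(frames, centroids):
    """Map every frame to the index of its cluster: one discrete unit per frame."""
    return [nearest_centroid(f, centroids) for f in frames]


def collapse_repeats(units):
    """Merge runs of identical consecutive units (assumed post-processing)."""
    return [u for u, _ in groupby(units)]


# Stand-in data: 200 frames of 8-dimensional "features" (hypothetical).
rng = random.Random(1)
frames = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]

centroids = fit_kmeans(frames, num_units=5)
units = to_units(frames, centroids)
reduced = collapse_repeats(units)
print(units[:20])
print(reduced[:20])
```

The result is a sequence of integer indices, so the translation target looks much like a token sequence in text-based NLP rather than a matrix of continuous values.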

To enable direct speech-to-speech translation, the researchers use self-supervised discrete units as targets for training the direct S2ST system, a setup they call speech-to-unit translation (S2UT). As the graphic below shows, they propose a transformer-based sequence-to-sequence model with a speech encoder and a discrete unit decoder, which also incorporates auxiliary tasks (shown in dashed lines).

S2ST model with discrete units (Source: Meta AI) 

The method was developed using the Fisher Spanish-English speech translation corpus, which consists of 139K sentences from Spanish telephone conversations, with Spanish transcripts and English translations.

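The baseline direct model can also work in a textless setup by using discrete units in the source language as the auxiliary task target, which the researchers say improves performance significantly. On the Fisher dataset, they report an improvement of 6.7 BLEU over a baseline direct S2ST model that predicts spectrogram features.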
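For readers unfamiliar with the metric, here is a simplified sketch of how a BLEU score is computed from clipped n-gram precision and a brevity penalty. It assumes a single reference, whitespace tokenization, and no smoothing, so its numbers will not match standard evaluation tools. The function names and example sentences are hypothetical. For speech output, a common practice (assumed here and not shown) is to transcribe the generated audio with a speech recognizer before scoring.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified single-reference BLEU on a 0-100 scale (no smoothing)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Penalize hypotheses that are shorter than the reference.
    brevity_penalty = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
exact = simple_bleu("the cat sat on the mat", reference)
partial = simple_bleu("the cat sat on the mat today", reference)
unrelated = simple_bleu("completely different words here now", reference)
print(exact, partial, unrelated)
```

Because BLEU is conventionally reported on a 0-100 scale, a figure such as 6.7 BLEU refers to 6.7 points on that scale.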

Tasmia Ansari
Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.
