Last updated June 14, 2022

Meta develops a novel speech-to-speech translation system without intermediate text generation

Instead of spectrograms, the researchers used discretized speech units obtained from the clustering of self-supervised speech representations.


Published on June 14, 2022

by Tasmia Ansari


Meta AI has introduced a direct speech-to-speech translation (S2ST) approach that does not rely on text generation as an intermediate step, which enables faster inference and supports translation between unwritten languages. According to Meta, the method outperforms previous direct S2ST approaches and is the first direct S2ST system trained on real-world, open-sourced audio data, rather than synthetic audio, for multiple language pairs.

How does it work?

Recent speech-to-speech modeling work takes the same approach as traditional text-to-speech synthesis. These models directly translate source speech into target speech spectrograms, which represent the frequency spectrum of the audio over time as multidimensional continuous-valued vectors. Training translation models with spectrograms as the target can be difficult, however, because the models must learn several different aspects of the relationship between two languages at once: how the languages align with one another, for example, and how their acoustic and linguistic characteristics compare.

Instead of spectrograms, the researchers used discretized speech units obtained from the clustering of self-supervised speech representations. Compared with spectrograms, discrete units can disentangle linguistic content from prosodic speech information and take advantage of existing natural language processing modeling techniques. Using discretized speech units, the researchers report three notable advancements: the S2ST system outperforms previous direct S2ST systems; it is the first direct S2ST system trained on real S2ST data for multiple language pairs; and it leverages pre-training with unlabeled speech data.
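To make the idea of discretized speech units concrete, here is a minimal sketch in plain Python. It is not Meta's implementation: the frame vectors are random stand-ins for self-supervised speech representations, the cluster count and all function names are hypothetical, and the final step that merges consecutive repeated units is an illustrative assumption that the article does not describe.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame, centroids):
    """Index of the centroid closest to one feature frame."""
    return min(range(len(centroids)), key=lambda j: sq_dist(frame, centroids[j]))


def fit_kmeans(frames, k, iters=10, seed=0):
    """Plain k-means over feature frames; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for f in frames:
            buckets[nearest(f, centroids)].append(f)
        for j, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def quantize(frames, centroids):
    """Map each continuous frame to a discrete unit (its cluster index)."""
    return [nearest(f, centroids) for f in frames]


def collapse_repeats(units):
    """Merge consecutive duplicate units (assumed, optional step)."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out


# Synthetic stand-in for self-supervised features: 120 frames of 8 dimensions.
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(120)]
centroids = fit_kmeans(frames, k=4)  # k=4 is an arbitrary choice for the demo
units = quantize(frames, centroids)
print(units[:10], collapse_repeats(units)[:10])
```

The point of the sketch is the change of target type: each continuous frame becomes a single integer from a small vocabulary, so a translation model can predict a sequence of symbols, much as it would predict text tokens.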

To facilitate direct speech-to-speech translation with discrete units, the researchers use self-supervised discrete units as targets (speech-to-unit translation, or S2UT) for training the direct S2ST system. As shown in the graphic below, they propose a transformer-based sequence-to-sequence model with a speech encoder and a discrete unit decoder that incorporates auxiliary tasks (shown in dashed lines).

S2ST model with discrete units (Source: Meta AI)
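The role of the auxiliary tasks can be illustrated with a small sketch of a combined training objective. This is an assumption about how such losses are commonly combined, not the configuration Meta used: the function names, the simple weighted sum, and the weight value are all hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one integer target under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def sequence_loss(step_logits, targets):
    """Mean cross-entropy over a sequence of predicted discrete units."""
    losses = [cross_entropy(l, t) for l, t in zip(step_logits, targets)]
    return sum(losses) / len(losses)


def multitask_loss(unit_loss, aux_losses, aux_weight=1.0):
    """Main unit-prediction loss plus weighted auxiliary losses (assumed form)."""
    return unit_loss + aux_weight * sum(aux_losses)


# Toy decoder outputs over a 4-unit vocabulary for a 3-step target sequence.
step_logits = [[2.0, 0.1, 0.1, 0.1], [0.1, 2.0, 0.1, 0.1], [0.1, 0.1, 0.1, 2.0]]
target_units = [0, 1, 3]
unit_loss = sequence_loss(step_logits, target_units)

# Placeholder auxiliary losses; in practice each would come from its own task head.
total = multitask_loss(unit_loss, aux_losses=[0.8, 0.5], aux_weight=0.5)
print(round(unit_loss, 4), round(total, 4))
```

Because the decoder predicts symbols from a fixed unit vocabulary, the main objective is an ordinary classification loss per step, which is one reason discrete targets can be easier to train against than continuous spectrogram frames.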

The method was developed using the Fisher Spanish-English speech translation corpus, which consists of 139K sentences from telephone conversations in Spanish, together with Spanish transcriptions and English text translations.

The direct model can also work in a textless setup by using discrete units in the source language as the auxiliary task target. On the Fisher dataset, the researchers report an improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features.


Tasmia Ansari

Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.