Meta AI has introduced a direct speech-to-speech translation (S2ST) approach that does not rely on text generation as an intermediate step, which enables faster inference and supports translation between unwritten languages. The method outperforms previous approaches and is the first direct S2ST system trained on real-world, open-source audio data instead of synthetic audio for multiple language pairs.
How does it work
Recent speech-to-speech modeling work takes the same approach as traditional text-to-speech synthesis. These models directly translate source speech into target speech spectrograms, which represent the spectrum of frequencies as multidimensional continuous-value vectors. Training translation models with speech spectrograms as the target can be difficult, however, because the models must learn several different aspects of the relationship between two languages, such as how the languages align with one another and how their acoustic and linguistic characteristics compare.
Instead of spectrograms, the researchers used discretized speech units obtained by clustering self-supervised speech representations. Compared with spectrograms, discrete units can disentangle linguistic content from prosodic speech information and take advantage of existing natural language processing modeling techniques. Using discretized speech units, the researchers report three notable advancements: their S2ST system outperforms previous direct S2ST systems; it is the first direct S2ST system trained on real S2ST data for multiple language pairs; and it leverages pre-training with unlabeled speech data.
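To make the idea of discretized units concrete, here is a minimal sketch of the assignment step: each frame of continuous features is mapped to the index of its nearest cluster centroid. The 2-D features and the three centroids are toy values chosen for illustration; in a real system the features would come from a self-supervised speech model and the centroids from a clustering run over many hours of audio. The `collapse_repeats` step is an assumption added for illustration and is not described in this article.

```python
from math import dist


def assign_units(frames, centroids):
    """Map each feature frame to the index of its nearest centroid."""
    units = []
    for frame in frames:
        nearest = min(range(len(centroids)), key=lambda k: dist(frame, centroids[k]))
        units.append(nearest)
    return units


def collapse_repeats(units):
    """Merge runs of identical consecutive units (illustrative, assumed step)."""
    collapsed = []
    for unit in units:
        if not collapsed or collapsed[-1] != unit:
            collapsed.append(unit)
    return collapsed


# Toy data: hypothetical 2-D features and three hypothetical centroids.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
frames = [(0.1, -0.1), (0.2, 0.1), (0.9, 1.2), (1.1, 0.8), (0.1, 1.9)]

units = assign_units(frames, centroids)   # one discrete unit per frame
reduced = collapse_repeats(units)         # shorter sequence without repeats
print(units, reduced)
```

The result is a sequence of integers rather than continuous vectors, which is what allows a translation model to treat its output like tokens in a text vocabulary.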
To facilitate direct speech-to-speech translation with discrete units, the researchers use self-supervised discrete units as targets (speech-to-unit translation, or S2UT) for training the direct S2ST system. The graphic below shows the proposed transformer-based sequence-to-sequence model, which has a speech encoder and a discrete unit decoder and incorporates auxiliary tasks (shown in dashed lines).
S2ST model with discrete units (Source: Meta AI)
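One common way to combine a main objective with auxiliary tasks is a weighted sum of losses. The sketch below assumes that form: a cross-entropy loss over the predicted unit sequence plus a weighted auxiliary loss. The probabilities, the two-step sequences, and the `aux_weight` value are hypothetical, and the article does not state how Meta AI weights or defines its auxiliary losses.

```python
import math


def cross_entropy(step_probs, targets):
    """Average negative log-likelihood of the target index at each decoder step."""
    total = sum(math.log(probs[t]) for probs, t in zip(step_probs, targets))
    return -total / len(targets)


def multitask_loss(unit_loss, aux_losses, aux_weight=0.5):
    """Main unit-prediction loss plus weighted auxiliary losses (assumed form)."""
    return unit_loss + aux_weight * sum(aux_losses)


# Hypothetical decoder output: a distribution over 3 units at each of 2 steps.
unit_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
unit_targets = [0, 1]
unit_loss = cross_entropy(unit_probs, unit_targets)

# Hypothetical auxiliary task with its own small vocabulary and targets.
aux_probs = [[0.6, 0.4], [0.3, 0.7]]
aux_targets = [0, 1]
aux_loss = cross_entropy(aux_probs, aux_targets)

total_loss = multitask_loss(unit_loss, [aux_loss])
print(unit_loss, aux_loss, total_loss)
```

Because the decoder predicts discrete units, the main objective is an ordinary classification loss at each step, which is far simpler than regressing continuous spectrogram frames.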
The method was developed using the Fisher Spanish-English speech translation corpus, which consists of 139K sentences from Spanish telephone conversations with corresponding Spanish transcriptions and English translations.
The baseline direct model can work in a textless setup by using discrete units in the source language as the auxiliary task target. This helps achieve a significant improvement, with a reported figure of 6.7 BLEU.
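For readers unfamiliar with the metric, BLEU scores a candidate translation by its n-gram overlap with a reference translation. The sketch below is a simplified, unsmoothed, single-reference version written for illustration; it is not the evaluation script used by the researchers. Speech output is typically transcribed to text before scoring, a step this sketch leaves out. Scores are usually reported on a 0 to 100 scale, so 6.7 BLEU corresponds to 0.067 on the 0 to 1 scale used here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU: clipped n-gram precision and brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each n-gram's count by how often it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Penalize hypotheses that are shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)


exact = bleu("the cat sat on the mat", "the cat sat on the mat")
partial = bleu("the cat sat on the mat", "the cat sat on a mat")
disjoint = bleu("hello world again today", "the cat sat on the mat")
print(exact, partial, disjoint)
```

An exact match scores highest, a partial match falls in between, and a sentence with no shared words scores zero.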