Google Upgrades Translatotron, Its Speech-to-Speech Translation Model

Google claims the revised version can successfully transfer voice even when the input speech consists of multiple speakers.

Google AI has introduced the second version of Translatotron, its speech-to-speech translation (S2ST) model that directly translates speech from one language to another without the need for multiple intermediary subsystems.

S2ST systems are conventionally built as a cascade of speech recognition, machine translation, and speech synthesis subsystems. Because of this, cascade systems suffer from potentially longer latency, loss of information, and compounding errors between the subsystems.
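
As a rough illustration of that cascade (not Google's code), the three subsystems can be thought of as chained together, with hypothetical asr(), translate(), and tts() functions standing in for the real components:

```python
# A minimal sketch of a cascade S2ST pipeline; asr(), translate() and tts()
# are hypothetical stand-ins for the three subsystems described above.
def cascade_s2st(source_audio, src_lang: str, tgt_lang: str):
    text = asr(source_audio, lang=src_lang)            # speech recognition
    translated = translate(text, src_lang, tgt_lang)   # machine translation
    # Each stage consumes only the previous stage's output, so recognition
    # errors propagate into translation and synthesis, and latency adds up.
    return tts(translated, lang=tgt_lang)              # speech synthesis
```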

To address this, Google released Translatotron in 2019, an end-to-end speech-to-speech translation model that the tech giant claimed was the first framework to directly translate speech from one language into speech in another.

This single sequence-to-sequence model produced synthesised translations that kept the sound of the original speaker’s voice intact. But despite its ability to produce human-like speech, it underperformed compared to a strong baseline cascade S2ST system.

Translatotron 2

In response, Google introduced ‘Translatotron 2’, an updated version of the model with improved performance and a new method for transferring the source voice to the translated speech. In addition, Google claims the revised version can successfully transfer voice even when the input speech contains multiple speakers. Experiments on three corpora confirmed that Translatotron 2 significantly outperforms the original Translatotron on translation quality, speech naturalness, and speech robustness.

The model also aligns better with AI principles and is more secure against potential misuse. For example, in response to concerns about deepfakes created with the original Translatotron, Google’s paper states, “The trained model is restricted to retain the source speaker’s voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker’s voice, making the model more robust for production deployment, by mitigating potential misuse for creating spoofing audio artefacts.”

Architecture

Main components of Translatotron 2:

  • A speech encoder
  • A target phoneme decoder
  • A target speech synthesiser
  • An attention module – connecting all the components

The encoder, the attention module, and the decoder together follow the architecture of a direct speech-to-text translation model. The synthesiser is then conditioned on the combined output of the attention module and the decoder.
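
As a rough sketch of how these pieces fit together (using placeholder callables rather than Google's actual implementation), the data flow might look like this:

```python
# Illustrative data flow only; encoder, phoneme_decoder, attention and
# synthesiser are placeholder callables, not Google's implementation.
def translatotron2_forward(source_spectrogram, encoder, phoneme_decoder,
                           attention, synthesiser):
    encoded = encoder(source_spectrogram)              # speech encoder
    # The target phoneme decoder attends to the encoded source speech
    # through the shared attention module.
    phonemes, context = phoneme_decoder(encoded, attention)
    # The synthesiser is conditioned on the decoder output together with
    # the attention context, and produces the translated spectrogram.
    return synthesiser(phonemes, context)
```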

The model architecture by Google.

How are the two models different?

  • The conditioning difference: In Translatotron 2, the output of the target phoneme decoder is fed into the spectrogram synthesiser, which makes the model easier to train while yielding better performance. The previous model used this output only as an auxiliary loss. 
  • Spectrogram synthesiser: In Translatotron 2, the spectrogram synthesiser is ‘duration based’, improving the robustness of the speech (a toy sketch of this idea follows the list). The previous model used an ‘attention based’ spectrogram synthesiser, which is known to suffer from robustness issues. 
  • Attention driving: While both models use an attention-based connection to the encoded source speech, in Translatotron 2 this attention is driven by the phoneme decoder. This ensures that the acoustic information seen by the spectrogram synthesiser is aligned with the translated content being synthesised, and that each speaker’s voice is retained. 
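
To make the ‘duration based’ point concrete, here is a toy sketch, with all shapes and numbers invented for illustration: each phoneme-level representation is repeated for its predicted number of output frames, so the alignment between phonemes and spectrogram frames is explicit rather than learned through attention at synthesis time.

```python
import numpy as np

# Toy illustration of duration-based upsampling; all shapes and values are
# invented for this example and are not taken from Google's model.
def upsample_by_duration(phoneme_reprs: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """phoneme_reprs: (num_phonemes, dim); durations: predicted frames per phoneme."""
    return np.repeat(phoneme_reprs, durations, axis=0)  # (total_frames, dim)

reprs = np.random.randn(4, 8)        # 4 phonemes, 8-dim representations
durations = np.array([3, 5, 2, 4])   # predicted frame counts per phoneme
print(upsample_by_duration(reprs, durations).shape)  # (14, 8): one row per frame
```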

To ensure the model cannot create deepfakes the way the original Translatotron could, Translatotron 2 uses only a single speech encoder, which serves both linguistic understanding and voice capture while preventing the reproduction of non-source voices. Furthermore, the team used a modified version of PnG NAT to train the model to retain speaker voices across translation. PnG NAT is a TTS model capable of cross-lingual voice transfer, used here to synthesise training targets. Google’s modified version additionally includes a separately trained speaker encoder, which allows Translatotron 2 to perform zero-shot voice transfer.
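
A heavily simplified sketch of that training-target recipe, with hypothetical speaker_encoder() and png_nat_tts() functions standing in for the separately trained speaker encoder and the modified PnG NAT model, might look like this:

```python
# Hypothetical sketch of how training targets in the source speaker's voice
# could be produced; speaker_encoder() and png_nat_tts() are stand-ins, not
# real APIs from the paper or any library.
def make_training_target(source_audio, translated_text):
    speaker_embedding = speaker_encoder(source_audio)   # embed the source voice
    # Synthesise the translation in the source speaker's voice, so the S2ST
    # model only ever learns to reproduce the voice it hears in the input.
    return png_nat_tts(translated_text, speaker_embedding)
```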

ConcatAug

ConcatAug is Google’s proposed concatenation-based data augmentation technique, which enables the model to retain each speaker’s voice in the translated speech when the input speech contains multiple speakers.

ConcatAug “augments the training data on the fly by randomly sampling pairs of training examples and concatenating the source speech, the target speech, and the target phoneme sequences into new training examples,” according to the team. The resulting examples contain two speakers’ voices in both the source and the target speech, and the model learns from these examples. 
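
A simplified sketch of that idea follows; the field names and the use of NumPy arrays for audio are assumptions for illustration, and note that the paper applies the augmentation on the fly during training rather than precomputing examples as done here.

```python
import random
import numpy as np

# Simplified ConcatAug sketch: sample pairs of training examples and
# concatenate their source speech, target speech and target phoneme
# sequences into new examples. Field names and types are assumptions.
def concat_aug(dataset, num_new_examples, seed=0):
    rng = random.Random(seed)
    augmented = []
    for _ in range(num_new_examples):
        a, b = rng.sample(dataset, 2)  # randomly sample a pair of examples
        augmented.append({
            "source_speech": np.concatenate([a["source_speech"], b["source_speech"]]),
            "target_speech": np.concatenate([a["target_speech"], b["target_speech"]]),
            "target_phonemes": a["target_phonemes"] + b["target_phonemes"],
        })
    return augmented
```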

Performance 

The performance tests verified that Translatotron 2 outperforms the original Translatotron by large margins in translation quality, speech naturalness, and speech robustness. Notably, the model also excelled on the Fisher corpus, a challenging Spanish-to-English translation benchmark, where its translation quality and speech quality approach those of a strong baseline cascade system. 

Source language          fr      de      es      ca
Translatotron 2          27.0    18.8    27.7    22.5
Translatotron            18.9    10.8    18.8    13.9
ST (Wang et al. 2020)    27.0    18.9    28.0    23.9
Training target          82.1    86.0    85.1    89.3

Translation quality (BLEU) on the CoVoST 2 corpus, by source language.

Source: Google

Additionally, beyond Spanish-to-English S2ST, the model was evaluated in a multilingual setup in which the input speech came from four different languages with no indication of which language was being spoken. The model successfully identified the source language and translated the speech into English. 

The research team is positive that this mitigation of potential abuse makes Translatotron 2 more suitable for production deployment. 
