Google Upgrades Translatotron, Its Speech-to-Speech Translation Model

Google claims the revised version can successfully transfer voice even when the input speech consists of multiple speakers.

Google AI has introduced the second version of Translatotron, its speech-to-speech translation (S2ST) model that directly translates speech between two languages without the chain of intermediate subsystems used by conventional approaches.

Conventional S2ST systems are built as a cascade of speech recognition, machine translation, and speech synthesis subsystems. As a result, these cascade systems suffer from potentially longer latency, loss of information between stages, and compounding errors across subsystems.

To address this, Google released Translatotron in 2019, an end-to-end speech-to-speech translation model that the tech giant claimed was the first framework to directly translate speech from one language into speech in another.

The single sequence-to-sequence model produced synthesised translations while keeping the sound of the original speaker's voice intact. But despite its ability to automatically produce human-like speech, it underperformed a strong baseline cascade S2ST system.

Translatotron 2

In response, Google introduced ‘Translatotron 2’, an updated version of the model with improved performance and a new method for transferring the voice to the translated speech. In addition, Google claims the revised version can successfully transfer voice even when the input speech consists of multiple speakers. Experiments on three corpora validated that Translatotron 2 significantly outperforms the original Translatotron in translation quality, speech naturalness, and speech robustness.

The model also aligns better with AI principles and is designed to prevent potential misuse. For example, in response to deepfakes created with the original Translatotron, Google’s paper states, “The trained model is restricted to retain the source speaker’s voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker’s voice, making the model more robust for production deployment, by mitigating potential misuse for creating spoofing audio artefacts.”

Architecture

Main components of Translatotron 2:

  • A speech encoder
  • A target phoneme decoder
  • A target speech synthesiser
  • An attention module – connecting all the components

The first three components follow the architecture of a direct speech-to-text translation model: the encoder, the attention module, and the decoder. The synthesiser is then conditioned on the outputs of both the attention module and the decoder.

The model architecture by Google.
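As a rough, illustrative sketch of the wiring described above, the four components can be chained as follows. The dimensions, random placeholder weights, and function names here are toy assumptions for illustration, not Google’s actual implementation; the key point is that the synthesiser takes both the decoder output and the attention context as input.

```python
import numpy as np

rng = np.random.default_rng(0)

def speech_encoder(source_mel):            # (frames, n_mels) -> (frames, d)
    W = rng.standard_normal((source_mel.shape[1], 64))
    return np.tanh(source_mel @ W)

def attention(queries, encoded):           # single-head dot-product attention
    scores = queries @ encoded.T / np.sqrt(encoded.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ encoded               # acoustic context per decoder step

def phoneme_decoder(context):              # (steps, d) -> phoneme logits
    W = rng.standard_normal((context.shape[1], 40))   # 40 = toy phoneme set
    return context @ W

def speech_synthesiser(decoder_out, context):
    # Conditioned on BOTH the decoder output and the attention context,
    # as the article describes for Translatotron 2.
    joint = np.concatenate([decoder_out, context], axis=1)
    W = rng.standard_normal((joint.shape[1], 80))     # 80-bin mel spectrogram
    return joint @ W

source_mel = rng.standard_normal((120, 80))    # 120 input frames
encoded = speech_encoder(source_mel)
queries = rng.standard_normal((30, 64))        # 30 toy decoder states
context = attention(queries, encoded)          # attention driven by the decoder
phoneme_logits = phoneme_decoder(context)
target_mel = speech_synthesiser(phoneme_logits, context)
print(target_mel.shape)                        # one spectrogram frame per step
```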

How are the two models different?

  • The conditioning difference: In Translatotron 2, the output from the target phoneme decoder is an input to the spectrogram synthesiser, which makes the model easier to train while yielding better performance. The original model used this output only as an auxiliary loss. 
  • Spectrogram synthesiser: In Translatotron 2, the spectrogram synthesiser is duration-based, improving the robustness of the speech. The original model’s attention-based spectrogram synthesiser is known to suffer robustness issues. 
  • Attention driving: While both models use an attention-based connection to the encoded source speech, in Translatotron 2 this attention is driven by the phoneme decoder. This ensures that the acoustic information seen by the spectrogram synthesiser is aligned with the translated content being synthesised, which also helps retain each speaker’s voice. 
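The duration-based idea in the second point can be illustrated with a toy upsampler: each phoneme’s representation is repeated for its predicted number of output frames, so the alignment is monotonic by construction and cannot skip or loop the way attention-based alignment can. This is a generic sketch of duration-based synthesis, not Google’s exact synthesiser.

```python
import numpy as np

def duration_upsample(phoneme_reprs, durations):
    # phoneme_reprs: (n_phonemes, d); durations: predicted frames per phoneme.
    # Repeating each row `durations[i]` times yields a frame-level sequence
    # whose alignment to the phonemes is monotonic by construction.
    return np.repeat(phoneme_reprs, durations, axis=0)

reprs = np.arange(6).reshape(3, 2)        # 3 phonemes, each a d=2 vector
frames = duration_upsample(reprs, [2, 1, 3])
print(frames.shape)                       # (6, 2): 2 + 1 + 3 frames
```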

To ensure the model cannot create deepfakes, as was possible with the original Translatotron, Translatotron 2 uses only a single speech encoder to retain the speaker’s voice. This encoder serves both linguistic understanding and voice capture, while preventing the reproduction of non-source voices. Furthermore, the team trained the model to retain speaker voices across translation using a modified version of PnG NAT, a TTS model capable of cross-lingual voice transfer for synthesising training targets. Google’s modified version includes a separately trained speaker encoder, which gives Translatotron 2 zero-shot voice transfer capability.

ConcatAug

ConcatAug is Google’s proposed concatenation-based data augmentation technique, which enables the model to retain each speaker’s voice in the translated speech when the input contains multiple speakers.

ConcatAug “augments the training data on the fly by randomly sampling pairs of training examples and concatenating the source speech, the target speech, and the target phoneme sequences into new training examples,” according to the team. The resulting examples contain two speakers’ voices in both the source and the target speech, and the model learns from these examples to handle multiple speakers. 
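The quoted procedure can be sketched in a few lines. The data structures below are placeholders (lists of numbers standing in for audio frames and phoneme IDs, with hypothetical field names), intended only to show the sampling-and-concatenation step, not Google’s training pipeline.

```python
import random

def concat_aug(dataset, rng=random):
    # Sample two distinct training examples and concatenate their source
    # speech, target speech, and target phoneme sequences into one new example.
    a, b = rng.sample(dataset, 2)
    return {
        "source_speech": a["source_speech"] + b["source_speech"],
        "target_speech": a["target_speech"] + b["target_speech"],
        "target_phonemes": a["target_phonemes"] + b["target_phonemes"],
    }

data = [
    {"source_speech": [1, 2], "target_speech": [10],     "target_phonemes": ["a"]},
    {"source_speech": [3],    "target_speech": [20, 21], "target_phonemes": ["b"]},
]
augmented = concat_aug(data)
print(len(augmented["source_speech"]))   # 3: both speakers' audio in one example
```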

Performance 

The performance tests verified that Translatotron 2 outperforms the original Translatotron by large margins in translation quality, speech naturalness, and speech robustness. Notably, the model also excelled on the Fisher corpus, a challenging Spanish-to-English translation test, where its translation quality and speech quality approach those of a strong baseline cascade system. 

Source Language         fr     de     es     ca
Translatotron 2        27.0   18.8   27.7   22.5
Translatotron          18.9   10.8   18.8   13.9
ST (Wang et al. 2020)  27.0   18.9   28.0   23.9
Training Target        82.1   86.0   85.1   89.3

Performance on the CoVoST 2 corpus.

Source: Google

Additionally, beyond Spanish-to-English S2ST, the model was evaluated in a multilingual setup in which the input speech could be in any of four languages, with no language identifier provided. The model successfully detected the source language and translated the speech into English. 

The research team believes that these mitigations against potential abuse make Translatotron 2 more suitable for production deployment. 

Avi Gopani
Avi Gopani is a technology journalist who analyses industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories that are curated with a focus on the evolving technologies of artificial intelligence and data analytics.
