Google AI has introduced the second version of Translatotron, its speech-to-speech translation (S2ST) model that translates speech directly from one language to another without chaining together several intermediate subsystems.
S2ST systems have conventionally been built as cascades of speech recognition, machine translation, and speech synthesis subsystems. As a result, such cascade systems suffer from longer latency, loss of information, and compounding errors between subsystems.
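To make the cascade structure concrete, the snippet below is a minimal, hypothetical Python sketch of such a pipeline; the three stage functions are stand-ins rather than any real API, and the comments mark where errors introduced by one subsystem propagate to the next.

```python
# Hypothetical cascade S2ST pipeline; the three stages are stand-ins, not real models.

def speech_recognition(audio: bytes) -> str:
    return "hola mundo"             # stand-in for an ASR subsystem

def machine_translation(text: str) -> str:
    return "hello world"            # stand-in for an MT subsystem

def speech_synthesis(text: str) -> bytes:
    return text.encode("utf-8")     # stand-in for a TTS subsystem

def cascade_s2st(source_audio: bytes) -> bytes:
    transcript = speech_recognition(source_audio)    # errors introduced here...
    translation = machine_translation(transcript)    # ...propagate here...
    return speech_synthesis(translation)             # ...and end up in the output
```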
To address this, Google released Translatotron in 2019, an end-to-end speech-to-speech translation model that the tech giant claimed was the first framework able to directly translate speech in one language into speech in another.
The single sequence-to-sequence model produced synthesised translations that kept the sound of the original speaker’s voice intact. But despite its ability to automatically produce human-like speech, it underperformed a strong baseline cascade S2ST system.

Translatotron 2
In response, Google introduced ‘Translatotron 2’, an updated model with improved performance and a new method for transferring the source voice to the translated speech. In addition, Google claims the revised version can transfer voice successfully even when the input speech contains multiple speakers. Tests on three corpora confirmed that Translatotron 2 significantly outperforms the original Translatotron on translation quality, speech naturalness, and speech robustness.
The model also aligns better with Google’s AI principles and is designed to prevent potential misuse. For example, in response to deepfakes being created with the original Translatotron, Google’s paper states, “The trained model is restricted to retain the source speaker’s voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker’s voice, making the model more robust for production deployment, by mitigating potential misuse for creating spoofing audio artefacts.”
Architecture
Main components of Translatotron 2:
- A speech encoder
- A target phoneme decoder
- A target speech synthesiser
- An attention module – connecting all the components
The architecture follows that of a direct speech-to-text translation model, with the encoder, the attention module, and the decoder. In addition, the synthesiser here is conditioned on the outputs of both the attention module and the decoder.
The model architecture by Google.
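As a rough illustration of this data flow, below is a hypothetical PyTorch sketch of the four components and of how the synthesiser is conditioned on both the attention output and the phoneme-decoder output. Layer types, sizes, and the non-autoregressive shortcut for the decoder are simplifications for readability, not Google’s implementation.

```python
import torch
import torch.nn as nn

class Translatotron2Sketch(nn.Module):
    """Highly simplified, hypothetical sketch of the four components described above."""

    def __init__(self, n_mels=80, d_model=256, n_phonemes=128):
        super().__init__()
        self.speech_encoder = nn.LSTM(n_mels, d_model, batch_first=True)   # speech encoder
        self.attention = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.phoneme_decoder = nn.LSTM(d_model, d_model, batch_first=True) # target phoneme decoder
        self.phoneme_proj = nn.Linear(d_model, n_phonemes)
        # The synthesiser sees the attention context *and* the decoder output.
        self.synthesizer = nn.Linear(2 * d_model, n_mels)                  # target speech synthesiser

    def forward(self, source_mels, target_steps):
        encoded, _ = self.speech_encoder(source_mels)           # encode source speech
        # Simplification: the real decoder is autoregressive and drives the attention.
        queries = encoded.new_zeros(encoded.size(0), target_steps, encoded.size(-1))
        context, _ = self.attention(queries, encoded, encoded)  # attend to source speech
        decoded, _ = self.phoneme_decoder(context)              # decode target phonemes
        phoneme_logits = self.phoneme_proj(decoded)             # translated phoneme predictions
        # Synthesiser conditioned on both attention output and decoder output.
        target_mels = self.synthesizer(torch.cat([context, decoded], dim=-1))
        return phoneme_logits, target_mels


model = Translatotron2Sketch()
phonemes, mels = model(torch.randn(2, 120, 80), target_steps=60)
```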
How are the two models different?
- The conditioning difference: In Translatotron 2, the output of the target phoneme decoder is an input to the spectrogram synthesiser, which makes the model easier to train and yields better performance. The original model used this output only as an auxiliary loss.
- Spectrogram synthesiser: In Translatotron 2, the spectrogram synthesiser is duration-based, which improves the robustness of the synthesised speech (see the sketch after this list). The original model used an attention-based spectrogram synthesiser, which is known to suffer from robustness issues.
- Attention driving: While both models use an attention-based connection to the encoded source speech, in Translatotron 2 the attention is driven by the phoneme decoder. This ensures that the acoustic information seen by the spectrogram synthesiser is aligned with the translated content being synthesised and helps retain each speaker’s voice.
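To illustrate what ‘duration based’ means in practice, here is a hypothetical sketch in the spirit of non-attentive TTS synthesisers: each phoneme-level representation is assigned a predicted duration and repeated for that many frames, so the alignment is explicit and no attention step can skip or repeat words. The actual synthesiser is more sophisticated; this only shows the core idea.

```python
import torch
import torch.nn as nn

class DurationBasedSynthesizerSketch(nn.Module):
    """Hypothetical illustration of duration-based synthesis: upsample each
    phoneme-level representation by a predicted integer duration, then project
    the frame-level sequence to mel frames. Sizes are arbitrary."""

    def __init__(self, d_model=256, n_mels=80):
        super().__init__()
        self.duration_predictor = nn.Linear(d_model, 1)   # frames per phoneme
        self.frame_decoder = nn.Linear(d_model, n_mels)   # frame-level projection

    def forward(self, phoneme_states):                    # (num_phonemes, d_model)
        durations = self.duration_predictor(phoneme_states).squeeze(-1)
        frames_per_phoneme = durations.exp().round().clamp(min=1).long()
        # Repeat each phoneme state for its predicted number of frames:
        # the alignment is explicit, so nothing is skipped or repeated spuriously.
        upsampled = torch.repeat_interleave(phoneme_states, frames_per_phoneme, dim=0)
        return self.frame_decoder(upsampled)              # (num_frames, n_mels)


synth = DurationBasedSynthesizerSketch()
mel_frames = synth(torch.randn(12, 256))
```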
To ensure the model cannot create deepfakes the way the original Translatotron could, Translatotron 2 uses only a single speech encoder, which serves both linguistic understanding and voice capture while preventing the reproduction of non-source voices. Furthermore, the team used a modified version of PnG NAT, a TTS model capable of cross-lingual voice transfer, to synthesise training targets in which the speaker’s voice is retained across translation. Google’s modified PnG NAT additionally includes a separately trained speaker encoder, enabling zero-shot voice transfer.
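The general idea behind a separately trained speaker encoder enabling zero-shot voice transfer, as in the modified PnG NAT used to synthesise the training targets, can be sketched as follows. Everything here is hypothetical and simplified: a frozen speaker encoder turns the source utterance into a fixed embedding, which is concatenated with the linguistic features fed to the synthesiser, so a voice never seen in training can still be reproduced from a single utterance.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins; not Google's PnG NAT code.
speaker_encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 64))
for p in speaker_encoder.parameters():
    p.requires_grad = False            # separately trained, then frozen

synthesizer_input_proj = nn.Linear(256 + 64, 256)

def condition_on_source_voice(linguistic_states, source_mels):
    # One embedding per utterance, averaged over time (zero-shot: no enrolment data).
    spk = speaker_encoder(source_mels).mean(dim=1, keepdim=True)    # (B, 1, 64)
    spk = spk.expand(-1, linguistic_states.size(1), -1)             # broadcast over frames
    return synthesizer_input_proj(torch.cat([linguistic_states, spk], dim=-1))

out = condition_on_source_voice(torch.randn(2, 60, 256), torch.randn(2, 120, 80))
```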
ConcatAug
ConcatAug is Google’s proposed concatenation-based data augmentation technique, which enables the model to retain each speaker’s voice in the translated speech when the input speech contains multiple speakers.
ConcatAug “augments the training data on the fly by randomly sampling pairs of training examples and concatenating the source speech, the target speech, and the target phoneme sequences into new training examples,” according to the team. The results then contain two speakers’ voices in both the source and the target speech, and the model learns further based on these examples.
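Based on that description, ConcatAug can be sketched roughly as follows; the data layout and field names are assumptions for illustration, and padding and length bookkeeping are omitted.

```python
import random
import torch

def concat_aug(batch):
    """Hypothetical sketch of ConcatAug as described above: randomly sample pairs of
    training examples and concatenate their source speech, target speech and target
    phoneme sequences into new examples. `batch` is a list of dicts (assumed layout)."""
    augmented = []
    for example in batch:
        other = random.choice(batch)                  # randomly sampled partner example
        augmented.append({
            "source_speech": torch.cat([example["source_speech"], other["source_speech"]], dim=0),
            "target_speech": torch.cat([example["target_speech"], other["target_speech"]], dim=0),
            "target_phonemes": example["target_phonemes"] + other["target_phonemes"],
        })
    return augmented


batch = [
    {"source_speech": torch.randn(100, 80), "target_speech": torch.randn(90, 80),
     "target_phonemes": [3, 7, 12]},
    {"source_speech": torch.randn(80, 80), "target_speech": torch.randn(70, 80),
     "target_phonemes": [5, 2]},
]
augmented = concat_aug(batch)   # each new example now contains two speakers' voices
```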
Performance
Performance tests verified that Translatotron 2 outperforms the original Translatotron by large margins in translation quality, speech naturalness, and speech robustness. Notably, the model also excelled on the Fisher corpus, a challenging Spanish-to-English translation test. Its translation quality and speech quality approach those of a strong baseline cascade system.
Listen to the audio samples here.
| Source Language | fr | de | es | ca |
| --- | --- | --- | --- | --- |
| Translatotron 2 | 27.0 | 18.8 | 27.7 | 22.5 |
| Translatotron | 18.9 | 10.8 | 18.8 | 13.9 |
| ST (Wang et al. 2020) | 27.0 | 18.9 | 28.0 | 23.9 |
| Training Target | 82.1 | 86.0 | 85.1 | 89.3 |
Translation quality (BLEU) on the CoVoST 2 corpus.
Source: Google
Additionally, along with Spanish-to-English S2ST, the model was evaluated in a multilingual setup in which the input speech could be in any of four different languages, with no indication of which language was being spoken. The model still successfully detected and translated the speech into English.
With the potential for abuse mitigated, the research team is confident this makes Translatotron 2 more suitable for production deployment.