Google Releases New Version of Translatotron: Its End-to-end Speech Translation Model

The Translatotron 2 model consists of a source speech encoder, a target phoneme decoder, and a target mel-spectrogram synthesiser.

Google released Translatotron, an end-to-end speech-to-speech translation model, in 2019. The tech giant claimed the single sequence-to-sequence model was the first end-to-end framework to directly translate speech in one language into speech in another.

The system could synthesise translated speech that retained the sound of the original speaker’s voice. But this feature had the potential to be misused to generate speech in a different speaker’s voice and create deepfake audio.

This month, researchers at Google published a paper detailing ‘Translatotron 2’, an updated version that addresses the deepfake problem. According to the paper, “The trained model is restricted to retain the source speaker’s voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker’s voice, making the model more robust for production deployment, by mitigating potential misuse for creating spoofing audio artefacts.”



Additionally, the newer model outperformed the original Translatotron by “a large margin” in translation quality and predicted speech naturalness. It also improved robustness by reducing babbling and long pauses in the output speech.

As per the paper, Translatotron 2 differs from its predecessor in two significant ways:

1) The output from the auxiliary target phoneme decoder is used as an input to the spectrogram synthesiser.

2) The spectrogram synthesiser is duration-based, while still keeping the benefits of the attention mechanism.

The Translatotron 2 model consists of a source speech encoder, a target phoneme decoder, and a target mel-spectrogram synthesiser, connected by an attention module. It is trained with speech-to-speech and speech-to-phoneme translation objectives. 
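As a rough illustration of what training with both objectives might look like, here is a hypothetical loss sketch in PyTorch; the L1 spectrogram loss, the cross-entropy phoneme loss and the `aux_weight` factor are illustrative assumptions, not the paper’s exact formulation.

```python
import torch.nn.functional as F

def combined_loss(pred_mels, tgt_mels, phoneme_logits, tgt_phonemes, aux_weight=0.3):
    # Speech-to-speech objective: how close the predicted mel-spectrogram is
    # to the target translated speech.
    s2s_loss = F.l1_loss(pred_mels, tgt_mels)
    # Speech-to-phoneme objective: cross-entropy of the auxiliary decoder's
    # phoneme predictions; logits are (batch, T, classes), targets (batch, T).
    s2p_loss = F.cross_entropy(phoneme_logits.transpose(1, 2), tgt_phonemes)
    return s2s_loss + aux_weight * s2p_loss
```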


The encoder creates a numerical representation of the source speech, while the decoder predicts the corresponding phoneme sequence of the translated speech. As the decoder works, the attention module weighs the relevance of each part of the encoded input to the output being generated. The synthesiser then takes the decoder output and the attention module’s output as its input and synthesises the translated speech.
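To make that data flow concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors’ implementation: the module names, layer choices and dimensions are assumptions, autoregressive decoding and the duration modelling are omitted, and the one detail it mirrors from the paper is that the synthesiser consumes both the phoneme decoder’s states and the attention context.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Encodes source mel-spectrogram frames into a sequence of hidden states."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)

    def forward(self, mels):                       # mels: (batch, T_src, n_mels)
        out, _ = self.rnn(mels)                    # (batch, T_src, 2 * hidden)
        return out

class PhonemeDecoder(nn.Module):
    """Attends over the encoded speech and predicts the target phoneme sequence."""
    def __init__(self, enc_dim=512, hidden=256, n_phonemes=100):
        super().__init__()
        self.attn = nn.MultiheadAttention(enc_dim, num_heads=4, batch_first=True)
        self.rnn = nn.LSTM(enc_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_phonemes)

    def forward(self, enc, n_steps):
        # Autoregressive decoding and teacher forcing are omitted; a fixed query
        # is repeated simply to show the shapes flowing through attention.
        query = enc.mean(dim=1, keepdim=True).repeat(1, n_steps, 1)
        context, _ = self.attn(query, enc, enc)    # (batch, n_steps, enc_dim)
        states, _ = self.rnn(context)              # (batch, n_steps, hidden)
        return self.out(states), states, context

class SpectrogramSynthesizer(nn.Module):
    """Predicts target mel frames from decoder states plus the attention context
    (the paper's duration-based modelling is left out for brevity)."""
    def __init__(self, dec_dim=256, enc_dim=512, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(dec_dim + enc_dim, n_mels)

    def forward(self, dec_states, context):
        return self.proj(torch.cat([dec_states, context], dim=-1))

if __name__ == "__main__":
    src = torch.randn(2, 120, 80)                  # two source utterances, 120 frames each
    enc = SpeechEncoder()(src)
    phoneme_logits, dec_states, context = PhonemeDecoder()(enc, n_steps=30)
    target_mels = SpectrogramSynthesizer()(dec_states, context)
    print(phoneme_logits.shape, target_mels.shape) # (2, 30, 100) and (2, 30, 80)
```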


The new voice-retention method prevents the system from generating speech in a different speaker’s voice. Unlike the voice-retention approach in the original Translatotron, the newer model does not rely on explicit speaker IDs; instead, it is trained to retain only the source speaker’s voice in the translated speech.

The research team believes this makes Translatotron 2 more applicable for production deployment by mitigating potential abuse for creating deep fakes or spoofed voices. 

“To enable direct S2ST models to preserve each speaker’s voice for input with speaker turns, we augmented the training data by randomly sampling pairs of training examples and concatenating the source speech, the target speech, and the target phoneme sequences into new training examples,” the researchers wrote in the paper. “The resulting new examples contain two speakers’ voices in both the source and the target speech, which enables the model to learn on examples with speaker turns.” 
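A hypothetical sketch of that augmentation step, assuming a simple in-memory dataset of dicts with `src_audio`, `tgt_audio` and `tgt_phonemes` fields (the field names and data layout are illustrative, not from the paper):

```python
import random
import numpy as np

def make_speaker_turn_examples(dataset, n_new):
    """dataset: list of dicts with 'src_audio' and 'tgt_audio' (1-D float arrays)
    and 'tgt_phonemes' (a list of phoneme symbols)."""
    augmented = []
    for _ in range(n_new):
        a, b = random.sample(dataset, 2)   # randomly sampled pair of training examples
        augmented.append({
            "src_audio": np.concatenate([a["src_audio"], b["src_audio"]]),
            "tgt_audio": np.concatenate([a["tgt_audio"], b["tgt_audio"]]),
            "tgt_phonemes": a["tgt_phonemes"] + b["tgt_phonemes"],
        })
    return augmented
```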

The original Translatotron arrived amid growing concerns over deepfakes. According to Deeptrace, a Dutch startup, the number of deepfakes on the web doubled over nine months in 2019 and grew a further 330% by June 2020. Most recently, the use of Anthony Bourdain’s AI-generated voice in a documentary sparked controversy.

In addition, identifying fakes is getting harder with improving technology. In a recent report, The Brookings Institution outlined the range of political and social dangers deep fakes pose: “Distorting democratic discourse; manipulating elections; eroding trust in institutions; weakening journalism; exacerbating social divisions; undermining public safety; and inflicting hard-to-repair damage on the reputation of prominent individuals, including elected officials and candidates for office.”

Earlier this year, even the FBI warned that deepfakes are a critical emerging threat targeting businesses.

Avi Gopani
Avi Gopani is a technology journalist at Analytics India Magazine who analyses industry trends and developments from an interdisciplinary perspective. Her articles chronicle cultural, political and social stories, curated with a focus on the evolving technologies of artificial intelligence and data analytics.
