Google Releases New Version of Translatotron: Its End-to-end Speech Translation Model

The Translatotron 2 model consists of a source speech encoder, a target phoneme decoder, and a target mel-spectrogram synthesiser.

Google released Translatotron, an end-to-end speech-to-speech translation model, in 2019. The tech giant claimed the single sequence-to-sequence model is the first end-to-end framework to directly translate speech from one language into speech in another language.

The system could synthesise translated speech in the original speaker’s voice, keeping the sound of the speaker intact. But this feature could also be misused to generate speech in a different voice and create deep fake audio.


This month, researchers at Google published a paper detailing ‘Translatotron 2’, an updated version designed to address the deep fake problem. According to the paper, “The trained model is restricted to retain the source speaker’s voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker’s voice, making the model more robust for production deployment, by mitigating potential misuse for creating spoofing audio artefacts.”

Additionally, the newer model outperformed the original Translatotron by “a large margin” in translation quality and predicted speech naturalness. It also improved robustness by cutting down on babbling and long pauses.

As per the paper, Translatotron 2 differs from its predecessor in two significant ways:

1) The output from the auxiliary target phoneme decoder is used as an input to the spectrogram synthesiser.

2) The spectrogram synthesiser is duration-based, while still keeping the benefits of the attention mechanism.
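
To make "duration-based" concrete, the sketch below shows one simple way a synthesiser could turn step-level features into a frame-level sequence: each feature vector is repeated for the number of frames assigned to it. This is an illustrative assumption, not the paper's implementation. The helper `upsample_by_duration`, the toy features, and the hard-coded durations are all hypothetical; a real model would predict the durations and may use a smoother upsampling scheme.

```python
from typing import List, Sequence


def upsample_by_duration(
    features: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each feature vector durations[i] times (hypothetical helper)."""
    if len(features) != len(durations):
        raise ValueError("features and durations must have the same length")
    frames: List[List[float]] = []
    for vector, n_frames in zip(features, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # One copy of the vector per output frame assigned to this step.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Toy input: three step-level feature vectors and assumed frame counts.
phoneme_features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
predicted_durations = [2, 1, 3]

frames = upsample_by_duration(phoneme_features, predicted_durations)
print(len(frames))  # total frames equals sum(predicted_durations)
```

In this sketch the output length is fixed by the durations rather than by a stop decision made during generation, which is the property that distinguishes a duration-based design.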

The Translatotron 2 model consists of a source speech encoder, a target phoneme decoder, and a target mel-spectrogram synthesiser, connected by an attention module. It is trained with speech-to-speech and speech-to-phoneme translation objectives. 

Listen to the audio samples here

For each element that the encoder and decoder process, the attention module weighs the relevance of every other element to generate an output. Together, the encoder and decoder produce a numerical representation and a corresponding phoneme sequence of the translated speech. The synthesiser then takes the outputs of the decoder and the attention module as its input and synthesises the translated voice.
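
The weighting step described above can be sketched with a generic single-query, scaled dot-product attention. This is a minimal illustration under stated assumptions, not the attention module used in Translatotron 2: the function names `softmax` and `attend`, the two-dimensional toy vectors, and the use of the encoder states as both keys and values are all hypothetical choices made for the example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: Sequence[Vector], values: Sequence[Vector]) -> List[float]:
    """Return the attention-weighted average of `values` for one query."""
    scale = math.sqrt(len(query))
    # Relevance of each key to the query: scaled dot product.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Toy encoder outputs (one vector per source frame) and one decoder query.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_query = [1.0, 0.0]

context = attend(decoder_query, encoder_states, encoder_states)
print(context)
```

The resulting `context` vector leans towards the encoder states that align with the query, which is what "weighing the relevance" of each element means in practice.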

The process can be understood with this diagram:

The new method of voice retention prevents the system from generating speech in a different speaker’s voice. Unlike the voice retention approach in the original Translatotron, the newer model does not rely on explicit IDs to identify the speaker. Instead, it is trained to retain only the source speaker’s voice.

The research team believes this makes Translatotron 2 more applicable for production deployment by mitigating potential abuse for creating deep fakes or spoofed voices. 

“To enable direct S2ST models to preserve each speaker’s voice for input with speaker turns, we augmented the training data by randomly sampling pairs of training examples and concatenating the source speech, the target speech, and the target phoneme sequences into new training examples,” the researchers wrote in the paper. “The resulting new examples contain two speakers’ voices in both the source and the target speech, which enables the model to learn on examples with speaker turns.” 
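
The augmentation the researchers describe can be sketched as follows. This is a simplified illustration, assuming speech is held as plain lists of numbers and phonemes as lists of strings. The `Example` class, its field names, and the helpers `concat_pair` and `augment` are hypothetical, and details such as how pairs are chosen in practice are not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One training example; the field names are hypothetical."""
    source_speech: List[float]
    target_speech: List[float]
    target_phonemes: List[str]


def concat_pair(first: Example, second: Example) -> Example:
    """Join two examples end to end, field by field."""
    return Example(
        source_speech=first.source_speech + second.source_speech,
        target_speech=first.target_speech + second.target_speech,
        target_phonemes=first.target_phonemes + second.target_phonemes,
    )


def augment(examples: List[Example], n_new: int, rng: random.Random) -> List[Example]:
    """Build n_new examples by concatenating randomly sampled pairs."""
    return [concat_pair(*rng.sample(examples, 2)) for _ in range(n_new)]


# Toy dataset: short lists stand in for audio and phoneme sequences.
dataset = [
    Example([0.1, 0.2], [0.9], ["o", "l", "a"]),
    Example([0.3], [0.8, 0.7], ["a", "d", "j", "o", "s"]),
    Example([0.4, 0.5, 0.6], [0.6], ["s", "i"]),
]
new_examples = augment(dataset, n_new=2, rng=random.Random(0))
print(len(new_examples))
```

Each new example contains a speaker turn in both the source and the target, with the two halves kept in the same order across all three fields.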

The original Translatotron arrived amid growing concerns over deep fakes. According to Deeptrace, a Dutch startup, the number of deep fakes on the web doubled in nine months in 2019 and rose by 330% from then to June 2020. Most recently, the use of an AI-generated version of Anthony Bourdain’s voice sparked controversy.

In addition, identifying fakes is getting harder with improving technology. In a recent report, The Brookings Institution outlined the range of political and social dangers deep fakes pose: “Distorting democratic discourse; manipulating elections; eroding trust in institutions; weakening journalism; exacerbating social divisions; undermining public safety; and inflicting hard-to-repair damage on the reputation of prominent individuals, including elected officials and candidates for office.”

Earlier this year, even the FBI warned that deep fakes are a critical emerging threat targeting businesses.

Avi Gopani
Avi Gopani is a technology journalist who seeks to analyse industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories that are curated with a focus on the evolving technologies of artificial intelligence and data analytics.
