Google released Translatotron, an end-to-end speech-to-speech translation model, in 2019. The tech giant claimed the single sequence-to-sequence model is the first end-to-end framework to directly translate speech from one language into speech in another language.
The system could synthesise the translated speech in the original speaker’s voice, keeping the sound of the speaker intact across languages. But this feature could also be misused to generate speech in a different speaker’s voice and create deep fake audio.
This month, researchers at Google published a paper detailing ‘Translatotron 2’, an updated version designed to address the deep fake problem. The paper states: “The trained model is restricted to retain the source speaker’s voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker’s voice, making the model more robust for production deployment, by mitigating potential misuse for creating spoofing audio artefacts.”
Translatotron 2: Robust direct speech-to-speech translation
pdf: https://t.co/9IPIWOwWac
samples: https://t.co/TEXw3z59O2
outperforms Translatotron by a large margin in terms of translation quality and predicted speech naturalness pic.twitter.com/dQ97yE9iow
AK (@ak92501), July 20, 2021
Additionally, the newer model outperformed the original Translatotron by “a large margin” in translation quality and predicted speech naturalness. It is also more robust, with less babbling and fewer long pauses.
As per the paper, Translatotron 2 differs from its predecessor in two significant ways:
1) The output from the auxiliary target phoneme decoder is used as an input to the spectrogram synthesiser.
2) The spectrogram synthesiser is duration-based, while still keeping the benefits of the attention mechanism.
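To make the second difference more concrete, here is a minimal Python sketch of what "duration-based" can mean: each phoneme-level state is expanded into a set number of output frames, so the output length is fixed by the durations. The function name `upsample_by_duration`, the toy vectors and the simple repeat strategy are assumptions for illustration only; the paper's synthesiser may perform this step differently.

```python
from typing import List, Sequence


def upsample_by_duration(
    states: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Repeat each per-token state vector once per output frame it should cover."""
    if len(states) != len(durations):
        raise ValueError("need exactly one duration per state")
    frames: List[List[float]] = []
    for state, n_frames in zip(states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # A token with duration 3 contributes three identical frames.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


# Three hypothetical token states lasting 2, 1 and 3 frames respectively.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
frames = upsample_by_duration(states, [2, 1, 3])
print(len(frames))  # 6
```

Because the number of frames always equals the sum of the durations, a synthesiser built this way cannot keep generating indefinitely. That may be related to the reported drop in babbling and long pauses, though this is an inference rather than a claim taken from the paper.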
The Translatotron 2 model consists of a source speech encoder, a target phoneme decoder, and a target mel-spectrogram synthesiser, connected by an attention module. It is trained with speech-to-speech and speech-to-phoneme translation objectives.
For every piece of data the encoder and decoder process, the attention module weighs the relevance of every other piece of data to generate an output. The encoder creates a numerical representation of the source speech, and the decoder predicts the corresponding phoneme sequence of the translated speech. The synthesiser then takes the decoder and attention module outputs as its input and synthesises the translated voice.
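The weighting step can be sketched in a few lines of Python. This is generic scaled dot-product attention over toy vectors, assuming the decoder state acts as the query and the encoder outputs act as keys and values. The names `softmax` and `attend` and all the numbers are illustrative, and the paper's actual attention module may differ.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: Vector, keys: Sequence[Vector], values: Sequence[Vector]
) -> Tuple[List[float], List[float]]:
    """Return (weights, context) for one query over a sequence of keys/values."""
    scale = math.sqrt(len(query))
    # Relevance of each encoder position to the current query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The context vector is the weighted average of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


# Hypothetical encoder outputs (one vector per source frame) and decoder state.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_outputs, encoder_outputs)
print(weights)
print(context)
```

In this toy case the first and third source frames receive equal weight, and both outweigh the second, because they align better with the decoder state.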
The new method of voice retention prevents the system from generating speech in a different speaker’s voice. Unlike the voice retention system in the original Translatotron, the newer model does not rely on explicit IDs to identify the speaker. Instead, it can only retain the source speaker’s voice.
The research team believes this makes Translatotron 2 more applicable for production deployment by mitigating potential abuse for creating deep fakes or spoofed voices.
“To enable direct S2ST models to preserve each speaker’s voice for input with speaker turns, we augmented the training data by randomly sampling pairs of training examples and concatenating the source speech, the target speech, and the target phoneme sequences into new training examples,” the researchers wrote in the paper. “The resulting new examples contain two speakers’ voices in both the source and the target speech, which enables the model to learn on examples with speaker turns.”
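The augmentation described in the quote can be sketched as follows. This is a hedged illustration, not the researchers' code: the `Example` container, the `concat_pair` and `augment_with_speaker_turns` helpers and the toy data are all hypothetical, and details such as how many pairs are sampled are assumptions.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    source_speech: List[float]   # source-language audio samples
    target_speech: List[float]   # target-language audio samples
    target_phonemes: List[str]   # phonemes of the translated speech


def concat_pair(first: Example, second: Example) -> Example:
    """Join two examples so the result contains a speaker turn."""
    return Example(
        source_speech=first.source_speech + second.source_speech,
        target_speech=first.target_speech + second.target_speech,
        target_phonemes=first.target_phonemes + second.target_phonemes,
    )


def augment_with_speaker_turns(
    examples: List[Example], n_new: int, rng: Optional[random.Random] = None
) -> List[Example]:
    """Randomly sample pairs of examples and concatenate each pair."""
    if len(examples) < 2:
        raise ValueError("need at least two examples to form a pair")
    if rng is None:
        rng = random.Random()
    # rng.sample picks two distinct examples; their order is also random.
    return [concat_pair(*rng.sample(examples, 2)) for _ in range(n_new)]


# Two toy examples standing in for utterances by different speakers.
speaker_a = Example([0.1, 0.2], [0.3, 0.4, 0.5], ["o", "l", "a"])
speaker_b = Example([0.9], [0.8, 0.7], ["s", "i"])
new_examples = augment_with_speaker_turns(
    [speaker_a, speaker_b], n_new=3, rng=random.Random(0)
)
print(len(new_examples[0].source_speech))  # 3
```

All three fields are concatenated in the same order, so the source audio, target audio and target phonemes of each new example stay aligned across the speaker turn.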
The original Translatotron arrived amid growing concerns over deep fakes. According to Deeptrace, a Dutch startup, the number of deep fakes on the web doubled over nine months in 2019 and grew by 330% from then to June 2020. Most recently, the use of an AI-generated version of Anthony Bourdain’s voice sparked controversy.
In addition, identifying fakes is getting harder as the technology improves. In a recent report, The Brookings Institution outlined the range of political and social dangers deep fakes pose: “Distorting democratic discourse; manipulating elections; eroding trust in institutions; weakening journalism; exacerbating social divisions; undermining public safety; and inflicting hard-to-repair damage on the reputation of prominent individuals, including elected officials and candidates for office.”
Earlier this year, even the FBI warned that deep fakes are a critical emerging threat targeting businesses.