DeepMind Introduces EATS – An End-to-End Adversarial Text-To-Speech

Recently, researchers at DeepMind proposed EATS, an end-to-end adversarial text-to-speech (TTS) model. EATS operates on either pure text or raw (i.e. temporally unaligned) phoneme input sequences and produces raw speech waveforms as output.

Research on text-to-speech systems has grown impressively over the past few years. Artificial speech synthesis, commonly known as text-to-speech (TTS), has applications in domains such as technology interfaces, accessibility and entertainment, among others.

A TTS system takes natural language text as input and generates synthetic, human-like speech as output. Typical TTS pipelines consist of several stages that are designed and trained independently, for instance, text normalisation, aligned linguistic featurisation and raw audio waveform synthesis.

According to the researchers, although these pipelines have proven capable of realistic, high-fidelity speech synthesis, the modular approach comes with drawbacks. For instance, the pipelines often require supervision at each stage, which calls for expensive annotations to guide the outputs of every stage. As a result, they fail to reap the full potential rewards of data-driven “end-to-end” learning.

Behind EATS

In this research, a neural network (the generator) maps an input sequence of characters to raw audio at 24 kHz. The task is challenging because the input and output are not aligned: it is not known in advance which output samples correspond to which input token.

To address this challenge, the generator is divided into two blocks:

  1. Aligner – The aligner maps the unaligned input sequence to a representation that is aligned with the output but has a lower sample rate of 200 Hz.
  2. Decoder – The decoder upsamples the aligner’s output to the full audio sample rate.
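
To make the two-block split concrete, here is a minimal sketch of how an aligner could turn per-token features and predicted token lengths into a sequence aligned with the output at 200 Hz. It is an illustration under assumptions, not the researchers' implementation: the function name `align`, the `temperature` parameter and the softmax-over-squared-distance interpolation are choices made for this example, and the token lengths are supplied by hand rather than predicted by a network.

```python
import math

AUDIO_RATE_HZ = 24_000   # output sample rate stated in the article
ALIGNER_RATE_HZ = 200    # aligner output rate stated in the article
# Each aligner frame must be upsampled to this many audio samples by the decoder.
UPSAMPLE_FACTOR = AUDIO_RATE_HZ // ALIGNER_RATE_HZ


def align(token_features, token_lengths, num_frames, temperature=10.0):
    """Spread per-token features over num_frames aligner frames.

    token_features: list of N feature vectors (lists of floats), one per token.
    token_lengths:  list of N positive lengths, measured in aligner frames.
    temperature:    hypothetical knob; larger values blend neighbouring tokens more.
    """
    # Token centre = cumulative end position minus half the token's own length.
    centres, end = [], 0.0
    for length in token_lengths:
        end += length
        centres.append(end - length / 2.0)

    dim = len(token_features[0])
    frames = []
    for t in range(num_frames):
        # Softmax over tokens of the negative squared distance to each centre.
        logits = [-((t - c) ** 2) / temperature for c in centres]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        frames.append([
            sum(w * feat[d] for w, feat in zip(weights, token_features)) / total
            for d in range(dim)
        ])
    return frames


# Two tokens with one-hot features; the second token is assumed to last longer.
frames = align([[1.0, 0.0], [0.0, 1.0]], [3.0, 5.0], num_frames=8)
```

Every step here is an ordinary arithmetic operation, so the same computation expressed in a deep learning framework would let gradients flow back into the predicted lengths. Early frames are dominated by the first token and late frames by the second.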

The entire generator architecture is a feed-forward convolutional neural network, which makes it well-suited for applications where fast batched inference is important. According to the researchers, the generator is inspired by GAN-TTS, which is a text-to-speech generative adversarial network operating on aligned linguistic features. The researchers further employed the GAN-TTS generator as the decoder in their model, where its input comes from the aligner block.

Dataset Used

The researchers trained all the models on a dataset of high-quality recordings of human speech, performed by professional voice actors and paired with the corresponding text. The voice pool comprises 69 female and male North American English speakers, and the audio clips contain full sentences ranging from less than 1 second to 20 seconds in length, sampled at 24 kHz. The individual voices are unevenly distributed and together account for 260.49 hours of recorded speech.

Contributions By The Researchers

According to the researchers, the main contributions of this project are as follows:

  • The researchers demonstrated that a text-to-speech system can be learnt nearly end-to-end, producing high-fidelity, natural-sounding speech that approaches state-of-the-art TTS systems.
  • They proposed a fully differentiable and efficient feed-forward aligner architecture that predicts the duration of each input token and produces an audio-aligned representation.
  • They used flexible dynamic time warping-based prediction losses to enforce alignment with the input conditioning while still allowing the model to capture the variability of timing in human speech.
  • The overall system achieved a mean opinion score (MOS) of 4.083, approaching the state of the art set by models trained using richer supervisory signals.
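
The idea behind a dynamic time warping-based loss can be illustrated with the classic algorithm: two feature sequences are compared along the cheapest monotonic alignment path, so speech that is slightly faster or slower than the reference is not penalised as heavily as it would be under a frame-by-frame comparison. The sketch below is plain, hard-minimum dynamic time warping, not necessarily the variant the researchers trained with; the name `dtw_cost`, the `warp_penalty` parameter and the absolute-difference frame distance are assumptions made for this example.

```python
def dtw_cost(generated, target, warp_penalty=0.0):
    """Minimal cumulative cost of aligning two sequences of feature vectors.

    warp_penalty is a hypothetical extra cost charged whenever the path
    advances in only one of the two sequences (i.e. stretches time).
    """
    def frame_distance(a, b):
        # Sum of absolute differences between two feature vectors.
        return sum(abs(x - y) for x, y in zip(a, b))

    inf = float("inf")
    rows, cols = len(generated), len(target)
    # cost[i][j] = best cost of aligning generated[:i] with target[:j].
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            best_previous = min(
                cost[i - 1][j - 1],             # advance in both sequences
                cost[i - 1][j] + warp_penalty,  # advance only in generated
                cost[i][j - 1] + warp_penalty,  # advance only in target
            )
            cost[i][j] = frame_distance(generated[i - 1], target[j - 1]) + best_previous
    return cost[rows][cols]


target = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [0.0], [1.0], [2.0]]  # same content, first frame held longer

free_warp = dtw_cost(stretched, target)                        # timing differences cost nothing
penalised = dtw_cost(stretched, target, warp_penalty=1.0)      # one warp step is charged
```

With no penalty, the stretched sequence matches the target perfectly. Adding a penalty makes the loss prefer alignments that stay close to the diagonal while still tolerating some variation in timing.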

Wrapping Up

The researchers stated, “We have presented an adversarial approach to text-to-speech synthesis which can learn from a relatively weak supervisory signal – normalised text or phonemes paired with corresponding speech audio.” They added, “The speech generated by our proposed model matches the given conditioning texts with naturalness approaching the state-of-the-art systems with multi-stage training pipelines or additional supervision.”

Read the paper here.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
