
DeepMind Extends Abilities Of Machines To Generate High Fidelity Speech With GAN-TTS


Generative adversarial networks (GANs) have achieved state-of-the-art results in image and video generation, and have been successfully applied to unsupervised feature learning, among many other applications. GANs have developed rapidly in recent years; however, their potential for audio generation has received comparatively little attention.

In an attempt to explore the audio generation abilities of GANs, a team of DeepMind researchers has published a paper introducing a new model called GAN-TTS.

Audio Generation With Deep Learning So Far

Text-to-speech (TTS) is the process of converting text into human-like voice output.

Many audio generation models operate in the waveform domain. They directly model the amplitude of the waveform as it evolves over time. Autoregressive models achieve this by factorising the joint distribution into a product of conditional distributions. 
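
Concretely, for a waveform of samples x = (x_1, …, x_T), the autoregressive factorisation is:

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

Each new sample is predicted from all previously generated samples, which is what forces generation to proceed one sample at a time.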

Invertible feed-forward models, by contrast, can be trained by distilling an autoregressive model using probability density distillation.
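
Stated loosely, probability density distillation trains a parallel "student" model to match an already-trained autoregressive "teacher" by minimising the Kullback-Leibler divergence between the two:

```latex
D_{\mathrm{KL}}\left(p_{\mathrm{student}} \,\|\, p_{\mathrm{teacher}}\right)
  = \mathbb{E}_{x \sim p_{\mathrm{student}}}\left[\log p_{\mathrm{student}}(x) - \log p_{\mathrm{teacher}}(x)\right]
```

This is how Parallel WaveNet obtains a fast feed-forward sampler from a slow autoregressive one.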

In the past, models such as Deep Voice 2 and 3 and Tacotron 2 have achieved good results by first generating an intermediate representation of the desired output (such as a spectrogram), and then using a separate autoregressive model to turn it into a waveform and fill in any missing information. However, since the intermediate outputs are imperfect, the waveform model has the additional task of correcting any mistakes.
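
Schematically, such a two-stage pipeline looks like the sketch below; the function names are hypothetical and stand in for whatever spectrogram predictor and vocoder a given system uses.

```python
# Hypothetical two-stage TTS pipeline (names are illustrative, not real APIs).
def two_stage_tts(text, spectrogram_model, vocoder):
    # Stage 1: predict an intermediate representation, e.g. a mel-spectrogram.
    mel = spectrogram_model(text)
    # Stage 2: a separate (often autoregressive) waveform model converts it to
    # audio, implicitly correcting imperfections in the intermediate output.
    return vocoder(mel)
```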

GANs, too, have been explored for audio: WaveGAN and GANSynth both applied them successfully, but to much simpler audio datasets.

The authors note that GANs had not yet been applied to large-scale audio generation tasks. With GAN-TTS, they attempt to close that gap.

Overview Of GAN-TTS

 

Figure: GAN-TTS architecture, via the DeepMind paper

GAN-TTS is a generative adversarial network for text-conditional, high-fidelity speech synthesis. Its feed-forward generator, a convolutional neural network shown in the figure above, is coupled with an ensemble of discriminators that evaluate the generated (and real) audio based on multi-frequency random windows.

The inner workings of the generator and the discriminators can be summarised as follows (a minimal sketch of these building blocks appears after the list):

  • The generator has seven “GBlocks,” each containing two skip connections: the first performs upsampling if the output frequency is higher than the input.
  • The second contains a size-1 convolution when the number of output channels does not match the input channels.
  • The convolutions are preceded by Conditional Batch Normalisation. Blocks 3–7 gradually up-sample the temporal dimension of hidden representations.
  • The final convolutional layer with Tanh activation then produces a single-channel audio waveform.
  • The discriminators, in turn, consist of blocks (DBlocks) that are similar to the GBlocks used in the generator, but without batch normalisation.
  • Instead of a single discriminator, an ensemble of Random Window Discriminators (RWDs) was used.
  • Notably, the number of discriminators only affects the training computation requirements, as at inference time only the generator network is used.
  • In the first layer of each discriminator, the input raw waveform is downsampled to a constant temporal dimension.
  • The conditional discriminators have access to linguistic and pitch features and can measure whether the generated audio matches the input conditioning.
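
To make the block structure more concrete, below is a minimal PyTorch-style sketch of a GBlock-like residual block. The channel counts, kernel sizes and the way conditioning enters the normalisation are illustrative assumptions rather than the authors' exact implementation; the paper's blocks are richer (for instance, they use stacks of dilated convolutions), so treat this purely as a shape sketch.

```python
# A minimal, simplified sketch of a GBlock-like residual block (assumptions,
# not the authors' code): conditional normalisation, optional temporal
# up-sampling, and a size-1 convolution on the skip path when channels change.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GBlockSketch(nn.Module):
    def __init__(self, in_channels, out_channels, upsample_factor, cond_dim):
        super().__init__()
        self.upsample_factor = upsample_factor
        # Stand-in for conditional batch normalisation: plain BatchNorm followed
        # by a scale and shift predicted from the conditioning vector.
        self.bn = nn.BatchNorm1d(in_channels)
        self.cond_scale = nn.Linear(cond_dim, in_channels)
        self.cond_shift = nn.Linear(cond_dim, in_channels)
        self.conv1 = nn.Conv1d(in_channels, out_channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1)
        # Size-1 convolution on the skip path when channel counts differ.
        self.skip = (nn.Conv1d(in_channels, out_channels, kernel_size=1)
                     if in_channels != out_channels else nn.Identity())

    def forward(self, x, cond):
        # x: (batch, channels, time), cond: (batch, cond_dim)
        h = self.bn(x)
        h = h * self.cond_scale(cond).unsqueeze(-1) + self.cond_shift(cond).unsqueeze(-1)
        h = F.relu(h)
        if self.upsample_factor > 1:  # up-sample the temporal dimension
            h = F.interpolate(h, scale_factor=self.upsample_factor)
            x = F.interpolate(x, scale_factor=self.upsample_factor)
        h = self.conv2(F.relu(self.conv1(h)))
        return h + self.skip(x)       # residual (skip) connection
```

A DBlock would look much the same, but without the normalisation, as described above.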

The results of the experiments show that GAN-TTS is capable of generating high-fidelity speech, with the best model achieving a mean opinion score (MOS) of 4.2, only 0.2 below state-of-the-art performance.

Conclusion

The researchers believe that the use of RWDs is the game-changer here, although they admit they do not fully know why.

They posit that RWDs work much better than the full discriminator because of the relative simplicity of the distributions that the former must discriminate between, and the number of different samples one can draw from these distributions.
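
A rough way to see this intuition in code is sketched below; the clip and window sizes are assumptions rather than the paper's exact settings. Each discriminator is shown a short window cropped at a random offset, so a single training clip yields a very large number of distinct windows, and each window is a much simpler object to judge than the full utterance.

```python
# Illustrative random-window cropping (sizes are assumptions, not the paper's).
import torch

def random_window(waveform, window_size):
    # waveform: (batch, 1, time) raw audio
    start = torch.randint(0, waveform.shape[-1] - window_size + 1, (1,)).item()
    return waveform[..., start:start + window_size]

clip = torch.randn(1, 1, 48000)       # roughly 2 seconds of 24 kHz audio
window = random_window(clip, 3840)    # one of tens of thousands of possible windows
```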

GAN-TTS is capable of generating high-fidelity speech with naturalness comparable to state-of-the-art models, and unlike autoregressive models, it is highly parallelizable thanks to an efficient feed-forward generator.

Though the widely popular WaveNet has been around for a while, it largely depends on the sequential generation of one audio sample at a time, which is undesirable for present-day applications. GANs, however, being highly parallelizable, make for a much better option for generating audio from text.
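
A toy sketch of this difference (purely illustrative, not code from either paper): autoregressive sampling is forced into a loop over time steps, whereas a feed-forward generator emits the whole waveform in a single call that parallelises across the time dimension.

```python
# Toy contrast between sequential autoregressive sampling and a single
# feed-forward pass (the function arguments are hypothetical placeholders).
import torch

def autoregressive_generate(next_sample_fn, length):
    # next_sample_fn maps the samples generated so far to the next sample value.
    samples = []
    for _ in range(length):                  # inherently sequential
        history = torch.tensor(samples)      # everything generated so far
        samples.append(float(next_sample_fn(history)))
    return torch.tensor(samples)

def feed_forward_generate(generator, noise, linguistic_features):
    # A single batched forward pass produces all output samples at once.
    return generator(noise, linguistic_features)
```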
