Google’s Latest Model To Recreate Natural Voices Of People Is Totally Magical

Developed by Google’s Project Euphonia, the PnG NAT model helps recreate natural voices of people with speech impairments.

June 02 was a rather eventful day for Major League Baseball in the US. It celebrated Lou Gehrig Day, remembering the Yankees' Iron Horse on the date he first started at first base for the team and the date he passed away from amyotrophic lateral sclerosis, or ALS. 

Also known as Lou Gehrig's disease or motor neurone disease, ALS is a progressive nervous system disease that affects nerve cells in the brain and spinal cord, resulting in the loss of muscle control and, ultimately, the patient's ability to speak, eat, move, and even breathe. To honour Gehrig, Steve Gleason, a former National Football League player and ALS advocate, recited Gehrig's famous "Luckiest Man" speech. What was significant about the recitation was that Gleason, who is also battling ALS, used a machine learning (ML) model to recreate his natural voice. 

ML Model for Natural Voice Recreation

Developed in collaboration with Google's Project Euphonia (a Google Research initiative to help people with atypical speech), this ML model aims to help people whose speaking ability is impaired by ALS be better understood. 

The ML model, called PnG NAT, is a text-to-speech (TTS) synthesis model that merges two technologies, PnG BERT and Non-Attentive Tacotron (NAT), into a single model. Google suggests that combining the two delivered better quality and fluency than its previous technologies, opening up greater possibilities. 

In its recently published blog post, Google explains the model. 

Text-to-Speech Synthesis

NAT is a sequence-to-sequence neural TTS model and a successor to Google's Tacotron 2.

Tacotron 2 generates human-like speech from text using neural networks trained only on speech examples and their corresponding text transcripts. It used an attention module to connect the input text sequence to the output sequence of speech spectrogram frames, enabling the model to identify the important parts of the text and attend to them when generating each time step of the synthesised spectrogram. It was the first TTS model to synthesise speech as natural as a human speaking. However, the team at Google identified one shortcoming: the model could suffer from robustness issues. Because of the inherent flexibility of the attention mechanism, it could babble, repeat itself, or skip parts of the text. 
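To see why attention can misbehave in this way, here is a toy dot-product attention sketch in Python (this is not Tacotron 2's actual location-sensitive attention, and all shapes are made up): at every output frame the model computes a soft weighting over all input phonemes, and nothing in the mechanism itself forces those weights to march monotonically through the text.

```python
import torch
import torch.nn.functional as F

# Toy attention alignment: one weighting over ALL input phonemes per
# output frame. Because the weights are unconstrained, a poorly trained
# or confused model can revisit (repeat) or miss (skip) phonemes.
num_phonemes, num_frames, dim = 5, 8, 16
encoder_out = torch.randn(num_phonemes, dim)    # one vector per phoneme
decoder_queries = torch.randn(num_frames, dim)  # one query per output frame

scores = decoder_queries @ encoder_out.T        # (num_frames, num_phonemes)
alignment = F.softmax(scores, dim=-1)           # each row sums to 1
context = alignment @ encoder_out               # attended input per frame
print(alignment.shape, context.shape)           # torch.Size([8, 5]) torch.Size([8, 16])
```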

This is where the improved version, NAT, comes in. 


NAT replaces the attention module in Tacotron 2 with a duration-based upsampler. The upsampler predicts the duration of each input phoneme and upsamples the encoded phoneme representations so that the output length matches the length of the predicted speech spectrogram. This removes the robustness issue and also improves the quality of the synthesised speech, making it more natural. Since people with ALS often have disfluent speech, NAT's accurate control over speech duration helps make the recreated voice more fluent. 
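A rough way to picture the duration-based upsampler is as a repeat operation: each phoneme's encoding is copied once per predicted spectrogram frame, so the decoder always knows exactly which phoneme it is rendering. The PyTorch sketch below is a minimal illustration with made-up shapes; NAT's actual upsampler is more sophisticated (it uses Gaussian upsampling rather than hard repetition).

```python
import torch

def upsample_by_duration(phoneme_encodings: torch.Tensor,
                         durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme encoding for its predicted number of
    spectrogram frames, so the decoder input length matches the
    target spectrogram length exactly (no attention needed).

    phoneme_encodings: (num_phonemes, hidden_dim)
    durations:         (num_phonemes,) integer frame counts
    """
    return torch.repeat_interleave(phoneme_encodings, durations, dim=0)

# Toy example: 3 phonemes, hidden size 4, predicted durations of 2, 5, 3 frames.
encodings = torch.randn(3, 4)
durations = torch.tensor([2, 5, 3])
frames = upsample_by_duration(encodings, durations)
print(frames.shape)  # torch.Size([10, 4]) -- 2 + 5 + 3 frames
```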

PnG BERT

To improve the natural language understanding of TTS input, the Google Research team applied PnG BERT, which is pre-trained with self-supervision on both the phoneme representation and the grapheme representation of the same content from a large text corpus. It is then used as the encoder of the TTS model, resulting in improved pronunciation of the synthesised speech.
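The key idea is that the same sentence is fed to the model twice, once as phonemes and once as graphemes, in a BERT-style two-segment layout. The snippet below is a hypothetical illustration of that input layout; the helper, token ids, and special symbols are all made up and are not Google's actual preprocessing code.

```python
# Hypothetical illustration of PnG BERT-style input: the phoneme sequence
# and the grapheme sequence of the SAME sentence are concatenated into one
# token sequence, with segment ids marking which half each token belongs to.

def build_png_bert_input(phonemes, graphemes, cls_id=0, sep_id=1):
    tokens = [cls_id] + phonemes + [sep_id] + graphemes + [sep_id]
    # 0 = phoneme segment, 1 = grapheme segment
    segments = [0] * (len(phonemes) + 2) + [1] * (len(graphemes) + 1)
    return tokens, segments

phonemes = [101, 102, 103]   # made-up ids for the phonemes of a short word
graphemes = [201, 202]       # made-up ids for its letters
tokens, segments = build_png_bert_input(phonemes, graphemes)
print(tokens)    # [0, 101, 102, 103, 1, 201, 202, 1]
print(segments)  # [0, 0, 0, 0, 0, 1, 1, 1]
```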

The PnG NAT model integrates the pre-trained PnG BERT as the encoder of the NAT model. The encoder output is used to predict the duration of each phoneme and is then upsampled to match the length of the spectrogram. Finally, a non-attentive decoder converts the upsampled hidden representations into speech spectrograms, which a neural vocoder converts into audio waveforms. 
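Putting the pieces together, the overall flow can be sketched as below. Every component name here is a hypothetical stand-in for a trained neural module; this is a reading aid for the pipeline described above, not Google's implementation.

```python
import torch

def synthesize(text, encoder, duration_predictor, decoder, vocoder):
    # 1. PnG BERT-style encoder consumes the phonemes + graphemes of the text.
    hidden = encoder(text)                        # (num_phonemes, hidden_dim)

    # 2. Predict a duration (in spectrogram frames) for each phoneme,
    #    rounded to whole frames for this simplified sketch.
    durations = duration_predictor(hidden).round().long()  # (num_phonemes,)

    # 3. Upsample encoder outputs so their length matches the spectrogram.
    upsampled = torch.repeat_interleave(hidden, durations, dim=0)

    # 4. A non-attentive decoder produces the mel spectrogram, and a
    #    neural vocoder turns the spectrogram into an audio waveform.
    mel = decoder(upsampled)                      # (num_frames, num_mels)
    waveform = vocoder(mel)                       # (num_samples,)
    return waveform
```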


Summing up

In order to recreate Gleason's voice, Google researchers first trained a PnG NAT model with recordings from 31 professional speakers. They then fine-tuned it with 30 minutes of Gleason's recordings, which contained some slurring and disfluency since they were made after Gleason was diagnosed with ALS.

To further improve the naturalness and quality, the team leveraged NAT's phoneme duration control together with the model trained on professional speakers. The researchers predicted the duration of each phoneme for both a professional speaker and Gleason, and used the geometric mean of the two values for each phoneme to guide the NAT output. The model was thus able to speak in Gleason's voice, but more fluently. 
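The geometric-mean step itself is simple arithmetic: for each phoneme, the guiding duration is the square root of the product of the two predicted durations. The sketch below uses invented numbers purely to show the effect, which pulls unusually long (disfluent) durations toward the professional speaker's timing while keeping some of the target speaker's rhythm.

```python
import numpy as np

# Hypothetical per-phoneme durations (in frames) predicted for the same
# sentence by a professional-speaker model and by Gleason's fine-tuned model.
pro_durations     = np.array([4.0, 9.0, 6.0, 12.0])
gleason_durations = np.array([6.0, 16.0, 7.0, 20.0])

# Geometric mean per phoneme: sqrt(d_pro * d_gleason).
guided_durations = np.sqrt(pro_durations * gleason_durations)
print(guided_durations)  # approx. [4.9, 12.0, 6.5, 15.5]
```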

