Google’s Latest Model To Recreate Natural Voices Of People Is Totally Magical

Developed by Google’s Project Euphonia, the PnG NAT model helps recreate natural voices of people with speech impairments.

June 2 was an eventful day for US Major League Baseball. The league celebrated Lou Gehrig Day, remembering the Iron Horse of the Yankees on the date he became the team’s starting first baseman and, years later, the date he passed away from amyotrophic lateral sclerosis, or ALS.

Also known as Lou Gehrig’s disease or motor neurone disease, ALS is a progressive disease of the nervous system that affects nerve cells in the brain and spinal cord, resulting in the loss of muscle control and, ultimately, the patient’s ability to speak, eat, move, and even breathe. To honour Gehrig, Steve Gleason, a former National Football League player and ALS advocate, recited Gehrig’s famous Luckiest Man speech. What made the recitation significant was that Gleason, who is also battling ALS, used a machine learning (ML) model to recreate his natural voice.

ML Model for Natural Voice Recreation

Developed in collaboration with Google’s Project Euphonia (a Google Research initiative to help people with atypical speech), this ML model aims to empower people with impaired speaking ability resulting from ALS to be better understood. 


The ML model, called the PnG NAT model, is a text-to-speech (TTS) synthesis model that merges two technologies, PnG BERT and Non-Attentive Tacotron (NAT), into a single model. According to Google, combining the two delivered better quality and fluency than previous technologies.

In its recently published blog post, Google explains the model. 

Text-to-Speech Synthesis

NAT is a sequence-to-sequence neural TTS model and a successor to Google’s Tacotron 2.

Tacotron 2 generates human-like speech from text using neural networks trained only on speech examples and their corresponding text transcripts. It uses an attention module to connect the input text sequence to the output speech spectrogram frame sequence, which lets the model identify the relevant portions of the text and attend to them when generating each time step of the synthesised speech spectrogram. It was the first TTS model to successfully synthesise speech as natural as a human speaking. However, the team at Google identified one shortcoming: the model could suffer from robustness issues. That is, it sometimes babbled, repeated or skipped parts of the text, a side effect of the inherent flexibility of the attention mechanism.

This is where the improved version, NAT, comes in. 

Source: Google

NAT replaces the attention module in Tacotron 2 with a duration-based upsampler. The upsampler predicts the duration of each input phoneme and upsamples the encoded phoneme representation so that the output length matches the length of the predicted speech spectrogram. This removes the robustness issue and also improves the quality of the synthesised speech, making it more natural. People with ALS often have disfluent speech. To address this, NAT allows precise control of speech duration, which helps the recreated voice sound fluent.
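The idea of duration-based upsampling can be illustrated with a minimal sketch. It is not Google's implementation: the function name, the toy two-dimensional vectors and the duration values are all hypothetical, and plain repetition is used here as a simplification of whatever upsampling the real model learns. The point it shows is that when every phoneme is emitted for a fixed, predicted number of frames, the output cannot skip or repeat parts of the text.

```python
from typing import List, Sequence


def upsample_by_duration(
    phoneme_vectors: Sequence[Sequence[float]],
    durations_in_frames: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme's encoded vector once per predicted frame.

    Hypothetical helper for illustration only; a simplification that uses
    plain repetition rather than a learned upsampling.
    """
    if len(phoneme_vectors) != len(durations_in_frames):
        raise ValueError("need exactly one duration per phoneme")

    frames: List[List[float]] = []
    for vector, n_frames in zip(phoneme_vectors, durations_in_frames):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Each phoneme is emitted exactly n_frames times, in input order,
        # so no phoneme can be skipped or looped over.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Made-up encodings for three phonemes and made-up predicted durations.
encoded = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]

upsampled = upsample_by_duration(encoded, durations)
print(len(upsampled))  # 6, the sum of the durations
```

Because the output length is simply the sum of the predicted durations, scaling those durations up or down is also a direct way to control how long each sound lasts.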


To improve the natural language understanding of the TTS input, the Google Research team applied PnG BERT, which is pre-trained with self-supervision on both the phoneme and grapheme representations of the same content from a large text corpus. It is then used as the encoder of the TTS model, resulting in improved pronunciation of the synthesised speech.

The PnG NAT model integrates the pre-trained PnG BERT as the encoder of the NAT model. NAT uses the encoder output to predict the duration of each phoneme, and the hidden representations are then upsampled to match the length of the spectrogram. Finally, a non-attentive decoder converts the upsampled hidden representations into audio speech spectrograms, which a neural vocoder then converts into audio waveforms.

 Source: Google


Summing up

To recreate Gleason’s voice, Google researchers first trained a PnG NAT model on recordings from 31 professional speakers. They then fine-tuned it with 30 minutes of Gleason’s recordings, which contain slurring and some disfluency because they were made after Gleason was diagnosed with ALS.

However, to make the voice sound more natural and improve its quality, the team leveraged NAT’s phoneme duration control together with the model trained on professional speakers. After predicting the duration of each phoneme for both a professional speaker and Gleason, the researchers used the geometric mean of the two values for each phoneme to guide the NAT output. As a result, the model was able to speak in Gleason’s voice, and more fluently.
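The duration blending described above comes down to simple arithmetic: for each phoneme, take the square root of the product of the two predicted durations. The sketch below shows only that calculation. The function name and the duration values are hypothetical and chosen to give round numbers; they are not taken from Google's model or data.

```python
import math
from typing import List, Sequence


def blend_durations(
    professional: Sequence[float],
    target: Sequence[float],
) -> List[float]:
    """Per-phoneme geometric mean of two predicted duration sequences.

    Hypothetical helper for illustration; both inputs hold one predicted
    duration (in frames) per phoneme of the same sentence.
    """
    if len(professional) != len(target):
        raise ValueError("both sequences must cover the same phonemes")
    # Geometric mean of two values a and b is sqrt(a * b).
    return [math.sqrt(p * t) for p, t in zip(professional, target)]


# Made-up predicted durations (in frames) for three phonemes.
professional_speaker = [4.0, 9.0, 5.0]
gleason = [9.0, 16.0, 5.0]

blended = blend_durations(professional_speaker, gleason)
print(blended)  # [6.0, 12.0, 5.0]
```

Each blended value lies between the two inputs, so the pacing moves towards the professional speaker’s fluency while still reflecting the target speaker’s own timing.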


Debolina Biswas
After diving deep into the Indian startup ecosystem, Debolina is now a Technology Journalist. When not writing, she is found reading or playing with paint brushes and palette knives. She can be reached at
