June 2 was an eventful day for Major League Baseball in the US. The league celebrated Lou Gehrig Day, remembering the Yankees' Iron Horse on the date he became the team's starting first baseman and the date he died of amyotrophic lateral sclerosis, or ALS.
Also known as Lou Gehrig’s disease or motor neurone disease, ALS is a progressive disease of the nervous system. It affects nerve cells in the brain and spinal cord, causing loss of muscle control and, ultimately, of the patient’s ability to speak, eat, move and even breathe. To honour Gehrig, former National Football League player and ALS advocate Steve Gleason recited Gehrig’s famous Luckiest Man speech. The recitation was significant because Gleason, who is also battling ALS, used a machine learning (ML) model to recreate his natural voice.
ML Model for Natural Voice Recreation
Developed in collaboration with Google’s Project Euphonia (a Google Research initiative to help people with atypical speech), this ML model aims to empower people with impaired speaking ability resulting from ALS to be better understood.
The ML model, called the PnG NAT model, is a text-to-speech synthesis (TTS) model that merges two technologies, PnG BERT and Non-Attentive Tacotron (NAT), into a single model. According to Google, the combination delivers better quality and fluency than the previous technologies.
In its recently published blog post, Google explains the model.
NAT is a sequence-to-sequence neural TTS model and a successor to Google’s Tacotron 2.
Tacotron 2 generates human-like speech from text using neural networks trained only on speech examples and their corresponding text transcripts. It used an attention module to connect the input text sequence to the output sequence of speech spectrogram frames. This let the model identify the relevant portions of the text and attend to them when generating each time step of the synthesised spectrogram. It was the first TTS model to successfully synthesise speech as natural as a human speaking. However, the team at Google identified one shortcoming: the model could suffer from robustness problems. Because the attention mechanism is inherently flexible, the model sometimes babbled, repeated parts of the text or skipped them.
This is where the improved version, NAT, comes in.
NAT replaces the attention module in Tacotron 2 with a duration-based upsampler. The upsampler predicts the duration of each input phoneme and upsamples the encoded phoneme representation so that its length matches that of the predicted speech spectrogram. This removes the robustness issue and also improves the quality of the synthesised speech, making it more natural. People with ALS often have disfluent speech. To address this, NAT allows precise control of speech duration, which makes the recreated voice more fluent.
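The idea of duration-based upsampling can be illustrated with a small sketch. It assumes the simplest possible scheme, in which each encoded phoneme vector is repeated for its predicted number of frames. The function name `upsample_by_duration` and the toy values are hypothetical, and hard repetition is a simplification that is not necessarily what NAT does internally.

```python
from typing import List, Sequence


def upsample_by_duration(
    encoded: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each encoded phoneme vector for its predicted number of frames.

    Illustrative only: a hard-repeat upsampler, not NAT's actual upsampling layer.
    """
    if len(encoded) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[List[float]] = []
    for vector, n_frames in zip(encoded, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # One copy of the phoneme vector per output spectrogram frame.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Toy example: three phonemes with made-up 2-dimensional encodings.
encoded = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]  # predicted frames per phoneme (made-up values)
frames = upsample_by_duration(encoded, durations)
print(len(frames))  # equals sum(durations)
```

Because the output length is fixed by the durations rather than by a learned alignment, every phoneme is visited exactly once and in order, which is why skipping and repetition cannot occur in this kind of design.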
To improve the natural language understanding of the TTS input, the Google Research team applied PnG BERT, which is pre-trained with self-supervision on both the phoneme representation and the grapheme representation of the same content from a large text corpus. It is then used as the encoder of the TTS model, resulting in improved pronunciation of the synthesised speech.
The PnG NAT model integrates the pre-trained PnG BERT as the encoder of the NAT model. NAT uses the encoder output to predict the duration of each phoneme, and the output is then upsampled to match the length of the spectrogram. Finally, a non-attentive decoder converts the upsampled hidden representations into speech spectrograms, which a neural vocoder converts into audio waveforms.
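To make "both the phoneme representation and the grapheme representation of the same content" concrete, here is a minimal sketch of how such a two-view input could be assembled. It assumes the two views are concatenated into one sequence with BERT-style `[CLS]` and `[SEP]` tokens and a segment ID per token, which the text above does not spell out. The function name, the character-level graphemes and the phoneme symbols are illustrative choices, not the model's actual vocabulary.

```python
from typing import List, Tuple


def build_two_view_input(
    words: List[str],
    phonemes: List[List[str]],
) -> Tuple[List[str], List[int]]:
    """Concatenate a phoneme view and a grapheme view of the same text.

    Segment 0 marks phoneme tokens, segment 1 marks grapheme tokens.
    Hypothetical layout for illustration only.
    """
    if len(words) != len(phonemes):
        raise ValueError("need one phoneme list per word")
    tokens, segments = ["[CLS]"], [0]
    for word_phonemes in phonemes:
        tokens.extend(word_phonemes)
        segments.extend([0] * len(word_phonemes))
    tokens.append("[SEP]")
    segments.append(0)
    for word in words:
        # Characters stand in for graphemes here; a real model may use subwords.
        tokens.extend(list(word))
        segments.extend([1] * len(word))
    tokens.append("[SEP]")
    segments.append(1)
    return tokens, segments


tokens, segments = build_two_view_input(
    ["hello", "world"],
    [["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]],
)
print(tokens)
print(segments)
```

Seeing both views of the same words lets an encoder relate spelling to pronunciation, which is one plausible reason for the improved pronunciation the article mentions.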
To recreate Gleason’s voice, Google researchers first trained a PnG NAT model on recordings from 31 professional speakers. They then fine-tuned it on 30 minutes of Gleason’s recordings. These recordings were made after Gleason was diagnosed with ALS, so they contain slurring and some disfluency.
To make the voice sound more natural and improve its quality, the team leveraged NAT’s phoneme duration control together with the model trained on professional speakers. They predicted the duration of each phoneme for both a professional speaker and Gleason, then used the geometric mean of the two values for each phoneme to guide the NAT output. The model was thus able to speak in Gleason’s voice, but more fluently.
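The duration blending described above comes down to a per-phoneme geometric mean, sqrt(a × b). A minimal sketch follows, with a hypothetical function name and made-up frame counts rather than values from the actual model:

```python
import math
from typing import List, Sequence


def blend_durations(
    professional: Sequence[float],
    target: Sequence[float],
) -> List[float]:
    """Per-phoneme geometric mean of two predicted duration sequences."""
    if len(professional) != len(target):
        raise ValueError("both sequences must cover the same phonemes")
    return [math.sqrt(p * t) for p, t in zip(professional, target)]


# Made-up predicted durations (in frames) for the same three phonemes.
professional_frames = [4.0, 9.0, 5.0]
target_speaker_frames = [9.0, 16.0, 5.0]
blended = blend_durations(professional_frames, target_speaker_frames)
print(blended)  # [6.0, 12.0, 5.0]
```

The geometric mean always lies between the two inputs and averages them on a ratio scale, so a phoneme that the target speaker holds four times longer than the professional ends up twice as long as the professional's, not two and a half times as it would with an arithmetic mean.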