Speech is one of the most important, and almost always the primary, modes of human communication, and it underpins a wide range of services. From call centres to Amazon’s Alexa, industries and products are driven by speech. Many of these processes are automated: a voice is recorded and then played back when a service is invoked. There has been a growing need to make such services more relatable and relevant, in short, more human-like.
The advent of machine learning brought a remarkable rise in the number of speech synthesis projects. Google’s WaveNet paper is one example that catalysed the whole domain.
Here are the top advancements in speech synthesis that have been boosted by the introduction of deep learning:
Real-Time Voice Cloning
This model was open-sourced in June 2019 as an implementation of the paper Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis.
This service is being offered by Resemble.ai. With this product, one can clone any voice and create dynamic, iterable, and unique voice content.
Users input a short voice sample, and the model, with no retraining required, can immediately deliver text-to-speech utterances in the style of the sampled voice.
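The paper’s core idea is a speaker encoder that condenses a short voice sample into a fixed-length embedding, which the synthesiser then conditions on. A minimal numpy sketch of that embed-and-compare step follows; in the real system the per-frame features come from a trained LSTM encoder, whereas random features stand in here:

```python
import numpy as np

rng = np.random.default_rng(0)

def dvector(frames):
    """Average per-frame features into one fixed-length, L2-normalised
    speaker embedding (a 'd-vector'), as the SV2TTS speaker encoder does."""
    emb = frames.mean(axis=0)
    return emb / np.linalg.norm(emb)

def similarity(a, b):
    """Cosine similarity between two embeddings: high for the same voice."""
    return float(np.dot(a, b))

# Toy 'voices': a fixed per-speaker vector plus per-frame variation.
voice_a = rng.normal(size=256)
voice_b = rng.normal(size=256)
utt1 = voice_a + 0.3 * rng.normal(size=(50, 256))   # speaker A, utterance 1
utt2 = voice_a + 0.3 * rng.normal(size=(50, 256))   # speaker A, utterance 2
utt3 = voice_b + 0.3 * rng.normal(size=(50, 256))   # speaker B

print(similarity(dvector(utt1), dvector(utt2)))  # same speaker: near 1
print(similarity(dvector(utt1), dvector(utt3)))  # different speaker: near 0
```

Because embeddings from the same voice land close together, the synthesiser can reproduce an unseen speaker from just one short sample.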
Listen To Audiobooks In Your Own Voice
Bengaluru-based Deepsync offers an augmented intelligence that learns the way you speak. That is correct: it creates a digital model of the user’s voice and, using advanced forms of deep learning, learns hundreds of features, from their accent to the way they subtly express themselves.
Once the voice is synced with Deepsync, it can produce 80–90% of the user’s recorded content.
Guessing Face From Speech
Researchers at MIT developed an algorithm, Speech2Face, that can listen to a voice and guess the face of the speaker with decent accuracy.
During training, the model learns voice-face correlations that allow it to produce images capturing the age, gender and ethnicity of the speakers. The faces are generated in a self-supervised manner, by exploiting the natural co-occurrence of faces and speech in internet videos, without the need to model attributes explicitly.
Its applications span a wide range, from identifying a speaker at a remote location to giving a voice to those with speech impediments by reverse-engineering their facial features.
Facebook’s MelNet Clones Bill Gates’ Voice
“A cramp is no small danger on a swim,” Bill Gates cautions. “Write a fond note to the friend you cherish,” he advises. These are a few of the audio clips released by Facebook AI; however, each one was generated by a machine learning system named MelNet, designed and created by engineers at Facebook.
MelNet combines a highly expressive autoregressive model with a multiscale modelling scheme to generate high-resolution spectrograms that have realistic structure on both local and global scales.
The applications of MelNet cover a diverse set of tasks, including unconditional speech generation, music generation, and text-to-speech synthesis.
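Modelling 2-D spectrograms rather than raw 1-D waveforms is what lets MelNet capture structure at both scales. A rough numpy sketch of the kind of log-magnitude time-frequency representation involved (a real pipeline would also apply a mel filterbank, omitted here; frame and hop sizes are illustrative):

```python
import numpy as np

def log_spectrogram(wave, n_fft=512, hop=128):
    """Frame the waveform, apply a Hann window, and take the log-magnitude
    FFT of each frame: a 2-D time-frequency image of the audio."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(mag + 1e-6)

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz tone
spec = log_spectrogram(tone)
print(spec.shape)                    # (time frames, n_fft // 2 + 1)
```

For the pure tone, energy concentrates in the frequency bin nearest 440 Hz in every frame, which is exactly the kind of stable global structure a waveform-level model struggles to represent directly.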
Human-Like Speech With Amazon Polly
Amazon’s Text-to-Speech (TTS) service, Polly, uses advanced deep learning technologies to synthesise speech that sounds like a human voice.
Amazon Polly offers Neural Text-to-Speech (NTTS) voices, from which one can select the ideal voice and build speech-enabled applications suited to different regions.
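As an illustration, a neural Polly request made through boto3 (the AWS SDK for Python) looks roughly like the following. The voice name and region are example values, and the commented-out call requires valid AWS credentials:

```python
# Parameters for Polly's SynthesizeSpeech API; Engine='neural' selects
# an NTTS voice rather than the older 'standard' engine.
request = {
    "Text": "Hello from Amazon Polly.",
    "OutputFormat": "mp3",
    "VoiceId": "Joanna",     # example NTTS-capable voice
    "Engine": "neural",
}

# Uncomment to actually synthesise speech (needs AWS credentials):
# import boto3
# polly = boto3.client("polly", region_name="us-east-1")
# audio = polly.synthesize_speech(**request)["AudioStream"].read()
# with open("speech.mp3", "wb") as f:
#     f.write(audio)

print(sorted(request))
```

Swapping the VoiceId is all it takes to target a different language or regional accent.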
“Amazon Polly voices are not just high in quality, but are as good as natural human speech for teaching a language,” said Severin Hacker of Duolingo, the world’s most popular language-learning platform.
Text2Speech With GANs
GAN-TTS is a generative adversarial network that generates speech from text, with results demonstrating high-fidelity speech synthesis. The model’s feed-forward generator is a convolutional neural network coupled with an ensemble of discriminators that evaluate the generated (and real) audio based on multi-frequency random windows.
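The random-window idea can be sketched in a few lines of numpy: each discriminator in the ensemble sees only short windows of the waveform at its own scale, never the whole utterance. The window sizes and counts below are illustrative, not the paper’s exact values:

```python
import numpy as np

def random_windows(audio, sizes=(240, 480, 960), per_size=2, rng=None):
    """Sample fixed-length windows at several scales from a waveform,
    mimicking how GAN-TTS feeds its discriminator ensemble random
    windows of different sizes instead of the full utterance."""
    rng = rng or np.random.default_rng()
    windows = []
    for size in sizes:
        for _ in range(per_size):
            start = rng.integers(0, len(audio) - size)
            windows.append(audio[start : start + size])
    return windows

audio = np.random.default_rng(0).normal(size=24000)  # 1 s at 24 kHz
wins = random_windows(audio)
print([len(w) for w in wins])  # [240, 240, 480, 480, 960, 960]
```

Evaluating many small windows is cheaper than judging whole clips and forces realism at multiple time scales at once.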
GANs, being parallelisable, are a much better option for generating audio from text than WaveNet, which depends largely on the sequential generation of one audio sample at a time, a property undesirable for present-day applications.
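The contrast can be made concrete with two toy generators in numpy: an autoregressive model must loop sample by sample because each output feeds the next step, while a feed-forward generator produces every sample from its input in one parallel pass. Neither is the actual WaveNet or GAN-TTS architecture; they only stand in for the two generation patterns:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16000  # one second of audio at 16 kHz

def autoregressive(n, coeff=0.9):
    """WaveNet-style pattern: sample i depends on sample i-1,
    so the n steps cannot be parallelised across time."""
    out = np.empty(n)
    prev = 0.0
    for i in range(n):
        prev = coeff * prev + rng.normal()
        out[i] = prev
    return out

def feedforward(n, latent_dim=64):
    """GAN-TTS-style pattern: all n samples come from the latent input
    at once; one matrix product stands in for the conv generator."""
    z = rng.normal(size=latent_dim)
    W = rng.normal(size=(n, latent_dim)) / np.sqrt(latent_dim)
    return W @ z

print(autoregressive(n).shape, feedforward(n).shape)
```

Both produce one second of samples, but only the feed-forward version maps onto parallel hardware, which is why GAN-based synthesis is attractive for low-latency applications.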
Many deep learning techniques for speech synthesis use variants of other fundamental models such as RNNs and CNNs, and now even GANs are being used to generate audio. These techniques have the potential to revolutionise products ranging from aids for the visually impaired to automated music generation, and from media editing to customer service.