It is not hard to find an audio clip of Bill Gates. As Microsoft’s co-founder and one of the richest men ever, he has long been a fixture on television. But if you listen to the clips below, you may be surprised to learn that Gates never actually said any of what the voice is saying.
Audio1:
Audio2:
The voices you just heard are clones of Gates’s voice, generated by an AI-powered speech system developed by Facebook researchers. And we have to agree: it sounds just like Gates.
What Did The Researchers Do?
Two Facebook engineers, Sean Vasquez and Mike Lewis, found a way to take text-to-speech systems to a new level with MelNet, a system built using AI and machine learning. It clones a voice almost exactly, and it has also generated convincing audio clips in the voices of a handful of other famous personalities, such as Jane Goodall and Stephen Hawking. The sample audios can be found here.
How Did The Engineers Do It?
MelNet is a generative model for audio in the frequency domain. It is designed to generate high-fidelity audio samples that capture structure at timescales that time-domain models have yet to achieve. Simply put, it can grasp the subtle consistencies in a speaker’s voice that are sometimes almost impossible for the human ear to pick out.
According to the research paper by Vasquez and Lewis, the time axis of a spectrogram is orders of magnitude more compact than that of an audio waveform. The algorithm takes advantage of this density to produce more consistent voices. However, the system has limitations too: it cannot reproduce the way a person’s voice changes over time, so it can only mimic the voice it was fed during the training phase.
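To make the compactness point concrete, here is a minimal sketch that compares the number of time steps in a raw waveform with the number of frames in its spectrogram. It is not MelNet's code: it uses a plain short-time Fourier transform rather than a mel-scaled spectrogram, and the sample rate, frame length, hop size, and the `stft_magnitude` helper are all assumptions chosen for illustration.

```python
import numpy as np

def stft_magnitude(signal, frame_len=1024, hop=256):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform.

    frame_len and hop are illustrative defaults, not values from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only non-negative frequencies: frame_len // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, axis=1))

# Two seconds of a 440 Hz tone at an assumed 22,050 Hz sample rate.
sample_rate = 22050
duration_s = 2
t = np.arange(sample_rate * duration_s) / sample_rate
waveform = np.sin(2 * np.pi * 440.0 * t)

spec = stft_magnitude(waveform)

print("waveform time steps:   ", len(waveform))    # 44100
print("spectrogram shape:     ", spec.shape)       # (169, 513) = (frames, bins)
print("time-axis reduction: ~%dx" % (len(waveform) // spec.shape[0]))
```

With these settings, a model stepping through the spectrogram sees a few hundred frames where a waveform model would see tens of thousands of samples. The saving is along the time axis only: each frame still holds several hundred frequency values, so the total amount of data is not smaller.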
Apart from generating realistic voices, MelNet is also capable of generating music. The idea behind MelNet was to address challenges such as reducing information loss and producing high-fidelity audio.
It is interesting to note that researchers trained MelNet using a number of data sets including voice recordings of thousands of TED Talks.
Deep Voices Of The Past
This latest innovation may sound intriguing, but it is not the first time AI and machine learning have been used to produce realistic voices.
Progress on these use cases started with the unveiling of SampleRNN and WaveNet. Over time, as machine learning matured, the results improved, and DeepMind’s WaveNet went on to power the voice of Google Assistant.
Chinese tech giant Baidu has also created software that can clone anyone’s voice. One of its major breakthroughs is that it needs just 3.7 seconds of audio to perform the task. That is not all: the software can even change a male voice to a female one, or alter the accent and speaking style, while keeping the voice recognizably the same.
Wrapping Up
Even though MelNet marks yet another advance in AI-generated voice, it could also be used as a weapon in cyber-attacks. Over the years, the world has witnessed some of the biggest technological breakthroughs and also some of the worst setbacks, mostly in the form of cyber-attacks. While these breakthroughs may seem exciting and intriguing now, they pose significant and rarely discussed challenges. Today, many devices come with voice-based security features, and if voice-cloning technologies are used unethically, they can cause serious damage.