The use of video call services has increased tremendously over the past month amid the COVID-19 pandemic. Services offered by companies like Google, Zoom and Slack have been the pick of many organizations. This increased demand also brings with it a slew of challenges.
There have been complaints of low-quality audio and video, and even some privacy issues, especially in the case of Zoom. Google has a handful of decent platforms like Hangouts and Duo, which are widely used. To offer seamless service to Duo users, the Google AI team has adjusted Duo’s architecture by including a variant of DeepMind’s WaveRNN.
Overview Of WaveRNN
via paper by DeepMind
In this work, the authors describe a set of techniques for reducing sampling time while maintaining high output quality. As illustrated above, WaveRNN is a single-layer recurrent neural network with a dual softmax layer. The compact form of the network makes it possible to generate 24 kHz 16-bit audio 4× faster than real time on a GPU.
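The dual softmax layer can be illustrated with a short sketch. In the WaveRNN paper, each 16-bit sample is split into 8 coarse (high) bits and 8 fine (low) bits, and each part is predicted by its own softmax over 256 values. The function names and the offset used to map signed samples to unsigned integers are assumptions made for this example, not details from the paper.

```python
from typing import Tuple


def split_sample(sample: int) -> Tuple[int, int]:
    """Split a signed 16-bit sample into 8-bit coarse and fine parts."""
    if not -32768 <= sample <= 32767:
        raise ValueError("sample is outside the signed 16-bit range")
    unsigned = sample + 32768      # assumed mapping to the range 0..65535
    coarse = unsigned >> 8         # high 8 bits, 0..255
    fine = unsigned & 0xFF         # low 8 bits, 0..255
    return coarse, fine


def join_sample(coarse: int, fine: int) -> int:
    """Recombine coarse and fine parts into a signed 16-bit sample."""
    return ((coarse << 8) | fine) - 32768


coarse, fine = split_sample(-12345)
restored = join_sample(coarse, fine)   # equals the original sample
```

Predicting two 256-way distributions instead of one 65,536-way distribution is what keeps the output layer small.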
Although a mobile CPU offers less computation and memory bandwidth than a GPU, the researchers found in their WaveRNN benchmarks that the resources of off-the-shelf mobile CPUs are sufficient for real-time, high-quality on-device audio synthesis.
How Google Duo Leverages DeepMind’s Tech
The audio from any call made over the internet is transmitted in the form of packets. These packets, or small chunks of data, are sent in continuous streams of audio and video. Low-quality audio or video on the receiving end mainly occurs when these packets do not arrive in the order in which they were sent. This is when we see the occasional glitch in our video calls. Real-time transfer of the packets is crucial for the video to stay in sync with the audio.
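The reordering problem described above can be sketched in a few lines. This is a minimal illustration that assumes each packet carries a sequence number. The `reorder` function and the packet layout are hypothetical and do not describe Duo's actual implementation.

```python
from typing import Dict, List, Optional, Tuple


def reorder(packets: List[Tuple[int, bytes]],
            first_seq: int, count: int) -> List[Optional[bytes]]:
    """Put packets back in sequence order; None marks a missing packet."""
    by_seq: Dict[int, bytes] = {}
    for seq, payload in packets:
        by_seq.setdefault(seq, payload)   # keep the first copy, drop duplicates
    return [by_seq.get(seq) for seq in range(first_seq, first_seq + count)]


# Packets arrive out of order, and packet 1 never arrives.
received = [(2, b"c"), (0, b"a"), (3, b"d")]
ordered = reorder(received, first_seq=0, count=4)
missing = [i for i, payload in enumerate(ordered) if payload is None]
```

The positions listed in `missing` are the gaps that a receiver has to conceal in some way, for example with generated audio.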
To address these issues, Google has introduced WaveNetEQ, a generative model based on DeepMind’s WaveRNN technology, which is now deployed on Duo. The model consists of two parts, a conditioning network and an autoregressive network. The conditioning network receives non-textual information, such as intonation or pitch, directly as input.
The autoregressive network, as shown above, makes sure that the signal is continuous. It also provides the short-term and mid-term structure of the speech by having each generated sample depend on the network’s previous outputs.
In effect, the conditioning network steers the autoregressive network toward waveforms that are consistent with the more slowly moving input features.
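During training, the autoregressive network receives the actual previous sample from the training data as input for the next step, rather than the last sample it produced itself. This process, called teacher forcing, allows the model to learn useful information even in the initial phase of training, when its own predictions are still poor. Once the model is deployed, teacher forcing only kickstarts generation by warming the model up on real audio; after that, the model’s own output is passed back as input for the next step.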
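To make the idea concrete, the sketch below contrasts teacher forcing, where each step is fed the true previous sample, with free-running generation, where the model is fed its own previous output. The `toy_model` function is a stand-in invented for this example. It is not WaveNetEQ, and the single conditioning value is a simplification.

```python
from typing import Callable, List

# (previous sample, conditioning value) -> predicted next sample
Model = Callable[[float, float], float]


def teacher_forced(model: Model, targets: List[float], cond: float) -> List[float]:
    """Predict each sample while feeding the true previous sample."""
    preds = []
    prev = 0.0
    for target in targets:
        preds.append(model(prev, cond))
        prev = target                 # ground truth, not the prediction
    return preds


def free_running(model: Model, warmup: List[float], steps: int, cond: float) -> List[float]:
    """Start from real audio, then feed the model its own output."""
    # A recurrent model would also update its hidden state during warm-up.
    prev = warmup[-1] if warmup else 0.0
    out = []
    for _ in range(steps):
        prev = model(prev, cond)      # the output becomes the next input
        out.append(prev)
    return out


def toy_model(prev: float, cond: float) -> float:
    """Hypothetical stand-in: decay the previous sample, add conditioning."""
    return 0.5 * prev + cond


forced = teacher_forced(toy_model, [1.0, 2.0, 4.0], cond=0.0)
generated = free_running(toy_model, warmup=[4.0], steps=3, cond=0.0)
```

In the teacher-forced loop an early mistake cannot compound, because the next input always comes from the data. In the free-running loop every output depends on the outputs before it.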
On a Duo call, the model is applied in real time to the buffered audio. Whenever a packet loss event occurs, the synthetic audio it generates is merged with the real audio stream. For the merge to be seamless, the model generates slightly more output than is required and then fades from one stream to the other, ensuring a smooth transition.
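The fade between synthetic and real audio can be sketched as a simple cross-fade. This is a minimal example that assumes a linear fade over a fixed overlap. The `crossfade` function and its parameters are hypothetical, and the article does not describe the fade Duo actually uses.

```python
from typing import List


def crossfade(synthetic: List[float], real: List[float], overlap: int) -> List[float]:
    """Join synthetic audio to resumed real audio with a linear cross-fade.

    The last `overlap` samples of `synthetic` are blended with the first
    `overlap` samples of `real`.
    """
    if overlap < 0 or overlap > min(len(synthetic), len(real)):
        raise ValueError("overlap must fit inside both signals")
    start = len(synthetic) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)   # weight of the real signal, rising toward 1
        blended.append((1.0 - w) * synthetic[start + i] + w * real[i])
    return synthetic[:start] + blended + real[overlap:]


# The synthetic stream runs three samples past the point where real audio resumes.
merged = crossfade([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], overlap=3)
```

The extra synthetic samples are what provide the overlap region. Without them the output would jump from one stream to the other at a single sample.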
Another significant advantage of WaveNetEQ is its ability to perform well on a mobile device.
Duo calls are end-to-end encrypted, and all the processing is done on the user’s device. Since the WaveNetEQ model is fast enough to run on a phone while still providing state-of-the-art audio quality, it is a good fit for Duo.
WaveNetEQ is already available in all Duo calls on Pixel 4 phones and is now being rolled out to additional models.