The world has undoubtedly become a small place, thanks to advancements in communication. But our communication systems tend to fall short at times, with poor network coverage, low bandwidth, or high traffic being the major causes of disruption. To remedy this, Google has introduced Lyra — a high-quality, very low-bitrate speech codec for making voice communication available even on the slowest of networks.
What Is Lyra?
Popular real-time communication frameworks such as WebRTC rely on codecs to encode and decode signals for transmission or storage. Codecs compress heavy data, both audio and video, so it can be transferred efficiently. The amount of data encoded per unit of time is the bitrate.
However, codecs struggle to support high-quality, low-latency communication using less data in real time. Though it might seem counterintuitive, high-quality speech codecs require a higher bitrate than most modern video codecs. A lower bitrate for an audio codec results in less intelligible speech with a robotic texture.
Lyra is a novel method for compressing and transmitting voice signals. To build it, the researchers combined traditional codec techniques with the latest machine learning methods, using models trained on vast amounts of data.
Credit: Google AI Blog
Lyra extracts features, or distinctive speech attributes (a list of numbers representing the speech energy in different frequency bands, called log mel spectrograms), from the input every 40ms and compresses them before transmission. At the receiving end, a generative model converts the features back into a speech signal.
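The feature-extraction step described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not Lyra's actual code: the function names, the 40-filter mel setting, and the 16 kHz sample rate are assumptions for the sketch.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters evenly spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def log_mel_features(frame, sample_rate=16000, n_mels=40):
    """One 40 ms speech frame -> log mel energies (the 'list of
    numbers representing speech energy in different frequency bands')."""
    n_fft = len(frame)
    # Power spectrum of the windowed frame
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(n_fft))) ** 2
    fb = mel_filterbank(n_mels, n_fft, sample_rate)
    return np.log(fb @ spectrum + 1e-6)

# 40 ms of audio at 16 kHz = 640 samples
frame = np.random.randn(640)
feats = log_mel_features(frame)
print(feats.shape)  # (40,)
```

The 40-dimensional vector produced per frame is what the encoder would quantise and transmit in place of the raw waveform.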
Lyra’s new and improved ‘natural-sounding’ generative models maintain a low bitrate while achieving high quality, generally on par with the state-of-the-art waveform codecs used in streaming platforms.
However, one drawback of these generative models is their computational complexity. To overcome this, Lyra uses a cheaper variant of WaveRNN, a recurrent generative model. It works at a lower sample rate but generates multiple signals in parallel, each in a different frequency band; these signals are then combined into a single signal at the desired sample rate. As a result, Lyra runs on cloud servers as well as mid-range phones, with a processing latency of 90ms. As per Google’s blog, the generative model is trained on thousands of hours of speech data and optimised to reproduce the audio accurately.
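The recombination idea — N parallel low-rate signals merged into one full-rate signal — can be illustrated with a toy frequency-domain sketch. This is not Lyra's actual filterbank synthesis; the function name, band count, and rates are assumptions for illustration only.

```python
import numpy as np

def combine_subbands(subbands):
    """Toy illustration: each subband signal carries one slice of the
    spectrum. Stacking the slices and inverse-transforming yields a
    signal at N times the subband sample rate."""
    # Keep the lower half of each subband's spectrum (its band content)
    slices = [np.fft.rfft(b)[:len(b) // 2] for b in subbands]
    # Stack the slices into one full-bandwidth spectrum
    full = np.concatenate(slices + [np.zeros(1)])
    return np.fft.irfft(full)

# Four parallel signals at 4 kHz rate -> one signal at 16 kHz rate
bands = [np.random.randn(160) for _ in range(4)]
out = combine_subbands(bands)
print(len(out))  # 640, i.e. 4x the subband length
```

The point of the design is that each parallel generator only has to run at a quarter of the output rate, which is what makes real-time synthesis feasible on mid-range phones.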
In its current form, Lyra operates at 3kbps and can be used where bandwidth conditions are insufficient for higher bitrates. Compared to other codecs in the same category, Lyra offers more than a 60 percent reduction in bandwidth, with quality comparable to Opus, a popular codec, operating at 8kbps.
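The bandwidth figure is easy to verify with back-of-the-envelope arithmetic, using the 3kbps and 8kbps rates quoted above:

```python
# Lyra at 3 kbps vs Opus at 8 kbps, as quoted in the article
lyra_kbps, opus_kbps = 3, 8
reduction = (opus_kbps - lyra_kbps) / opus_kbps
print(f"{reduction:.1%}")  # 62.5%, i.e. "more than 60 percent"
```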
Google trained Lyra on thousands of hours of audio from speakers of over 70 languages, using open-source audio libraries, and then verified the audio quality with expert and crowdsourced listeners. Lyra was conceived to ensure universally accessible, high-quality audio experiences, a Google spokesperson said.
As per Google, Lyra gives users in emerging markets access to an efficient low-bitrate codec. Lyra can also be used in cloud environments, meaning users with different networks and device capabilities can chat seamlessly. Combined with video compression techniques such as AV1, Lyra can also enable video calls over the internet.
Google is looking at acceleration via GPUs and TPUs to optimise Lyra’s performance and quality. “We are also beginning to research how these technologies can lead to a low-bitrate general-purpose audio codec (i.e., music and other non-speech use cases),” Google said in the blog post.