“Anytime you’re listening to high-quality audio, you’re likely using Dolby,” declared Vivek Kumar, who heads the AI team at Dolby Labs. Speaking at the PyTorch DevCon event late last year, Kumar briefly spoke about how PyTorch has become the go-to tool for deep learning-based audio research. According to Kumar, there are nearly 11 billion devices that use Dolby services. Let us take a look at how PyTorch became the tool of choice for a domain as ambitious, and as personal, as audio.
Why PyTorch Has An Edge
The main advantage often credited to PyTorch is its flexibility. Users can write models in native Python, with ordinary control flow such as conditionals and loops, and can change training behaviour during the actual run. This is unlike TensorFlow’s original static-graph approach, where the user has to define the whole computation graph upfront and then run the program. Along with flexibility, PyTorch also offers a better debugging experience, since standard Python tools work directly on the model code.
In the talk he gave at the PyTorch conference, Kumar shed some light on what makes PyTorch great. He listed dynamic graphs as a key feature of PyTorch, along with tremendous support from the community.
Dynamic graphs play a crucial role especially for recurrent neural nets, which are widely used for audio synthesis: they make it straightforward to work with variable sequence lengths, which are tedious to handle with static graphs.
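As a minimal sketch of this point, the snippet below runs a PyTorch LSTM over a batch of sequences with different lengths by packing them, so the network skips the padding. The feature dimensions, sequence lengths, and hidden size are illustrative choices, not values from Dolby’s systems:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

# A hypothetical batch of audio feature sequences with different lengths
# (50, 35, and 20 frames of 40-dimensional features).
seqs = [torch.randn(n, 40) for n in (50, 35, 20)]
lengths = torch.tensor([s.size(0) for s in seqs])

# Pad to a common length, then pack so the LSTM ignores the padding.
padded = pad_sequence(seqs, batch_first=True)          # shape (3, 50, 40)
packed = pack_padded_sequence(padded, lengths, batch_first=True)

lstm = nn.LSTM(input_size=40, hidden_size=64, batch_first=True)
_, (h_n, _) = lstm(packed)   # final hidden state per sequence
print(h_n.shape)             # torch.Size([1, 3, 64])
```

Because the graph is built on the fly at each forward pass, each sequence can take a different number of recurrent steps without any special graph surgery.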
With a dynamic framework, one need not write the backward pass for custom layers, thanks to PyTorch’s automatic differentiation engine, Autograd. This is a handy feature, since writing the backward pass of networks such as LSTMs is quite tricky, and one can easily run into errors.
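To illustrate, here is a hypothetical custom gated layer where only the forward computation is written; Autograd derives the gradients automatically. The layer itself (a GLU-style gate) is an illustrative example, not one taken from the talk:

```python
import torch
import torch.nn as nn

class GatedLinear(nn.Module):
    """A custom layer: only forward() is defined, no backward pass."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, 2 * dim)

    def forward(self, x):
        a, b = self.linear(x).chunk(2, dim=-1)
        return a * torch.sigmoid(b)   # GLU-style gating

layer = GatedLinear(16)
x = torch.randn(4, 16, requires_grad=True)
loss = layer(x).sum()
loss.backward()                       # no hand-written gradients needed
print(x.grad.shape)                   # torch.Size([4, 16])
```

Autograd records the operations in forward() and replays them in reverse to compute exact gradients, which is exactly what one avoids hand-deriving for structures like LSTM cells.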
SpeechBrain, the project that powers Dolby’s deep learning efforts, is built on top of PyTorch. Launched late last year, SpeechBrain aims to build a single flexible platform that incorporates and interfaces with the popular frameworks used for audio synthesis, including systems for speech recognition (both end-to-end and HMM-DNN), speaker recognition, speech separation, multi-microphone signal processing (e.g., beamforming), self-supervised and unsupervised learning, and speech contamination/augmentation, among many others.
The audio AI community’s reliance on PyTorch can be summarised as follows:
- Well-designed, flexible, popular, and well-documented toolkit with a very large community
- Natural implementation of many speech applications that rely on deep learning and signal processing techniques
- End-to-end design of differentiable systems with great feasibility for tasks such as joint training, multi-task learning, and cooperative learning
The breadth of options available for audio processing can be verified by taking a look at the functions of a single package, ‘torchaudio’:
torchaudio leverages PyTorch’s GPU support and provides many tools to make data loading easy and more readable.
torchaudio supports a growing list of transformations.
- Resample for resampling waveform to a different sample rate
- Spectrogram can be called to create a spectrogram from a waveform
- ComplexNorm is used to compute the norm of a complex tensor
- AmplitudeToDB can be used to turn a spectrogram from the power/amplitude scale to the decibel scale
- MFCC allows one to create the Mel-frequency cepstrum coefficients from a waveform
- MelSpectrogram can be used to create mel spectrograms from a waveform using the STFT function in PyTorch
- TimeStretch for stretching a spectrogram in time without modifying pitch for a given rate
According to the PyTorch team, torchaudio aims to apply PyTorch to the audio domain. It provides strong GPU acceleration, with a focus on trainable features through the autograd system and on user-friendly tensor and dimension names. It is therefore primarily a machine learning library, not a general signal processing library. The benefits of PyTorch can be seen in torchaudio through having all the computations go through PyTorch operations, which makes it easy to use and feel like a natural extension.
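The “trainable features” point can be shown with a minimal sketch using plain PyTorch ops: because a spectrogram is computed with ordinary differentiable operations, gradients flow all the way back to the waveform. The sizes below are illustrative:

```python
import torch

# A random waveform that we treat as a learnable quantity.
waveform = torch.randn(4000, requires_grad=True)

# Power spectrogram built from ordinary PyTorch ops (STFT + abs + square).
spec = torch.stft(waveform, n_fft=256, hop_length=128,
                  window=torch.hann_window(256),
                  return_complex=True).abs() ** 2

spec.sum().backward()              # differentiate through the spectrogram
print(waveform.grad is not None)   # True
```

This is what makes feature extraction itself trainable: a loss defined on the spectrogram can update anything upstream of it, including learned front-ends.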