Facebook’s open-source machine learning framework PyTorch recently announced the launch of v1.10. The new version is composed of over 3,400 commits since 1.9, made by 426 contributors. The update focuses on improving training and performance, as well as developer usability.
In June 2021, PyTorch released v1.9, with improvements in torch.linalg, torch.special, and Complex Autograd, along with Mobile Interpreter, TorchElastic, the PyTorch RPC framework, APIs for model inference deployment, and PyTorch Profiler.
Here are key highlights of v1.10:
- CUDA Graphs APIs are integrated to reduce CPU overheads for CUDA workloads
- Several frontend APIs such as FX, torch.special, and nn.Module Parametrisation have been moved from beta to stable
- Support for automatic fusion in JIT Compiler expands to CPUs in addition to GPUs
- Android NNAPI support is now available in beta
CUDA Graphs APIs Integration (Beta)
PyTorch now integrates the CUDA Graphs APIs to reduce CPU overheads for CUDA workloads. CUDA Graphs greatly reduce the CPU overhead for CPU-bound CUDA workloads, which improves performance by increasing GPU utilisation. They also reduce jitter for distributed workloads, and since parallel workloads have to wait for the slowest worker, reducing jitter improves overall parallel efficiency.
The integration allows easy interop between the parts of the network captured by CUDA graphs and the parts that cannot be captured due to graph limitations.
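As a rough illustration of the capture-and-replay pattern, the sketch below records a small computation into a graph with torch.cuda.CUDAGraph and torch.cuda.graph, then replays it on new input data. The function name, tensor names and sizes are hypothetical, and the code assumes a CUDA-capable GPU; on a machine without one it simply returns None.

```python
import torch


def run_with_cuda_graph():
    """Capture `x * 2 + 1` in a CUDA graph and replay it on new data."""
    if not torch.cuda.is_available():
        return None  # CUDA Graphs need a GPU; nothing to demonstrate on CPU

    # Graph replay reuses the same memory, so inputs live in a static buffer.
    static_input = torch.zeros(8, device="cuda")

    # Warm up on a side stream before capture.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        for _ in range(3):
            static_output = static_input * 2 + 1
    torch.cuda.current_stream().wait_stream(side_stream)

    # Capture: kernels launched inside the context are recorded, not executed.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = static_input * 2 + 1

    # Fill the static input with real data, then replay the recorded kernels.
    static_input.copy_(torch.arange(8, dtype=torch.float32, device="cuda"))
    graph.replay()
    return static_output.cpu()


graph_result = run_with_cuda_graph()
```

Replaying launches the whole recorded sequence with a single call, which is where the CPU overhead saving comes from in workloads made of many small kernels.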
Conjugate View (Beta)
PyTorch’s conjugation for complex tensors (torch.conj()) is now a constant-time operation. It returns a view of the input tensor with a conjugate bit set, as can be seen by calling torch.is_conj(). This has already been leveraged in various other PyTorch operations, such as matrix multiplication and dot product, to fuse conjugation with the operation, leading to significant performance gains and memory savings on both CPU and CUDA.
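The behaviour can be seen in a few lines. This is a minimal sketch with made-up tensor values; resolve_conj() is used here to materialise the conjugated values into a regular tensor when that is needed.

```python
import torch

# A small complex tensor with illustrative values.
x = torch.tensor([1 + 2j, 3 - 4j])

# torch.conj() returns a lazy view: no data is copied, only a bit is set.
y = torch.conj(x)
input_has_conj_bit = torch.is_conj(x)
view_has_conj_bit = torch.is_conj(y)

# resolve_conj() materialises the conjugation into a new tensor.
resolved = y.resolve_conj()
resolved_has_conj_bit = torch.is_conj(resolved)
```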
Python Code Transformation with FX
FX offers a Pythonic platform for transforming and lowering PyTorch programmes. For pass writers, this toolkit facilitates Python-to-Python transformation of functions and nn.Module instances. It aims to support a subset of Python language semantics to facilitate ease of implementation of transforms. With the latest update, FX is moving to stable.
Check out FX examples on GitHub.
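To give a flavour of a Python-to-Python transform, here is a minimal sketch that traces a module with torch.fx.symbolic_trace and rewrites every torch.add call into torch.mul. The module and function names are hypothetical and chosen only for illustration.

```python
import torch
import torch.fx


class AddModule(torch.nn.Module):
    """Toy module whose forward pass is a single torch.add call."""

    def forward(self, x, y):
        return torch.add(x, y)


def swap_add_for_mul(module: torch.nn.Module) -> torch.fx.GraphModule:
    """Trace `module` and replace each torch.add node with torch.mul."""
    traced = torch.fx.symbolic_trace(module)
    for node in traced.graph.nodes:
        if node.op == "call_function" and node.target == torch.add:
            node.target = torch.mul
    traced.graph.lint()  # check that the edited graph is still well-formed
    traced.recompile()   # regenerate the Python code for forward()
    return traced


transformed = swap_add_for_mul(AddModule())
product = transformed(torch.tensor(3.0), torch.tensor(4.0))
```

Printing transformed.code shows the regenerated forward function, which is a convenient way to inspect what a pass has done.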
nn.Module parametrisation, a feature that allows users to parametrise any parameter or buffer of an nn.Module without modifying the nn.Module itself, is now available in stable. This release adds weight normalisation (weight_norm), orthogonal parametrisation (matrix constraints and part of pruning) and more flexibility when creating your own parametrisation. See tutorials for more details.
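As a small sketch of a custom parametrisation, the example below constrains the weight of a linear layer to be symmetric using torch.nn.utils.parametrize.register_parametrization. The Symmetric class and the layer size are illustrative choices, not something prescribed by the release.

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize


class Symmetric(nn.Module):
    """Map an unconstrained square matrix to a symmetric one."""

    def forward(self, X):
        # Keep the upper triangle and mirror it below the diagonal.
        return X.triu() + X.triu(1).transpose(-1, -2)


layer = nn.Linear(3, 3)

# The layer's own code is untouched; only its `weight` is re-routed
# through the Symmetric module whenever it is accessed.
parametrize.register_parametrization(layer, "weight", Symmetric())

weight = layer.weight
weight_is_parametrized = parametrize.is_parametrized(layer, "weight")
```

The unconstrained tensor that the optimiser updates remains available as layer.parametrizations.weight.original.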
In the latest PyTorch v1.10, several features are moving from beta to stable in the distributed package. Here are some of the features that are now stable:
- Remote Module: It allows users to operate a module on a remote worker as if it were a local module, with the RPCs transparent to the user.
- DDP Communication Hook: It allows users to override how DDP synchronises gradients across processes.
- ZeroRedundancyOptimizer: It can be used in conjunction with DistributedDataParallel to reduce the size of per-process optimiser states. With this release, it can now handle uneven inputs to different data-parallel workers.
Performance Optimisation and Tooling
Profile-directed typing in TorchScript (Beta)
TorchScript has a hard requirement that source code carry type annotations for compilation to succeed. In the past, trial and error was the only way to add missing type annotations or fix incorrect ones, which was inefficient and time-consuming. With the latest update, PyTorch has enabled profile-directed typing for torch.jit.script by using existing tools like MonkeyType, making the process much easier, faster, and more efficient.
CPU Fusion (Beta)
In PyTorch 1.10, the team has added an LLVM-based JIT compiler for CPUs that can fuse a sequence of torch library calls to improve performance. This is the first time compilation has been brought to CPUs, although the capability has existed on GPUs for some time. Check out the performance results here (Colab notebook).
PyTorch Profiler (Beta)
The main objective of PyTorch Profiler is to identify the execution steps that are the most costly in time and memory, and to visualise the workload distribution between CPUs and GPUs. Here are some of the key Profiler features in PyTorch 1.10:
- Enhanced memory view
- Enhanced automated recommendations
- Enhanced kernel view
- Distributed training
- Correlate operators in the forward and backward pass
- Support for profiling on mobile devices
To get started with new features, check out the tutorials here.
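For orientation, a basic profiling run looks roughly like the sketch below. It profiles a single matrix multiplication on the CPU with memory tracking enabled; the function name, tensor size and sort key are arbitrary choices for illustration, and the newer views listed above are not exercised here.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profile_matmul():
    """Profile one CPU matrix multiplication and return averaged events."""
    x = torch.randn(64, 64)
    with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
        _ = x @ x
    return prof.key_averages()


averages = profile_matmul()

# A text table of the most expensive operators by total CPU time.
summary = averages.table(sort_by="cpu_time_total", row_limit=5)
```

Printing summary gives a quick per-operator breakdown without needing a trace viewer.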
PyTorch Mobile: Android NNAPI Support (Beta)
Last year, PyTorch released prototype support for Android’s Neural Networks API (NNAPI). It allows Android apps to run computationally intensive neural networks on the chips that power mobile phones, including GPUs and NPUs (specialised neural processing units).
Since then, the team has added more op coverage, support for flexible load-time shapes, and the ability to run the model on the host for testing. Check out the tutorial for using this feature.
In addition to this, transfer learning steps have been added to object detection examples.
- TorchX: A new SDK for quickly building and deploying ML applications from research and development to production.
- TorchAudio: The team has added a text-to-speech pipeline, self-supervised model support, multi-channel support and an MVDR beamforming module, an RNN transducer (RNNT) loss function, and batch and filterbank support for the filter function.
- TorchVision: The team has added new RegNet and EfficientNet models, FX-based feature extraction utilities, two new automatic augmentation techniques (Rand Augment and Trivial Augment), and updated training recipes.