As deep learning models grow in size, their memory and compute demands grow with them, and a number of techniques have been developed to train deep neural networks faster. One approach is to use half-precision floating-point numbers (FP16) instead of single precision (FP32). Researchers have since found that using the two together is a smarter choice.
Mixed precision is one such technique: it lets you train with half precision while maintaining the network accuracy achieved with single precision. Because it uses both single- and half-precision representations, it is referred to as the mixed precision technique.
NVIDIA has been developing mixed precision techniques to make the most of its Tensor Cores. Both TensorFlow and PyTorch support mixed precision training, and PyTorch has now introduced native automatic mixed precision training.
Overview Of Mixed Precision
Most deep learning frameworks, including PyTorch, train using 32-bit floating point (FP32) by default. However, FP32 is not always essential to get good results: for many operations, 16-bit floating point is sufficient, while FP32 costs more time and memory. So NVIDIA researchers developed a methodology in which a few operations run in FP32 while the majority of the network runs in 16-bit floating-point (FP16) arithmetic.
With FP16, memory bandwidth and storage requirements can be reduced by up to a factor of two.
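As a quick illustration of that factor of two, PyTorch reports the per-element storage of each dtype directly:

import torch

# FP32 tensors use 4 bytes per element, FP16 tensors use 2 bytes per element.
fp32 = torch.zeros(1024, dtype=torch.float32)
fp16 = torch.zeros(1024, dtype=torch.float16)

print(fp32.element_size())  # 4
print(fp16.element_size())  # 2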
Using mixed precision training requires three steps:
- Convert the model to use the float16 data type.
- Accumulate float32 master weights.
- Preserve small gradient values using loss scaling.
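To see what those three steps involve when done by hand, here is a minimal sketch of manual mixed precision with FP32 master weights and a fixed loss scale. Net, data and loss_fn are placeholder names (the same placeholders used in the snippets below), and the scale value of 1024 is an assumption for illustration:

import torch

scale = 1024.0                                   # assumed fixed loss scale

model = Net().cuda().half()                      # 1. model runs in FP16
# 2. keep an FP32 master copy of the weights for the optimizer to update
master_params = [p.detach().clone().float() for p in model.parameters()]
optimizer = torch.optim.SGD(master_params, lr=0.01)

for input, target in data:
    optimizer.zero_grad()
    output = model(input.half().cuda())
    loss = loss_fn(output, target.cuda())
    # 3. scale the loss so small gradients survive the FP16 backward pass
    (loss * scale).backward()
    for master, p in zip(master_params, model.parameters()):
        master.grad = p.grad.detach().float() / scale   # unscale into FP32 grads
        p.grad = None
    optimizer.step()                             # update the FP32 master weights
    for master, p in zip(master_params, model.parameters()):
        p.data.copy_(master.data)                # copy updated weights back to FP16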
With NVIDIA's Apex library for PyTorch, these techniques can be enabled with a single line of code:
# Initialising mixed precision with one line of code:
from apex import amp

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
# Here, O1 indicates mixed precision.
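To complete the picture, a typical Apex training loop wraps the backward pass with amp.scale_loss so that loss scaling is handled for you. This is a sketch, assuming model, optimizer, data and loss_fn are already defined:

from apex import amp

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for input, target in data:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    # Apex scales the loss before backward and unscales the gradients afterwards.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()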
The job of ‘amp’ is to check whether a PyTorch function is on its whitelist, its blacklist, or neither. If it is whitelisted, all arguments are cast to FP16; if blacklisted, to FP32; and if neither, the function simply runs in the type of its inputs.
- Whitelist: matrix multiply and convolution functions.
- Blacklist: neural net loss functions like softmax with cross-entropy.
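The same categorisation carries over to PyTorch's native autocast, described below: matrix multiplies come back in FP16, while ops that need FP32's dynamic range stay in FP32. A quick check, assuming a CUDA device is available:

import torch
from torch.cuda.amp import autocast

a = torch.randn(8, 8, device="cuda")
b = torch.randn(8, 8, device="cuda")

with autocast():
    print(torch.mm(a, b).dtype)           # torch.float16 -- "whitelist"-style op
    print(torch.softmax(a, dim=1).dtype)  # torch.float32 -- "blacklist"-style op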
Here’s a snippet of code showing how to use native automatic mixed precision in PyTorch with autocast():
import torch
from torch import optim
from torch.cuda.amp import autocast, GradScaler

# Create model and optimizer
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # Backward ops run in the same precision that autocast used for the corresponding forward ops.
        # scale(loss) multiplies the loss so small FP16 gradients do not underflow.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer’s assigned params,
        # then calls optimizer.step() if no infs/NaNs are found.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
Instances of torch.cuda.amp.autocast enable autocasting for chosen regions, serving as context managers or decorators that allow parts of your script to run in mixed precision. In these regions, CUDA ops run in an op-specific dtype chosen by autocast to improve performance while maintaining accuracy. See the Autocast Op Reference for details.
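As an example of the decorator form, autocast() can be applied to a module's forward method instead of being used as a with block. A sketch with a hypothetical MyModel:

import torch
from torch.cuda.amp import autocast

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    @autocast()
    def forward(self, x):
        # Everything inside forward runs under autocast
        # (casting applies when the module and inputs live on a CUDA device).
        return self.linear(x)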
torch.cuda.amp provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use torch.float16 (half). Some ops, like linear layers and convolutions, are much faster in float16. Other ops, like reductions, often require the dynamic range of float32. Mixed precision tries to match each op to its appropriate datatype.
Ordinarily, “automatic mixed-precision training” uses torch.cuda.amp.autocast and torch.cuda.amp.GradScaler together. However, autocast and GradScaler are modular and may be used separately, if desired.
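For instance, autocast can be used on its own for inference or evaluation, where there are no gradients to scale. A minimal sketch, assuming model and a hypothetical val_data loader are already defined:

import torch
from torch.cuda.amp import autocast

model.eval()
with torch.no_grad():
    for input, target in val_data:
        with autocast():
            output = model(input)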
Key Takeaways
Why train deep neural networks in multiple precisions:
- Make precision decisions per layer or operation
- Full precision (FP32) where needed to maintain task-specific accuracy
- Reduced precision (FP16) everywhere else for speed and scale
By using multiple precisions, we can have the best of both worlds: speed and accuracy.
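To make such a per-layer or per-op precision decision explicit, an autocast region can be locally disabled so that a numerically sensitive block stays in FP32. A sketch, where model_part1, model_part2 and sensitive_op are hypothetical placeholders:

from torch.cuda.amp import autocast

with autocast():
    out = model_part1(input)          # runs in mixed precision
    with autocast(enabled=False):
        # Force FP32 for an op that needs full precision.
        out = sensitive_op(out.float())
    out = model_part2(out)            # back to mixed precision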
Try AMP in PyTorch here.