When OpenAI released GPT-2, its 1.5 billion parameters made it the largest language model of its time. It was soon eclipsed by NVIDIA’s Megatron, which had 8 billion parameters. Last month, Microsoft released Turing NLG, the world’s largest language model, with 17 billion parameters. In terms of hardware, any model with more than 1.3 billion parameters cannot fit into a single GPU (even one with 32GB of memory), so the model itself must be parallelised, or broken into pieces, across multiple GPUs.
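A rough back-of-the-envelope estimate shows why a model of this size strains a 32GB GPU during training. The sketch below assumes mixed-precision training with the Adam optimiser, which is commonly accounted as about 16 bytes of state per parameter (fp16 weights and gradients, plus fp32 master weights, momentum and variance). That figure and the function name `training_state_gb` are assumptions made for illustration, not numbers from the article, and activation memory is ignored entirely.

```python
# Assumed bytes of training state per parameter (mixed precision + Adam):
# 2 (fp16 weights) + 2 (fp16 gradients) + 4 (fp32 master weights)
# + 4 (Adam momentum) + 4 (Adam variance) = 16
ASSUMED_BYTES_PER_PARAM = 16


def training_state_gb(num_params, bytes_per_param=ASSUMED_BYTES_PER_PARAM):
    """Estimate the memory (in GB) for weights, gradients and optimiser state.

    Activations, buffers and memory fragmentation are not included, so the
    real requirement is higher than this estimate.
    """
    return num_params * bytes_per_param / 1e9


# Parameter counts quoted in the article.
models = {
    "1.3B threshold": 1.3e9,
    "1.5B": 1.5e9,
    "Megatron (8B)": 8e9,
    "Turing NLG (17B)": 17e9,
}

for name, params in models.items():
    print(f"{name}: ~{training_state_gb(params):.1f} GB of training state")
```

Under these assumptions, the 1.3-billion-parameter threshold already needs roughly 21GB before any activations are stored, which leaves little headroom on a 32GB card.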
As models improved, they became larger and took more storage space. If you are running a neural network for a computer vision task on your smartphone, it is important to keep its footprint low. To achieve this, compression techniques such as pruning and quantisation were introduced.
However, compression also comes at a cost to how well a model performs. Large models learn well but occupy more space, so there is a trade-off.
To find an efficient training paradigm within this trade-off, researchers at Berkeley AI Research (BAIR) explored the consequences of heavy compression on small and large models.
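To make the two compression techniques mentioned above concrete, here is a minimal sketch in plain Python. It assumes magnitude pruning (zeroing the smallest weights) and symmetric 8-bit linear quantisation. These are common textbook variants chosen for illustration, not necessarily the exact methods used by the researchers, and all function names are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the quantised integers."""
    return [v * scale for v in q]


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
pruned = magnitude_prune(weights, sparsity=0.5)  # drop the 3 smallest weights
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)

print(pruned)    # half of the entries are now exactly zero
print(q)         # small integers that fit in one byte each
print(restored)  # close to `pruned`, but not identical
```

Pruning saves space because zeros can be stored sparsely, while quantisation stores each remaining weight in one byte instead of four. The small gap between `restored` and `pruned` is the kind of precision loss behind the trade-off described above.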
How Much Does Size Matter?
For most training budgets, very large models appear impractical. Instead, the go-to strategy for maximising training efficiency is to use models with small hidden sizes or few layers, because these models run faster and use less memory. As illustrated above, common practice is to train small models until they converge and then apply light compression.
In an optimal approach, a large model is compressed heavily at the end. But how does this impact the performance of the model?
A team from Johns Hopkins University, in a paper submitted to ICLR 2020, examined how compression, and pruning in particular, affects BERT. Their results showed that high levels of pruning, or heavy compression, degrade the model.
The work by the BAIR researchers not only investigates the relationship between size, compression and performance but also proposes an approach for dealing with large models.
As shown above, larger models (blue) achieve a better BLEU score, the metric used to measure the performance of translation models (higher is better in the plot). Large is, indeed, better. Training large models yields better results, but inference takes longer. So the researchers compressed the models, and did so heavily. For compression, they chose pruning and quantisation; the results can be seen in the plot below.
When the RoBERTa models were pruned and quantised, the larger models held up well. As can be seen on the left side of the image above, as the larger 24-layer model (orange) is pruned, moving from right to left (decreasing parameters), its validation accuracy stays above that of the smaller 6-layer model (pink) by a large margin.
Counterintuitively, the most compute-efficient training strategy is to train extremely large models but stop after a small number of iterations.
Even though large models appear less efficient during inference, the authors observe that these models are more robust to compression. Therefore, they concluded that the best strategy for resource-constrained training is to train large models and then heavily compress them.
The authors, in their work, recommend the following:
- Train large and compress heavily.
- For machine translation, wider models outperform deeper models. So, increase width before going deeper.
- Increase model size, not batch size.
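When weighing width against depth, it helps to see how each choice changes the parameter count. The sketch below uses a common approximation for a Transformer layer: four d × d attention projections plus a feed-forward block with an inner size of 4d, ignoring embeddings, biases and layer norms. The function name and the example sizes are assumptions for illustration and do not come from the paper. The sketch shows only how model size scales, not which configuration translates better.

```python
def transformer_params(num_layers, hidden_size, ffn_multiplier=4):
    """Approximate weight count of a Transformer stack (assumed layout).

    Per layer: 4 * d^2 for the query, key, value and output projections,
    plus 2 * ffn_multiplier * d^2 for the two feed-forward matrices.
    """
    attention = 4 * hidden_size * hidden_size
    feed_forward = 2 * ffn_multiplier * hidden_size * hidden_size
    return num_layers * (attention + feed_forward)


base = transformer_params(num_layers=6, hidden_size=512)
wider = transformer_params(num_layers=6, hidden_size=1024)   # double the width
deeper = transformer_params(num_layers=12, hidden_size=512)  # double the depth

print(f"base:   {base:,}")
print(f"wider:  {wider:,} ({wider // base}x base)")
print(f"deeper: {deeper:,} ({deeper // base}x base)")
```

Under this approximation, doubling the width multiplies the parameter count by four, whereas doubling the depth only doubles it, so widening uses up a parameter budget much faster than deepening.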