Are Larger Models Better For Compression

When OpenAI released GPT-2, its 1.5 billion parameters made it the biggest language model at the time. It was soon eclipsed by NVIDIA’s Megatron, which had 8 billion parameters. Last month, Microsoft released Turing NLG, the world’s largest language model, with 17 billion parameters. In terms of hardware, any model with more than 1.3 billion parameters cannot fit into a single GPU (even one with 32GB of memory), so the model itself must be parallelised, or broken into pieces, across multiple GPUs.
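To get a feel for why models of this size strain a single GPU, here is a rough back-of-the-envelope sketch. The figure of 16 bytes per parameter is an assumption, not something stated by the vendors: it corresponds to fp32 weights, fp32 gradients and two fp32 Adam moment estimates. Activations, which take up much of the remaining memory during training, are left out, so the numbers are a lower bound. The function name is hypothetical.

```python
def training_memory_gb(num_params, bytes_per_param=16):
    """Rough memory for weights, gradients and optimizer state, in GB (10^9 bytes).

    bytes_per_param=16 is an ASSUMPTION: 4 (fp32 weight) + 4 (fp32 gradient)
    + 8 (two fp32 Adam moment estimates). Activations are NOT included.
    """
    return num_params * bytes_per_param / 1e9


# Parameter counts as quoted in the article.
models = [
    ("1.3B threshold", 1.3e9),
    ("GPT-2", 1.5e9),
    ("Megatron", 8e9),
    ("Turing NLG", 17e9),
]

for name, n in models:
    print(f"{name}: ~{training_memory_gb(n):.1f} GB before activations")
```

Under these assumptions, a 1.3 billion parameter model already needs about 20.8 GB before a single activation is stored, which leaves little headroom on a 32GB card.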


As models improved, they became larger and took up more storage space. If you run a neural network for a computer vision task on a smartphone, it is important to keep its footprint low. To achieve this, compression techniques such as pruning (removing weights) and quantisation (storing weights at lower precision) were introduced.
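The two techniques can be sketched in a few lines of plain Python. This is a minimal illustration, assuming magnitude pruning (zeroing the smallest weights) and symmetric uniform quantisation; it is not the exact procedure used in any of the papers discussed here, and the function names and example weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    k = int(len(weights) * sparsity)  # number of weights to remove
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantise_uniform(weights, bits):
    """Snap each weight to one of 2^bits - 1 evenly spaced levels (symmetric)."""
    if bits < 2:
        raise ValueError("bits must be at least 2")
    levels = 2 ** (bits - 1) - 1  # largest integer code, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / levels  # size of one quantisation step
    return [round(w / scale) * scale for w in weights]


# Hypothetical weights, for illustration only.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
pruned = magnitude_prune(weights, sparsity=0.5)
compressed = quantise_uniform(pruned, bits=4)
print(pruned)
print(compressed)
```

Pruning reduces the number of non-zero weights that must be stored, while quantisation reduces the number of bits spent on each remaining weight; the two are often combined.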

However, compressing models also comes at a cost to their quality. Large models learn well but occupy more space, so there is a trade-off.

To find an efficient training paradigm, researchers at Berkeley AI Research (BAIR) explored the consequences of heavy compression on small and large models.

How Much Does The Size Matter

For most training budgets, very large models appear impractical. Instead, the go-to strategy for maximising training efficiency is to use models with small hidden sizes or few layers, because these models run faster and use less memory. Common practice is to train small models until they converge and then compress them lightly.

In an optimal approach, a large model is compressed heavily at the end. But how does this impact the performance of the model?

A team from Johns Hopkins University, in a paper submitted to ICLR 2020, examined how compression, especially pruning, can impact BERT. Their results showed that high levels of pruning, or heavy compression, degrade the model.

The work by BAIR researchers not only investigates the relation between size, compression and performance but also proposes an approach for dealing with large models.

via BAIR

As shown above, larger models (blue) achieve a better BLEU score, the metric used to measure the performance of translation models (higher is better in the plot). Larger is indeed better. Training large models yields better results, but inference time increases. So the researchers decided to compress the models, and they did so heavily. For compression, they chose pruning and quantisation, and the results can be seen in the plot below.

When the RoBERTa models underwent pruning and quantisation, the larger models performed well. As can be seen above on the left side of the image, when the larger 24-layer model (orange) is pruned, that is, moving from right to left (decreasing parameters), its validation accuracy exceeds that of the smaller 6-layer model (pink) by a huge margin.

The most compute-efficient training strategy is, counterintuitively, to train extremely large models but stop after a small number of iterations.

Even though large models appear less efficient during inference, the authors observe that these models are more robust to compression. Therefore, they concluded that the best strategy for resource-constrained training is to train large models and then heavily compress them. 

Key Takeaways

The authors, in their work, recommend the following:

  • Train large and compress heavily.
  • For machine translation, wider models outperform deeper models. So, increase width before going deeper.
  • Increase model size, not batch size.
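To see what "wider" versus "deeper" means for model size, the sketch below uses the common approximation that a Transformer has about 12 × layers × hidden² non-embedding parameters. This assumes a standard block with a feed-forward width of four times the hidden size; the layer counts and hidden sizes are hypothetical and are not the configurations from the BAIR paper. The sketch only counts parameters and says nothing about accuracy.

```python
def transformer_params(num_layers, hidden_size):
    """Approximate non-embedding parameter count: 12 * layers * hidden^2.

    ASSUMPTION: a standard Transformer block, with 4*d^2 parameters in the
    attention projections and 8*d^2 in a feed-forward layer of width 4*d.
    Biases, layer norms and embeddings are ignored.
    """
    return 12 * num_layers * hidden_size ** 2


# Hypothetical configurations, for illustration only.
base = transformer_params(num_layers=6, hidden_size=512)
deeper = transformer_params(num_layers=12, hidden_size=512)
wider = transformer_params(num_layers=6, hidden_size=1024)

print(f"base:   {base:,}")
print(f"deeper: {deeper:,} ({deeper // base}x base)")
print(f"wider:  {wider:,} ({wider // base}x base)")
```

Under this approximation, doubling the depth doubles the parameter count, while doubling the width quadruples it, so the two ways of growing a model are not equivalent in size.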

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
