
Does Neural Network Compression Impact Transfer Learning?


The popularity of compression techniques has grown with the increasing size of machine learning models, which have ballooned to billions of parameters. Compression comes in handy when a model has to be downloaded and run on a smartphone or another device with limited memory and compute.

Compression usually involves ditching the unnecessary.

Compression techniques mainly fall into the following types:

1. Parameter Pruning And Sharing

  • Reduces redundant parameters that are not sensitive to performance
  • Robust to various settings
  • Redundancies in the model parameters are explored, and the redundant, non-critical ones are removed.

2. Low-Rank Factorisation

  • Uses matrix decomposition to estimate the informative parameters of the deep convolutional neural networks

3. Transferred/Compact Convolutional Filters

  • Special structural convolutional filters are designed to reduce the parameter space and save storage/computation

4. Knowledge Distillation

  • A larger network (the teacher) is used to train a more compact network (the student) to reproduce its output; a minimal sketch follows this list.
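To make the last technique concrete, here is a minimal sketch of a distillation loss in PyTorch. The temperature, loss weighting and tensor names are illustrative assumptions, not details from the paper discussed below.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target loss (match the teacher's output distribution)
    with the usual hard-label cross-entropy. Temperature and alpha are
    illustrative choices, not values from the paper."""
    # Softened teacher and student distributions
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the softened distributions, scaled by T^2 as is conventional
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * (temperature ** 2)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Usage (shapes only): logits of shape [batch, num_classes], labels of shape [batch]
```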

Pruning, however, is the most popular of these techniques because it deals with something as fundamental as removing unneeded weights.

On a similar note, a team from Johns Hopkins University, in a paper submitted to ICLR 2020, examines how compression, especially weight pruning, impacts transfer learning. For their experiments, they picked BERT, a popular NLP model.

Based on their results, they conclude that:

  • Low levels of pruning (30-40%) do not affect pre-training loss or transfer to downstream tasks at all.
  • Medium levels of pruning increase the pre-training loss and prevent useful pre-training information from being transferred to downstream tasks. 
  • High levels of pruning additionally prevent models from fitting downstream datasets, leading to further degradation.

How Pruning Impacts BERT

In this work, the authors tried to answer two questions:

  1. Does compressing BERT impede its ability to transfer to new tasks? 
  2. And does fine-tuning make BERT more or less compressible?

But why BERT?

The rationale behind choosing BERT, wrote the authors, is largely due to its widespread usage and the need for compressing BERT in low-resource applications. 

When a weight is close to zero, its input is effectively ignored, which means the weight can be pruned.

Weight pruning is performed in the following steps (a rough sketch follows the list):

  1. Pick a target percentage of weights to prune, say 50%.
  2. Calculate a threshold such that 50% of weight magnitudes fall under it.
  3. Remove the weights below the threshold.
  4. Continue training the network to recover any lost accuracy.
  5. Return to step 1 to increase the percentage of weights pruned.
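Steps 1-3 amount to global magnitude pruning, which can be sketched as below. This is a minimal illustration for a single weight tensor, not the authors' exact implementation; the retraining of step 4 happens outside this function.

```python
import torch

def magnitude_prune(weights: torch.Tensor, target_sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask that zeroes out the smallest-magnitude weights so that
    `target_sparsity` (e.g. 0.5 for 50%) of the entries are removed."""
    # Step 1: target_sparsity is the chosen percentage of weights to prune.
    k = int(target_sparsity * weights.numel())
    if k == 0:
        return torch.ones_like(weights)  # nothing to prune
    # Step 2: the threshold is the k-th smallest weight magnitude.
    threshold = weights.abs().flatten().kthvalue(k).values
    # Step 3: weights at or below the threshold are removed (masked to zero).
    mask = (weights.abs() > threshold).float()
    return mask

# Usage: apply the mask during continued training so pruned weights stay at zero (step 4).
w = torch.randn(768, 768)
mask = magnitude_prune(w, target_sparsity=0.5)
w_pruned = w * mask
```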

Pruning was done on a pre-trained BERT-Base model at sparsity levels from 0% to 90%, with each model gradually pruned to its target sparsity over the first 10k steps of training.

Pre-training was then continued on English Wikipedia to regain any lost accuracy.
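The exact ramp is not spelled out here, but a gradual sparsity schedule of this kind can be sketched as follows; the linear ramp is an assumption for illustration, the paper only states that models are gradually pruned over the first 10k steps.

```python
def sparsity_at_step(step: int, target_sparsity: float,
                     ramp_steps: int = 10_000) -> float:
    """Return the sparsity to enforce at a given training step.
    Sparsity ramps from 0 to `target_sparsity` over the first `ramp_steps`
    steps and is held constant afterwards (linear ramp assumed)."""
    progress = min(step / ramp_steps, 1.0)
    return target_sparsity * progress

# e.g. at step 5,000 a 60%-sparsity run would have pruned 30% of the weights
assert abs(sparsity_at_step(5_000, 0.60) - 0.30) < 1e-9
```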

The authors found that once pruning crosses 40%, BERT's performance degrades. The reason, the authors note, is that pruning deletes pre-training information by setting weights to zero, which prevents the transfer of useful inductive biases.

Maintaining the inductive bias of the model learned during pre-training is the main challenge of compressing pre-trained models.

Pruning also regularises the model by keeping certain weights at zero, which might prevent fitting downstream datasets.

This work concluded that 30-40% of the weights could be discarded without affecting BERT’s universality as they do not encode any useful inductive bias.

The authors are hopeful that this work can be generalised to other language models like GPT-2, XLNet and others.

Key Takeaways

According to the authors’ responses on OpenReview, here are a few highlights from the work:

  • The size of the pre-training dataset is the limiting factor in model compression, which should drive future work towards understanding the nature of that inductive bias.
  • Pruning does not seem practically useful, as we cannot prune beyond 30-40% without losing accuracy.
  • Ablating BERT’s inductive bias affects different tasks at different rates. This provides an additional lens into why language model pre-training helps other tasks, which is particularly interesting to the natural language processing community.