The popularity of compression techniques has grown with the increasing size of machine learning models, which now run to billions of parameters. Compression comes in handy when a model has to be downloaded and run on a resource-constrained device like a smartphone, with limited memory and compute.
Compression usually involves ditching the unnecessary.
Compression techniques are mainly of the following types:
1. Parameter Pruning And Sharing
- Explores redundancies in the model parameters and removes the non-critical, redundant ones that are not sensitive to performance
- Robust to various settings
2. Low-Rank Factorisation
- Uses matrix decomposition to estimate the informative parameters of deep convolutional neural networks
3. Transferred/Compact Convolutional Filters
- Special structural convolutional filters are designed to reduce the parameter space and save storage/computation
4. Knowledge Distillation
- A compact neural network (the student) is trained to reproduce the output of a larger network (the teacher)
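The soft-target idea behind distillation can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper; the function names and the temperature value are ours:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; a higher temperature yields a
    # softer distribution that exposes more of the teacher's "dark
    # knowledge" about non-target classes.
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between the teacher's softened outputs (the soft
    # targets) and the student's softened outputs; the student is
    # trained to minimise this.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))
```

A student whose logits match the teacher's incurs a lower loss than one whose logits disagree, which is what drives the compact network to mimic the larger one.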
Pruning, however, is the most popular of these compression techniques because it deals with something as fundamental as removing unwanted weights.
On a similar note, a team from Johns Hopkins University, in their paper submitted at ICLR 2020, examine how compression, especially weight pruning, impacts transfer learning. For their experiments, they picked BERT, a popular NLP model.
Based on their results, they conclude that:
- Low levels of pruning (30-40%) do not affect pre-training loss or transfer to downstream tasks at all.
- Medium levels of pruning increase the pre-training loss and prevent useful pre-training information from being transferred to downstream tasks.
- High levels of pruning additionally prevent models from fitting downstream datasets, leading to further degradation.
How Pruning Impacts BERT
In this work, the authors tried to answer two questions:
- Does compressing BERT impede its ability to transfer to new tasks?
- And does fine-tuning make BERT more or less compressible?
But why BERT?
The rationale behind choosing BERT, wrote the authors, is largely due to its widespread usage and the need for compressing BERT in low-resource applications.
When a weight is close to zero, its input is effectively ignored, which means the weight can be pruned.
Weight pruning is performed by:
- Picking a target percentage of weights to prune, say 50%.
- Calculating a threshold such that 50% of weight magnitudes fall below it.
- Removing (zeroing out) those weights.
- Continuing training to recover any lost accuracy.
- Returning to step 1 to increase the percentage of weights pruned.
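The thresholding steps above can be sketched in NumPy. This is a minimal illustration of magnitude pruning; the function name and example fraction are ours, not the authors':

```python
import numpy as np

def magnitude_prune(weights, target_fraction=0.5):
    # Steps 1-2: find the magnitude threshold below which
    # `target_fraction` of the weights fall.
    threshold = np.quantile(np.abs(weights).ravel(), target_fraction)
    # Step 3: zero out ("remove") the weights under the threshold.
    # The mask is returned because, in practice, it is kept so the
    # pruned weights stay at zero during continued training (step 4).
    mask = np.abs(weights) >= threshold
    return weights * mask, mask
```

Applying this repeatedly with a growing `target_fraction` gives the iterative prune-and-retrain loop described above.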
Pruning was performed on a pre-trained BERT-Base model at sparsities from 0% to 90%, with each model gradually pruned to its target sparsity over the first 10k steps of training.
This pre-training is continued on English Wikipedia to regain any lost accuracy.
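The gradual ramp to a target sparsity can be sketched as follows. A linear schedule is our assumption for illustration; the paper only states that pruning is gradual over the first 10k steps:

```python
def sparsity_at_step(step, final_sparsity, ramp_steps=10_000):
    # Linearly ramp the pruned fraction from 0 up to `final_sparsity`
    # over the first `ramp_steps` training steps, then hold it fixed
    # while pre-training continues to recover any lost accuracy.
    return final_sparsity * min(step / ramp_steps, 1.0)
```

At each training step, the current value of the schedule is fed to the magnitude-pruning threshold calculation, so the fraction of zeroed weights rises smoothly rather than all at once.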
The authors found that when pruning crosses 40%, the performance of BERT degrades. The reason, the authors note, is that pruning deletes pre-training information by setting weights to zero, which prevents the transfer of useful inductive biases.
Maintaining the inductive bias of the model learned during pre-training is the main challenge of compressing pre-trained models.
Pruning also regularises the model by keeping certain weights at zero, which might prevent it from fitting downstream datasets.
This work concluded that 30-40% of the weights could be discarded without affecting BERT’s universality as they do not encode any useful inductive bias.
The authors are hopeful that this work can be generalised to other language models like GPT-2, XLNet and others.
According to the authors’ responses on OpenReview, here are a few highlights from the work:
- The size of the pre-training dataset is the limiting factor in model compression, which should drive future work towards understanding the nature of that inductive bias.
- Pruning does not seem practically useful as we cannot prune much (30-40%) without losing accuracy.
- Ablating BERT’s inductive bias affects different tasks at different rates. This provides an additional lens into why language model pre-training helps other tasks, which is particularly interesting to the natural language processing community.