The Transformer and its numerous variants achieve excellent performance today in various machine learning applications, including sequence-to-sequence modelling, language modelling and computer vision tasks, and the baseline Transformer remains one of the most common choices for language modelling. Most Transformer architectures are built from a basic Transformer block, used in both the encoder and the decoder, which employs several layers of multi-head attention to perform its task. One of the major differences among Transformer variants is the number of multi-head attention layers they incorporate. Models are scaled wider or deeper, by increasing the units in the hidden layers or by stacking more Transformer blocks respectively, to improve performance. As the number of layers or units increases, so does the number of parameters in the model.
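As a rough illustration of why width is costlier than depth (toy numbers, not figures from the paper), parameter count grows linearly when stacking more blocks but roughly quadratically when widening hidden layers:

```python
def transformer_params(d_model, n_blocks, d_ff_ratio=4):
    """Rough per-block parameter count: four d_model x d_model attention
    projections plus a two-layer feed-forward network (biases and
    embeddings ignored for simplicity)."""
    attn = 4 * d_model * d_model
    ffn = 2 * d_model * (d_ff_ratio * d_model)
    return n_blocks * (attn + ffn)

base = transformer_params(512, 6)
deeper = transformer_params(512, 12)  # 2x depth -> ~2x parameters
wider = transformer_params(1024, 6)   # 2x width -> ~4x parameters
print(deeper / base, wider / base)    # 2.0 4.0
```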
Large-scale Transformer variants perform well on their tasks, but they are data-hungry and need careful regularization during training, and developers struggle to handle the issues caused by the very large number of parameters in their Transformer-based models. For instance, the Text-To-Text Transfer Transformer (T5) is a wide Transformer variant with a feed-forward dimension of around 65,000 and 11 billion parameters, while Generative Pre-trained Transformer 3 (GPT-3) is a deep variant with 96 Transformer blocks and 175 billion parameters.
Here comes the need for a different architectural approach, one that retains the essence of the Transformer architecture but employs relatively few parameters, saving memory, saving time and reducing training-data requirements. Sachin Mehta and Luke Zettlemoyer of the University of Washington, Marjan Ghazvininejad and Srinivasan Iyer of Facebook AI Research, and Hannaneh Hajishirzi of the Allen Institute for AI introduced a Deep and Light-weight Transformer named DeLighT that allocates parameters more efficiently among the Transformer blocks and layers. This approach can be applied to any Transformer variant to make it parameter-efficient without sacrificing performance.
How does DeLighT work?
The Deep and Light-weight Transformer architecture introduces the DeLighT transformation, a strategy based on the Group Linear Transformation (GLT) principle. It follows an expand-reduce scheme to scale a Transformer block in width or depth while distributing parameters efficiently. However, GLT is local in nature, which makes it unsuitable on its own for attention-based blocks that capture global context. DeLighT therefore uses feature shuffling, similar to channel shuffling in convolutional neural networks, to capture global context and share information among groups.
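The two ingredients above can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: the function names and the toy shapes are our own, and the real DeLighT transformation stacks many such layers with varying group counts.

```python
import numpy as np

def group_linear_transform(x, weights):
    """Group Linear Transformation: split the features into groups,
    transform each group with its own (smaller) weight matrix, and
    concatenate the results. `weights` is one matrix per group."""
    groups = len(weights)
    chunks = np.split(x, groups, axis=-1)            # per-group feature slices
    outs = [c @ w for c, w in zip(chunks, weights)]  # independent linear maps
    return np.concatenate(outs, axis=-1)

def feature_shuffle(x, groups):
    """Shuffle features across groups (analogous to channel shuffling
    in CNNs) so that subsequent group transforms mix information globally."""
    *batch, d = x.shape
    return x.reshape(*batch, groups, d // groups).swapaxes(-1, -2).reshape(*batch, d)

# toy usage: 2 groups expand 8 features to 16, then shuffle across groups
x = np.random.randn(4, 8)
w = [np.random.randn(4, 8) for _ in range(2)]  # each group: 4 -> 8 features
y = feature_shuffle(group_linear_transform(x, w), groups=2)
print(y.shape)  # (4, 16)
```

Note that each group's weight matrix is a fraction of the size of a full dense layer, which is where the parameter savings come from.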
These wide and deep representations enable the DeLighT architecture to replace multi-head attention layers with single-head attention layers, and feed-forward layers with light-weight feed-forward layers. The DeLighT blocks near the input are narrow and shallow, whereas the blocks near the output are wide and deep. This allows the architecture to distribute a minimal number of parameters very efficiently.
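A single-head attention layer is just scaled dot-product attention with one set of query, key and value projections. The minimal sketch below (our own toy shapes, not the paper's code) shows what replaces the usual multi-head layer once the DeLighT transformation has already produced rich, mixed representations:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(x, wq, wk, wv):
    """Scaled dot-product attention with a single head: one query, key
    and value projection instead of several parallel heads."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

x = np.random.randn(5, 16)                       # 5 tokens, 16-dim features
wq, wk, wv = (np.random.randn(16, 16) for _ in range(3))
out = single_head_attention(x, wq, wk, wv)
print(out.shape)  # (5, 16)
```

Dropping the extra heads removes the per-head projection matrices and the output projection that multi-head attention requires, which is a large share of a Transformer block's parameters.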
Performance of DeLighT
DeLighT was trained and evaluated on neural machine translation and language modelling tasks, using several WMT'14 and WMT'16 machine translation datasets and the WikiText-103 dataset.
DeLighT outperforms the baseline Transformer while using 2.8 times fewer parameters on the WMT'16 En-Ro machine translation task, and 1.8 times fewer on the WMT'14 En-Fr dataset with a 0.4-point increase in BLEU score.
DeLighT matches Transformer-XL's performance in language modelling with 1.5 times fewer parameters.
Clone the official DeLighT repository.

!git clone https://github.com/sacmehta/delight
Install dependencies with the following commands.
%%bash
cd delight
pip install --editable ./
The NVIDIA apex library helps speed up training. The following commands download its source code and install it.
%%bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
DeLighT in Neural Machine Translation
WMT’14 En-De Translation task
Download and preprocess the data using the following command.
!bash prepare_nmt_dataset.sh wmt14_en_de
Train the model using the following command. Note that training may require at least eight V100 GPUs with 32GB of memory each.
!python nmt_wmt14_en2de.py --d-m 128
The following code evaluates the model and computes the BLEU score.
%%bash
# evaluate the model with BLEU score
GEN_RES_FILE=gen_out.out
python generate.py data-bin/wmt14_en_de/ --path <results_dir>/checkpoint_best.pt \
  --beam 5 --lenpen 0.4 --remove-bpe --batch-size 128 > $GEN_RES_FILE
bash scripts/compound_split_bleu.sh $GEN_RES_FILE
WMT’14 En-Fr Translation task
Similar to the English-German translation task, the following commands download and preprocess the data, train the model, and evaluate it on the WMT'14 English-French translation task.
# download and preprocess the data
!bash prepare_nmt_dataset.sh wmt14_en_fr

# train the model on a single node of 8 V100 GPUs with 32GB memory each
!python nmt_wmt14_en2fr.py --d-m 128

# evaluate the model and compare with the gold-standard BLEU score
!python generate.py data-bin/wmt14_en_fr/ --path <results_dir>/checkpoint_best.pt --beam 5 --lenpen 0.9 --remove-bpe --batch-size 128 --quiet
DeLighT in Language Modelling
DeLighT was trained and evaluated on the well-known WikiText-103 task. The following commands download the dataset as a zipped file to the local machine or cloud environment.
%%bash
# download the dataset
cd delight/examples/language_model/
bash prepare-wikitext-103.sh
cd ../..
The following commands extract the data and preprocess it using the fairseq toolkit.
%%bash
TEXT=examples/language_model/wikitext-103

# preprocess with fairseq
fairseq-preprocess \
  --only-source \
  --trainpref $TEXT/wiki.train.tokens \
  --validpref $TEXT/wiki.valid.tokens \
  --testpref $TEXT/wiki.test.tokens \
  --destdir data-bin/wikitext-103 \
  --workers 20
Train the model on a single node of at least eight V100 GPUs with 32GB of memory each.
!python lm_wikitext_103.py --d-m 128
Evaluate the model by generating English text and log the evaluation results using the following command.
!python eval_lm.py data-bin/wikitext-103 --path <checkpoint_dir>/checkpoint_best.pt --max-sentences 2 --tokens-per-sample 512 --context-window 400 --gen-subset test --res-file eval_logs.txt
The DeLighT Transformer outperforms present state-of-the-art models in neural machine translation and language modelling while employing far fewer parameters. This reduction in parameters lets the models train with less data, less memory and less time. So far, its developers have applied the DeLighT transformation only to machine translation and language modelling; future work may use the strategy to build parameter-efficient computer vision models.
Note: Illustrations are obtained from the original research paper.