
Complete Guide to DeLighT: Deep and Light-weight Transformer

Rajkumar Lakshmanamoorthy

Transformers and their numerous variants achieve excellent performance today in many machine learning applications, including sequence-to-sequence modeling, language modeling and computer vision tasks. The baseline transformer is still one of the most common choices for language modeling. Most transformer architectures comprise a basic transformer block in both the encoder and the decoder. A basic transformer block employs several layers of multi-head attention to perform its task, and one of the major differences between the transformer variants and the baseline transformer is the number of multi-head attention layers they incorporate. Models are scaled wider by increasing the units in the hidden layers, or deeper by stacking more transformer blocks, to improve performance. As the number of layers or units increases, so does the number of parameters in the model.
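The quadratic growth in parameters with width is easy to see with a back-of-the-envelope count. The sketch below is an illustration (not tied to any particular library) that counts only the weight matrices in one standard transformer block, ignoring biases, layer norms and embeddings:

```python
def transformer_block_params(d_model, ffn_mult=4):
    """Rough weight count for one standard transformer block:
    four d x d projection matrices in attention (Q, K, V, output)
    plus two feed-forward matrices of shape d x (ffn_mult * d).
    Biases, layer norms and embeddings are ignored."""
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * (ffn_mult * d_model)
    return attention + feed_forward

# Doubling the width roughly quadruples the per-block parameter count,
# while stacking blocks grows the total only linearly with depth.
print(transformer_block_params(512))   # 3,145,728
print(transformer_block_params(1024))  # 12,582,912
```

Doubling `d_model` quadruples the per-block count, which is why wide variants like T5 and deep variants like GPT-3 both end up with billions of parameters.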

Large-scale transformer variants perform well in their tasks, but they are data-hungry and need careful regularization while training. Developers struggle to handle the issues caused by the very large number of parameters in their transformer-based models. For instance, Text-To-Text Transfer Transformer (T5) is a wide transformer variant with a dimension of 65,000 and 11 billion parameters. Generative Pre-trained Transformer 3 (GPT-3) is a deep transformer variant with 96 transformer blocks and 175 billion parameters!

Here comes the need for a different architectural approach that retains the essence of the transformer architecture but employs far fewer parameters, saving memory and time and reducing training data requirements. Sachin Mehta and Luke Zettlemoyer of the University of Washington, Marjan Ghazvininejad and Srinivasan Iyer of Facebook AI Research, and Hannaneh Hajishirzi of the Allen Institute for AI introduced a Deep and Light-weight Transformer named DeLighT that allocates parameters more efficiently among the transformer blocks and layers. This approach can be applied to any transformer variant to make it parameter-efficient without degrading performance.

How does DeLighT work? 

The Deep and Light-weight Transformer architecture introduces the DeLighT transformation, a strategy based on the Group Linear Transformation (GLT) principle. It follows an expand-reduce principle to scale the transformer block by width or depth while efficiently distributing the parameters. However, GLT is local in nature and, on its own, is not suited to attention-based blocks that capture global context. DeLighT therefore uses feature shuffling, similar to channel shuffling in convolutional neural networks, to capture global context and share information among groups.

The DeLighT transformation combines the GLT principle, feature shuffling, and an input mixer connection to efficiently learn wider and deeper representations.

These wide and deep representations enable the DeLighT architecture to replace multi-head attention layers with single-head attention layers and standard feed-forward layers with light-weight feed-forward layers. The DeLighT blocks near the input are narrow and shallow, whereas the blocks near the output are wide and deep. This allows the architecture to distribute a small parameter budget very efficiently.
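The group linear transformation and feature shuffling are easiest to see in a toy example. The sketch below is a minimal, dependency-free illustration (function names are ours, not from the DeLighT codebase): each group applies its own small weight matrix to its chunk of the input, and shuffling then interleaves the outputs so that subsequent layers see features from every group.

```python
def group_linear_transform(x, group_weights):
    """Split the feature vector x into len(group_weights) equal chunks and
    apply each group's own weight matrix to its chunk. With g groups, this
    needs roughly 1/g of the parameters of a full linear layer."""
    g = len(group_weights)
    size = len(x) // g
    outputs = []
    for i, weights in enumerate(group_weights):
        chunk = x[i * size:(i + 1) * size]
        outputs.append([sum(w * v for w, v in zip(row, chunk)) for row in weights])
    return outputs

def feature_shuffle(groups):
    """Interleave features across groups (like channel shuffling in
    convolutional networks) so information flows between groups."""
    return [grp[i] for i in range(len(groups[0])) for grp in groups]

# Two groups, each with a 2x2 identity weight matrix: the transform keeps
# the values, and the shuffle interleaves features from the two groups.
identity = [[1, 0], [0, 1]]
groups = group_linear_transform([1, 2, 3, 4], [identity, identity])
print(groups)                   # [[1, 2], [3, 4]]
print(feature_shuffle(groups))  # [1, 3, 2, 4]
```

Without the shuffle, group 1 would never see group 2's features; the interleaving is what lets a stack of such cheap layers still mix information globally.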

Conventional Transformer Block
The parameter-efficient DeLighT block

Performance of DeLighT

DeLighT was trained and evaluated on neural machine translation and language modeling tasks, using some of the WMT'14 and WMT'16 datasets and the WikiText-103 dataset.

DeLighT outperforms the baseline transformer with 2.8 times fewer parameters on the WMT'16 En-Ro machine translation task; on the WMT'14 En-Fr task it uses 1.8 times fewer parameters while improving the BLEU score by 0.4.

DeLighT matches Transformer-XL's performance in language modeling with 1.5 times fewer parameters.
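BLEU, the metric behind these comparisons, scores a candidate translation by its clipped n-gram overlap with a reference, discounted by a brevity penalty. A simplified, self-contained sentence-level version looks like this (real evaluations use corpus-level BLEU via tools such as sacrebleu):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions up to max_n, multiplied by a brevity penalty."""
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if overlap == 0:
            return 0.0
        log_precision += math.log(overlap / sum(cand.values())) / max_n
    # Penalise candidates shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_precision)

hypothesis = "the cat sat on the mat".split()
print(round(bleu(hypothesis, hypothesis), 2))  # 1.0 (a perfect match)
```

A perfect match scores 1.0 (often reported as 100); a 0.4-point BLEU gain on WMT'14 En-Fr, as reported for DeLighT, is on that 0-100 scale.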

Comparison of DeLighT with the baseline Vanilla Transformer in respect of the number of parameters and BLEU score
Comparison of DeLighT with present state-of-the-art models in machine translation

Python Implementation

DeLighT requires PyTorch 1.4.0+, Python 3.6+, an NVIDIA GPU, NVIDIA NCCL and the fairseq toolkit. The following command downloads the source code from the official repository.

!git clone https://github.com/sacmehta/delight

Install dependencies with the following commands.

 cd delight
 pip install --editable ./ 

The NVIDIA apex library enables faster training. The following commands download its source code and install it.

 git clone https://github.com/NVIDIA/apex
 cd apex
 pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
   --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
   --global-option="--fast_multihead_attn" ./ 

DeLighT in Neural Machine Translation

WMT’14 En-De Translation task

Download and preprocess the data using the following command.

!bash wmt14_en_de

Train the model using the following command. It should be noted that training may need at least eight V100 GPUs with 32 GB of memory each.

!python --d-m 128

The following commands evaluate the model and compute its BLEU score.

 # evaluate model with BLEU score
 python data-bin/wmt14_en_de/ --path <results_dir>/ --beam 5 --lenpen 0.4 --remove-bpe --batch-size 128 > GEN_RES_FILE
 bash scripts/ GEN_RES_FILE  

WMT’14 En-Fr Translation task

Similar to the English-German translation task, the following commands download and preprocess the data, train the model, and evaluate it on the WMT'14 English-French translation task.


 # download the data, preprocess it
 !bash wmt14_en_fr
 # train the model with a single node of 8 v100 GPUs each of memory 32GB
 !python --d-m 128
 # evaluate model and compare with gold standard BLEU score
 !python data-bin/wmt14_en_fr/ --path <results_dir>/ --beam 5 --lenpen 0.9 --remove-bpe --batch-size 128 --quiet 

DeLighT in Language Modelling

DeLighT was also trained and evaluated on the well-known WikiText-103 benchmark. The following commands download the necessary dataset as a zipped file to the local machine or cloud environment.

 # download dataset (via fairseq's standard preparation script)
 cd delight/examples/language_model/
 bash prepare-wikitext-103.sh
 cd ../..

The following commands extract the data and preprocess it using the fairseq toolkit.

 # preprocess with fairseq
 fairseq-preprocess \
     --only-source \
     --trainpref $TEXT/wiki.train.tokens \
     --validpref $TEXT/wiki.valid.tokens \
     --testpref $TEXT/wiki.test.tokens \
     --destdir data-bin/wikitext-103 \
     --workers 20 

Train the model on a single node with at least eight V100 GPUs (32 GB of memory each).

!python --d-m 128

Evaluate the model by generating English text and log the evaluation results using the following command.

!python data-bin/wikitext-103 --path <checkpoint_dir>/ --max-sentences 2 --tokens-per-sample 512 --context-window 400 --gen-subset test --res-file eval_logs.txt
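Language models on WikiText-103 are conventionally compared by perplexity, the exponential of the average per-token negative log-likelihood, which is the figure such an evaluation run reports. A minimal illustration:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token).
    Lower is better; a uniform guess over V tokens gives perplexity V."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 1/4 to every token scores perplexity 4.
print(round(perplexity([math.log(0.25)] * 10), 6))  # 4.0
```

This is the sense in which DeLighT "matches Transformer-XL": comparable test perplexity on WikiText-103 with 1.5 times fewer parameters.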

Wrapping up

The DeLighT Transformer outperforms present state-of-the-art models in neural machine translation and language modeling while employing far fewer parameters. The reduction in parameters enables the models to train with less data, less memory and less time. Its developers have so far applied the DeLighT transformation only to machine translation and language modeling tasks; future work may use it to build parameter-efficient computer vision models.

Note: Illustrations are obtained from the original research paper.


Copyright Analytics India Magazine Pvt Ltd
