The Evolved Transformer was discovered through neural architecture search (NAS) for sequence-to-sequence tasks such as neural machine translation (NMT). It outperforms the vanilla Transformer, especially on translation tasks, with higher BLEU scores, fewer model parameters and greater computational efficiency.
Recurrent neural networks performed well on sequence-to-sequence tasks for a long time. However, their large number of parameters called for a more computationally efficient alternative. Convolutional and feed-forward networks outperformed traditional recurrent networks on specific tasks, but they did not yield a general architecture that could address any sequence problem.
Transformers, which incorporate self-attention mechanisms, have emerged as a better and more general alternative. They have proven powerful across modalities including text, audio, image, and video. Many variants and extensions of the vanilla Transformer have been developed for specific tasks. A few notable examples are the Vision Transformer and the ResNet-backed Vision Transformer for computer vision tasks, and TransUNet for medical image segmentation. Similarly, Google Brain researchers David R. So, Chen Liang and Quoc V. Le developed the Evolved Transformer through neural architecture search (NAS), targeting neural machine translation tasks, and they succeeded!
Models discovered by neural architecture search have begun to outperform human-designed models in many applications. Evolution-based architecture search has been reported to do better than reinforcement-learning-based search, especially when training resources are limited. In this work, tournament selection, a well-known evolutionary algorithm, is used to search over models. The vanilla Transformer seeds the initial population to warm-start the search. The search is performed directly on the WMT 2014 English-German translation task with the newly developed Progressive Dynamic Hurdles (PDH) algorithm. The search produced a new model that outperforms the vanilla Transformer on four well-established language tasks:
- WMT 2014 English-German (En-De) translation task,
- WMT 2014 English-French (En-Fr) translation task,
- WMT 2014 English-Czech (En-Cs) translation task and
- the 1 Billion Word Language Model Benchmark (LM1B).
This model is named the Evolved Transformer, or ET for short.
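The tournament selection loop described above can be sketched in a few lines. This is a toy illustration only: the "architecture" is a list of integers, and tournament_evolution, toy_fitness, toy_mutate and TARGET are hypothetical names, not part of the paper's code. It does not model Progressive Dynamic Hurdles, which additionally decides how much training each candidate receives.

```python
import random

def tournament_evolution(seed, fitness, mutate, population_size=20,
                         tournament_size=5, steps=200, rng=None):
    """Toy tournament-selection evolution; returns the fittest individual found."""
    rng = rng or random.Random(0)
    # Warm start: fill the population with copies of the seed individual.
    population = [(fitness(seed), seed) for _ in range(population_size)]
    for _ in range(steps):
        # Sample a random subset (the "tournament") of population indices.
        contestants = rng.sample(range(len(population)), tournament_size)
        # The fittest contestant becomes the parent of a mutated child.
        parent_idx = max(contestants, key=lambda i: population[i][0])
        child = mutate(population[parent_idx][1], rng)
        # The least fit contestant is removed and replaced by the child.
        loser_idx = min(contestants, key=lambda i: population[i][0])
        population[loser_idx] = (fitness(child), child)
    return max(population, key=lambda p: p[0])[1]

# Hypothetical toy problem: evolve a list of integers towards TARGET.
TARGET = [3, 1, 4, 1, 5]

def toy_fitness(genes):
    # Higher is better; 0 means the genes match TARGET exactly.
    return -sum(abs(g - t) for g, t in zip(genes, TARGET))

def toy_mutate(genes, rng):
    # Change one randomly chosen gene by +1 or -1.
    child = list(genes)
    i = rng.randrange(len(child))
    child[i] += rng.choice([-1, 1])
    return child

best = tournament_evolution([0, 0, 0, 0, 0], toy_fitness, toy_mutate)
print(best, toy_fitness(best))
```

In the real search, the seed would be the vanilla Transformer, a mutation would change one layer or connection, and fitness would come from training the candidate on the translation task.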
PyTorch Implementation of the Evolved Transformer
The pre-built, pre-trained architecture of the Evolved Transformer runs best on GPU or TPU devices. The model and the necessary files can be downloaded from the source repository using the following command.
!git clone https://github.com/Shikhar-S/EvolvedTransformer.git
Proper download of files can be ensured by running the following command.
The environment with dependencies can be created using the following commands.
%%bash
cd EvolvedTransformer/
pip3 install -r requirements.txt
# install spacy
python3 -m spacy download en
Once the environment is created, the model can be retrained or evaluated with the built-in dataset or a custom dataset. The following commands run text classification on the built-in AG_NEWS dataset.
%%bash
cd EvolvedTransformer/
python3 main.py
# run on Evolved Transformer's encoder
python3 main.py --evolved true
A generalized Evolved Transformer block is provided as a class in the PyTorch implementation. Any custom architecture can incorporate this class to build a new model on top of the Evolved Transformer. The following code defines the block.
from models.embedder import Embedder, PositionalEncoder
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from models.gated_linear_unit import GLU

class EvolvedTransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads=8, ff_hidden=4):
        super(EvolvedTransformerBlock, self).__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads)
        # One LayerNorm in front of each of the four sub-layers.
        self.layer_norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, ff_hidden*d_model),
            nn.ReLU(),
            nn.Linear(ff_hidden*d_model, d_model),
        )
        self.glu = GLU(d_model, 1)
        # Wide dense branch.
        self.left_net = nn.Sequential(
            nn.Linear(d_model, ff_hidden*d_model),
            nn.ReLU()
        )
        # Narrow convolutional branch.
        self.right_net = nn.Sequential(
            nn.Conv1d(in_channels=d_model, out_channels=d_model//2, kernel_size=3, padding=1),
            nn.ReLU()
        )
        self.mid_layer_norm = nn.LayerNorm(d_model*ff_hidden)
        self.sep_conv = nn.Sequential(
            nn.Conv1d(in_channels=d_model*ff_hidden, out_channels=1, kernel_size=9, padding=4),
            nn.Conv1d(in_channels=1, out_channels=d_model, kernel_size=1)
        )

    def forward(self, x):
        # x has shape (batch, seq_len, d_model).
        glued = self.glu(self.layer_norms[0](x)) + x
        glu_normed = self.layer_norms[1](glued)
        left_branch = self.left_net(glu_normed)
        # Conv1d expects (batch, channels, seq_len), hence the transposes.
        right_branch = self.right_net(glu_normed.transpose(1, 2)).transpose(1, 2)
        # Zero-pad the feature axis of the right branch to match the left branch.
        right_branch = F.pad(input=right_branch,
                             pad=(0, left_branch.shape[2]-right_branch.shape[2], 0, 0, 0, 0),
                             mode='constant', value=0)
        mid_result = left_branch + right_branch
        mid_result = self.mid_layer_norm(mid_result)
        mid_result = self.sep_conv(mid_result.transpose(1, 2)).transpose(1, 2)
        mid_result = mid_result + glued
        normed = self.layer_norms[2](mid_result)
        # nn.MultiheadAttention expects (seq_len, batch, d_model) by default.
        normed = normed.transpose(0, 1)
        # The attention module returns (output, weights); keep only the output.
        attended = self.attention(normed, normed, normed, need_weights=False)[0].transpose(0, 1) + mid_result
        normed = self.layer_norms[3](attended)
        forwarded = self.feed_forward(normed) + attended
        return forwarded
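The two branches of the block produce different feature widths, which is why the right branch is zero-padded before the sum. The sketch below traces those shapes in plain Python, without PyTorch. The helper names conv1d_out_len and trace_branch_shapes are hypothetical, and the layer sizes are read off the block's constructor rather than measured from a running model.

```python
def conv1d_out_len(length, kernel_size, padding, stride=1):
    # Standard 1-D convolution output length (dilation 1).
    return (length + 2 * padding - kernel_size) // stride + 1

def trace_branch_shapes(batch, seq_len, d_model, ff_hidden=4):
    """Return (batch, seq, features) shapes at key stages of the block."""
    # Left branch: Linear(d_model -> ff_hidden * d_model) + ReLU.
    left = (batch, seq_len, ff_hidden * d_model)
    # Right branch: Conv1d(kernel_size=3, padding=1) keeps the sequence length.
    right_len = conv1d_out_len(seq_len, kernel_size=3, padding=1)
    right = (batch, right_len, d_model // 2)
    # Zeros appended on the feature axis so both branches can be summed.
    pad = left[2] - right[2]
    # Conv1d(kernel_size=9, padding=4) then a 1x1 conv back to d_model features.
    sep_len = conv1d_out_len(right_len, kernel_size=9, padding=4)
    out = (batch, sep_len, d_model)
    return {"left": left, "right": right, "pad": pad, "out": out}

shapes = trace_branch_shapes(batch=16, seq_len=200, d_model=32)
print(shapes)
```

Because both convolutions preserve the sequence length, the output has the same shape as the input and the residual connections line up.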
The pre-built model can be configured through the command-line arguments defined in the following code.
import argparse
import logging
import utils

logger = utils.get_logger()

def str2bool(v):
    # Only the string 'true' (case-insensitive) is treated as True.
    return v.lower() == 'true'

parser = argparse.ArgumentParser()
parser.add_argument("--batch", type=int, default=16)
parser.add_argument("--evolved", type=str2bool, default=False)
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--model_dim", type=int, default=32)
parser.add_argument("--max_seq_len", type=int, default=200)
parser.add_argument("--backend", type=str, default='auto', choices=['cpu', 'gpu', 'auto'])
parser.add_argument('--ngrams', type=int, default=2)
parser.add_argument('--train_split', type=float, default=0.95)

def get_args():
    logger.info('Parsing arguments')
    # parse_known_args returns unrecognised flags instead of raising an error.
    args, unparsed = parser.parse_known_args()
    return args, unparsed
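To see how such flags are parsed, here is a small self-contained sketch that rebuilds three of the arguments and parses an explicit list instead of the real command line. The --dropout flag is a made-up example of an unrecognised option, not one the repository defines.

```python
import argparse

def str2bool(v):
    # Only the literal string "true" (any casing) counts as True.
    return v.lower() == "true"

parser = argparse.ArgumentParser()
parser.add_argument("--batch", type=int, default=16)
parser.add_argument("--evolved", type=str2bool, default=False)
parser.add_argument("--model_dim", type=int, default=32)

# Passing an explicit list avoids reading sys.argv, which is handy in notebooks.
args, unparsed = parser.parse_known_args(
    ["--evolved", "True", "--batch", "32", "--dropout", "0.1"]
)
print(args.evolved, args.batch, args.model_dim, unparsed)
# prints: True 32 32 ['--dropout', '0.1']
```

Flags that are not passed keep their defaults, and unknown flags are collected in the second return value rather than aborting the run.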
TensorFlow Implementation of the Evolved Transformer
The TensorFlow implementation of the Evolved Transformer is available through the Tensor2Tensor (T2T) framework. It provides a ready-made, efficient model that can be set up in minimal time, even on a CPU device. The following command installs Tensor2Tensor and its dependencies on the local machine or in a cloud environment.
!pip install tensor2tensor
For custom applications of the Evolved Transformer, or to build an architecture on top of it, the Tensor2Tensor framework provides a module in its models package that can be imported using the following command.
from tensor2tensor.models import evolved_transformer
Tensor2Tensor brings many well-known models and datasets together in one place. The command t2t-trainer --registry_help lists the registered models, problems (datasets) and hyperparameter sets. Users can choose any model and problem of interest and run them either in a terminal or from Python code.
The Evolved Transformer can be invoked on an example translation problem using the following commands. For simplicity, the base version of the model is run here on a CPU. The WMT 2014 English-to-German translation task is chosen as the problem. Note that training may take hours, depending on the device configuration.
Setting the initial parameters, defining the problem and the model:
%%bash
PROBLEM=translate_ende_wmt32k
MODEL=evolved_transformer
HPARAMS=evolved_transformer_base
DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

# Generate data
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM
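Training the model based on our requirements:
%%bash
# Each %%bash cell runs in its own shell, so the variables are set again here.
PROBLEM=translate_ende_wmt32k
MODEL=evolved_transformer
HPARAMS=evolved_transformer_base
DATA_DIR=$HOME/t2t_data
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS

# Train
# * If you run out of memory, add --hparams='batch_size=1024'.
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR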
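Running the decoder on sample English sentences, with German reference translations saved for evaluation:
%%bash
# Each %%bash cell runs in its own shell, so the variables are set again here.
PROBLEM=translate_ende_wmt32k
MODEL=evolved_transformer
HPARAMS=evolved_transformer_base
DATA_DIR=$HOME/t2t_data
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS

# Decode
DECODE_FILE=$DATA_DIR/decode_this.txt
echo "Hello world" >> $DECODE_FILE
echo "Goodbye world" >> $DECODE_FILE
echo -e 'Hallo Welt\nAuf Wiedersehen Welt' > ref-translation.de

BEAM_SIZE=4
ALPHA=0.6

t2t-decoder \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
  --decode_from_file=$DECODE_FILE \
  --decode_to_file=translation.en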
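The alpha value passed to the decoder controls how beam search trades off short and long hypotheses. The sketch below assumes alpha is the exponent of the GNMT-style length penalty ((5 + length) / 6) ** alpha, which is how this hyperparameter is commonly defined; it is an illustration, not Tensor2Tensor's own code. The functions length_penalty and rank_hypotheses and the log-probabilities are made up for the example.

```python
def length_penalty(length, alpha=0.6):
    # GNMT-style length normalisation: ((5 + length) / 6) ** alpha.
    return ((5.0 + length) / 6.0) ** alpha

def rank_hypotheses(hypotheses, alpha=0.6):
    """Sort (tokens, log_prob) pairs by length-normalised score, best first."""
    return sorted(
        hypotheses,
        key=lambda h: h[1] / length_penalty(len(h[0]), alpha),
        reverse=True,
    )

# Hypothetical beam candidates with made-up log-probabilities.
candidates = [
    (["Hallo"], -1.0),
    (["Hallo", "Welt"], -1.05),
]

# alpha=0 disables normalisation, so the shorter hypothesis ranks first.
print(rank_hypotheses(candidates, alpha=0.0)[0][0])
# alpha=0.6 divides by a larger penalty for longer outputs, favouring them.
print(rank_hypotheses(candidates, alpha=0.6)[0][0])
```

Since log-probabilities are negative, dividing by a penalty greater than one raises the score of longer hypotheses, which counteracts the decoder's bias towards short outputs.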
Viewing the translations of the predefined test text:
%%bash
# See the translations
cat translation.en
Evaluating the BLEU score against the reference translation:
%%bash
# Evaluate the BLEU score
t2t-bleu --translation=translation.en --reference=ref-translation.de
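For intuition about what the BLEU score measures, here is a simplified sentence-level sketch: clipped n-gram precisions combined by a geometric mean, multiplied by a brevity penalty. It is not the t2t-bleu implementation, which works at corpus level with its own tokenization; simple_bleu and ngrams are hypothetical helpers that split on whitespace and apply no smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU on whitespace tokens, in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: an n-gram counts at most as often as in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # unsmoothed: any zero precision gives a score of 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))  # exact match
print(simple_bleu("the cat sat on the", reference))      # shorter candidate
print(simple_bleu("Hallo Welt", "Hallo Welt", max_n=2))  # short text needs max_n <= 2
```

A two-word sentence has no 3-grams or 4-grams, so this unsmoothed variant returns 0 for it at the default max_n=4; BLEU is meant to be computed over a full test set rather than a couple of short sentences.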
The Evolved Transformer outperforms the vanilla Transformer, reaching a BLEU score of 29.8 on the WMT 2014 English-German translation task at the big model size. At a smaller size, it matches the quality of the original big Transformer with 37.6% fewer parameters. It therefore uses less memory and is more computationally efficient.