Hands-on Guide to The Evolved Transformer on Neural Machine Translation

Rajkumar Lakshmanamoorthy

The Evolved Transformer is a sequence-to-sequence architecture discovered through neural architecture search (NAS) for tasks such as neural machine translation (NMT). It outperforms the vanilla Transformer, especially on translation tasks, with higher BLEU scores, fewer model parameters and better computational efficiency.

Recurrent neural networks performed well on sequence-to-sequence tasks for a long time. However, their large number of parameters called for a more computationally efficient alternative. Convolutional and feed-forward networks outperformed traditional recurrent networks on specific tasks, but they did not yield a general architecture for arbitrary sequence problems.

Transformers, built on self-attention, have emerged as a better and more general alternative. They have proved powerful across domains including text, audio, image and video. Many task-specific variants and extensions of the vanilla Transformer have been developed. A few notable examples are the Vision Transformer and the ResNet-backed Vision Transformer for computer vision tasks, and TransUNet for medical image segmentation. Similarly, the Google Brain researchers David R. So, Chen Liang and Quoc V. Le developed the Evolved Transformer through neural architecture search (NAS), targeting neural machine translation, and they succeeded.

Models discovered by neural architecture search have begun to outperform human-designed models in many applications. Evolution-based architecture search has been shown to be more efficient than reinforcement-learning-based search, especially when training resources are limited. In this work, tournament selection, a well-known evolutionary algorithm, is used to search the architecture space, and the vanilla Transformer seeds the initial population to warm-start the search. The search is performed directly on the WMT 2014 English-German translation task with the newly developed Progressive Dynamic Hurdles (PDH) algorithm, which allocates more training resources to the more promising candidate models. The resulting model outperforms the vanilla Transformer on four well-established language tasks:

  1. WMT 2014 English-German (En-De) translation task, 
  2. WMT 2014 English-French (En-Fr) translation task, 
  3. WMT 2014 English-Czech (En-Cs) translation task and 
  4. the 1 Billion Word Language Model Benchmark (LM1B).

This model is named the Evolved Transformer, or ET for short.
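
The search procedure described above can be illustrated with a toy example. The following is a minimal sketch of tournament selection with a warm start, not the authors' implementation: the "architecture" is just a tuple of integers, and `toy_fitness`, `toy_mutate` and `TARGET` are hypothetical stand-ins for training a candidate model and measuring its quality. For brevity, the parent and the removed individual are drawn from the same tournament.

```python
import random

TARGET = (3, 1, 4, 1)  # hypothetical optimum of the toy search space


def toy_fitness(arch):
    """Higher is better; stands in for the validation quality of a trained model."""
    return -sum((gene - goal) ** 2 for gene, goal in zip(arch, TARGET))


def toy_mutate(arch, rng):
    """Change one randomly chosen gene by +1 or -1."""
    genes = list(arch)
    index = rng.randrange(len(genes))
    genes[index] += rng.choice((-1, 1))
    return tuple(genes)


def tournament_search(seed_arch, fitness, mutate, population_size=20,
                      tournament_size=5, steps=300, rng=None):
    rng = rng or random.Random(0)
    # Warm start: every individual begins as a copy of the seed architecture.
    population = [(seed_arch, fitness(seed_arch))] * population_size
    for _ in range(steps):
        # Sample a tournament, mutate its fittest member, replace its weakest member.
        contestants = rng.sample(range(population_size), tournament_size)
        parent = max(contestants, key=lambda i: population[i][1])
        loser = min(contestants, key=lambda i: population[i][1])
        child = mutate(population[parent][0], rng)
        population[loser] = (child, fitness(child))
    return max(population, key=lambda individual: individual[1])


seed = (0, 0, 0, 0)
best_arch, best_fitness = tournament_search(seed, toy_fitness, toy_mutate)
print(best_arch, best_fitness)
```

Because only the weakest member of each tournament is replaced, the best fitness in the population never decreases, which is why seeding with a strong architecture such as the vanilla Transformer gives the search a useful head start.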

Comparison of the encoder blocks between the Vanilla Transformer and the Evolved Transformer (Source)
Comparison of the decoder blocks between the Vanilla Transformer and the Evolved Transformer (Source)

PyTorch Implementation of the Evolved Transformer

The pre-built, pre-trained architecture of the Evolved Transformer runs best on GPU or TPU devices. The model and the necessary files can be downloaded from the source repository using the following command.

!git clone 

A successful download can be confirmed by listing the contents of the cloned directory.

!ls EvolvedTransformer/


The environment with dependencies can be created using the following commands.

 cd EvolvedTransformer/
 pip3 install -r requirements.txt
 # install spacy
 python3 -m spacy download en 

Once the environment is created, the model can be retrained or evaluated with the in-built dataset or a custom dataset. The following commands run text classification on the in-built AG_NEWS dataset.

 cd EvolvedTransformer/
 # run on Evolved Transformer’s encoder
 python3 --evolved true  

A generalized Evolved Transformer block is provided as a PyTorch class. Any custom architecture can use this class to build a new model on top of the Evolved Transformer. The following code defines the EvolvedTransformerBlock class.

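 import math

 import torch
 import torch.nn as nn
 import torch.nn.functional as F

 from models.embedder import Embedder, PositionalEncoder
 from models.gated_linear_unit import GLU


 class EvolvedTransformerBlock(nn.Module):
     def __init__(self, d_model, num_heads=8, ff_hidden=4):
         super().__init__()  # nn.Module must be initialised before submodules are assigned
         self.attention = nn.MultiheadAttention(d_model, num_heads)
         self.layer_norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])
         # The layer lists of the nn.Sequential containers below, and the definitions of
         # self.mid_layer_norm and self.sep_conv used in forward(), are cut off in this
         # listing; refer to the source repository for the full definitions.
         self.feed_forward = nn.Sequential(
         self.glu = GLU(d_model, 1)
         self.left_net = nn.Sequential(
         self.right_net = nn.Sequential(

     def forward(self, x):
         # x: (batch, seq_len, d_model)
         # Gated linear unit with a residual connection
         glued = self.glu(self.layer_norms[0](x)) + x
         glu_normed = self.layer_norms[1](glued)
         # Two parallel branches; the transposes move channels to axis 1 for the right branch
         left_branch = self.left_net(glu_normed)
         right_branch = self.right_net(glu_normed.transpose(1, 2)).transpose(1, 2)
         # Zero-pad the right branch along the feature axis so both branches can be summed
         right_branch = F.pad(input=right_branch, pad=(0, left_branch.shape[2] - right_branch.shape[2], 0, 0, 0, 0), mode='constant', value=0)
         mid_result = left_branch + right_branch
         mid_result = self.mid_layer_norm(mid_result)
         mid_result = self.sep_conv(mid_result.transpose(1, 2)).transpose(1, 2)
         mid_result = mid_result + glued
         # nn.MultiheadAttention defaults to (seq_len, batch, d_model) inputs,
         # so transpose on the way in and on the way out
         normed = self.layer_norms[2](mid_result).transpose(0, 1)
         attended = self.attention(normed, normed, normed, need_weights=False)[0].transpose(0, 1) + mid_result
         # Position-wise feed-forward network with a residual connection
         normed = self.layer_norms[3](attended)
         forwarded = self.feed_forward(normed) + attended
         return forwarded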
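
The branch-merging step in forward() zero-pads one branch along the feature axis so that the two branches can be added element-wise. The sketch below reproduces that idea with plain Python lists so it runs without PyTorch. `pad_features` and `add_branches` are hypothetical helper names, and the tiny shapes are chosen only for illustration.

```python
def pad_features(rows, target_width, fill=0.0):
    """Right-pad each feature vector in `rows` with `fill` up to `target_width`."""
    if any(len(row) > target_width for row in rows):
        raise ValueError("target_width is smaller than an existing row")
    return [list(row) + [fill] * (target_width - len(row)) for row in rows]


def add_branches(left, right):
    """Pad the narrower `right` branch to the width of `left`, then add element-wise."""
    right_padded = pad_features(right, target_width=len(left[0]))
    return [[a + b for a, b in zip(left_row, right_row)]
            for left_row, right_row in zip(left, right_padded)]


left = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]  # 2 positions, 4 features each
right = [[10.0, 20.0], [30.0, 40.0]]                  # 2 positions, 2 features each
merged = add_branches(left, right)
print(merged)  # [[11.0, 22.0, 3.0, 4.0], [35.0, 46.0, 7.0, 8.0]]
```

Only the leading features receive a contribution from the right branch; the padded positions pass the left branch through unchanged.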

Custom configurations for the pre-built model can be set with the following code.

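 import argparse
 import logging
 import utils

 logger = utils.get_logger()

 def str2bool(v):
     # Compare with the literal string; `in ('true')` is a substring test on a plain string
     return v.lower() == 'true'

 parser = argparse.ArgumentParser()
 parser.add_argument("--backend", type=str, default='auto', choices=['cpu', 'gpu', 'auto'])

 def get_args():
     logger.info('Parsing arguments')
     # parse_known_args returns the recognised arguments and the list of unrecognised ones
     args, unparsed = parser.parse_known_args()
     return args, unparsed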
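
The snippet below is a small, self-contained sketch of how such a parser behaves. It leaves out the repository's `utils` logger so that it runs on its own. The `--evolved` flag is declared here purely for illustration (the name is taken from the command shown earlier; its actual definition in the repository is an assumption), and the argument list is passed explicitly instead of being read from the command line.

```python
import argparse


def str2bool(value):
    """Interpret the literal string 'true' (any casing) as True, anything else as False."""
    return value.lower() == "true"


parser = argparse.ArgumentParser()
parser.add_argument("--backend", type=str, default="auto", choices=["cpu", "gpu", "auto"])
# Hypothetical declaration of the --evolved switch, for illustration only.
parser.add_argument("--evolved", type=str2bool, default=False)

# parse_known_args does not fail on unknown options; it returns them separately.
args, unparsed = parser.parse_known_args(
    ["--evolved", "true", "--backend", "cpu", "--unknown", "1"]
)
print(args.evolved, args.backend, unparsed)
```

Using parse_known_args rather than parse_args lets a script tolerate options meant for another component, at the cost of silently ignoring misspelled flags unless the unparsed list is checked.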

TensorFlow Implementation of the Evolved Transformer

The TensorFlow implementation of the Evolved Transformer is available through the Tensor2Tensor (T2T) framework. T2T ships the model definition together with ready-made hyperparameter sets, so the model can be set up quickly, even on a CPU-only machine. The following command installs Tensor2Tensor and its dependencies on a local machine or in a cloud environment.

!pip install tensor2tensor

To use the Evolved Transformer in a custom application, or to build an architecture on top of it, the evolved_transformer module can be imported from the models package of the Tensor2Tensor framework using the following command.

from tensor2tensor.models import evolved_transformer

Tensor2Tensor brings many well-known models and datasets together in one place. The following command lists the registered models, problems (datasets) and hyperparameter sets. Users can choose any model and problem of interest and run them either from a terminal or from Python code.

!t2t-trainer --registry_help

The Evolved Transformer can be run on an example translation problem using the following commands. Here the base CPU version of the model is used for the sake of simplicity, and the WMT 2014 English-to-German translation task is chosen as the problem. Note that training may take hours, depending on the device configuration.

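Generating the training data for the chosen problem (the shell variables PROBLEM, MODEL, HPARAMS, DATA_DIR, TMP_DIR and TRAIN_DIR are assumed to have been set beforehand),

 # Generate data
 t2t-datagen \
   --data_dir=$DATA_DIR \
   --tmp_dir=$TMP_DIR \
   --problem=$PROBLEM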

Training the model based on our requirements,

 # Train
 # *  If you run out of memory, add --hparams='batch_size=1024'.
 t2t-trainer \
   --data_dir=$DATA_DIR \
   --problem=$PROBLEM \
   --model=$MODEL \
   --hparams_set=$HPARAMS \
   --output_dir=$TRAIN_DIR
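
The commands in this walkthrough rely on a handful of settings that have to be chosen first. The following sketch assembles the data-generation and training command lines in Python so that the settings live in one place. The problem, model and hyperparameter-set names are assumptions that should be confirmed against the output of `t2t-trainer --registry_help`, and the directory names are hypothetical. The sketch only builds and prints the commands; it does not run them.

```python
import shlex

# Assumed example values; confirm the registry names with `t2t-trainer --registry_help`.
config = {
    "problem": "translate_ende_wmt32k",         # assumed problem name for WMT English-German
    "model": "evolved_transformer",             # assumed registry name of the model
    "hparams_set": "evolved_transformer_base",  # assumed base hyperparameter set
    "data_dir": "t2t_data",                     # hypothetical directories
    "tmp_dir": "t2t_tmp",
    "train_dir": "t2t_train",
}


def datagen_command(cfg):
    """Argument list for generating the training data."""
    return [
        "t2t-datagen",
        f"--data_dir={cfg['data_dir']}",
        f"--tmp_dir={cfg['tmp_dir']}",
        f"--problem={cfg['problem']}",
    ]


def trainer_command(cfg):
    """Argument list for training the chosen model on the chosen problem."""
    return [
        "t2t-trainer",
        f"--data_dir={cfg['data_dir']}",
        f"--problem={cfg['problem']}",
        f"--model={cfg['model']}",
        f"--hparams_set={cfg['hparams_set']}",
        f"--output_dir={cfg['train_dir']}",
    ]


datagen = datagen_command(config)
trainer = trainer_command(config)
print(shlex.join(datagen))
print(shlex.join(trainer))
```

The resulting lists can be handed to `subprocess.run` once Tensor2Tensor is installed; keeping them as lists avoids shell-quoting problems with unusual paths.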

Running the decoder on sample English sentences and saving their German reference translations,

 # Decode
 echo "Hello world" >> $DECODE_FILE
 echo "Goodbye world" >> $DECODE_FILE
 # Reference translations (file name as used in the Tensor2Tensor walkthrough)
 echo -e 'Hallo Welt\nAuf Wiedersehen Welt' > ref-translation.de
 t2t-decoder \
   --data_dir=$DATA_DIR \
   --problem=$PROBLEM \
   --model=$MODEL \
   --hparams_set=$HPARAMS \
   --output_dir=$TRAIN_DIR \
   --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
   --decode_from_file=$DECODE_FILE \
   --decode_to_file=translation.en

Viewing the translations produced for the sample sentences,

 # See the translations
 cat translation.en

Evaluating the BLEU score against the reference translations,

 # Evaluate the BLEU score
 t2t-bleu --translation=translation.en --reference=ref-translation.de
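
To make the metric itself more concrete, here is a minimal sketch of a corpus-level BLEU computation with whitespace tokenization, clipped n-gram matches and a brevity penalty. It is an illustration only: `corpus_bleu` is a hypothetical helper, and `t2t-bleu` applies its own tokenization, so its scores will generally differ from this simplified version.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """BLEU on a 0-100 scale for one reference per hypothesis, without smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Counter intersection clips each n-gram count at its reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # some n-gram order has no match at all
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


references = ["Hallo Welt", "Auf Wiedersehen Welt"]
perfect = corpus_bleu(references, references, max_n=2)
mismatch = corpus_bleu(["Guten Tag", "Bis bald"], references, max_n=2)
print(perfect, mismatch)
```

The example uses max_n=2 because the sample sentences are too short to contain any 4-grams, in which case the unsmoothed score would be zero.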

Wrapping up

The Evolved Transformer outperforms the vanilla Transformer: at a big model size it reaches a BLEU score of 29.8 on the WMT 2014 English-German translation task, and at smaller sizes it matches the quality of the original big Transformer with 37.6% fewer parameters. It therefore uses less memory and is more computationally efficient.

Comparison of performance between the Vanilla Transformer and the Evolved Transformer (Source)

Copyright Analytics India Magazine Pvt Ltd

Scroll To Top