
Hands-on Guide to The Evolved Transformer on Neural Machine Translation


The Evolved Transformer was evolved with neural architecture search (NAS) to perform sequence-to-sequence tasks such as neural machine translation (NMT). It outperforms the vanilla Transformer, especially on translation tasks, with a higher BLEU score, substantially fewer model parameters and greater computational efficiency.

Recurrent neural networks performed well on sequence-to-sequence tasks for a long time. However, the large number of parameters they require called for a more computationally efficient alternative architecture. Convolutional and feed-forward networks outperformed traditional recurrent networks on specific tasks, but they could not provide a generalized architecture that addresses any sequence problem.

Transformers have emerged as a better alternative and a generalized approach built on self-attention mechanisms. They have become powerful at solving problems across modalities, including text, audio, image and video. Many variants and extensions of the vanilla Transformer have been developed in a task-specific manner; a few remarkable examples include the Vision Transformer and the ResNet-backed Vision Transformer for computer vision tasks, and TransUNet for medical image segmentation. Similarly, the Google Brain researchers David R. So, Chen Liang and Quoc V. Le developed the Evolved Transformer through a neural architecture search (NAS) approach targeting neural machine translation tasks, and they succeeded!

Models discovered by neural architecture search have begun to outperform human-designed models in many applications. Evolution-based search, in particular, has proved more efficient than reinforcement-learning-based search when training resources are limited. In this work, the well-known tournament selection algorithm is applied to perform the model search, with the vanilla Transformer used to warm-start it. The search is run directly on the WMT 2014 English-German translation task using the newly developed Progressive Dynamic Hurdles (PDH) algorithm, which grants additional training steps only to candidate architectures that clear progressively established fitness hurdles (a simplified sketch of this procedure is given below). As a result of the search, a new model was evolved that outperforms the vanilla Transformer on four well-established language tasks:

  1. WMT 2014 English-German (En-De) translation task, 
  2. WMT 2014 English-French (En-Fr) translation task, 
  3. WMT 2014 English-Czech (En-Cs) translation task and 
  4. the 1 Billion Word Language Model Benchmark (LM1B).

This model has been named the Evolved Transformer, or ET for short.
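
To make the search procedure more concrete, the sketch below illustrates tournament selection with Progressive Dynamic Hurdles in simplified form. It is only an illustration of the idea described in the paper: the candidate-model interface (mutate, train, fitness) and the fixed list of hurdles are hypothetical, and in the paper the hurdles are computed from the population's mean fitness as the search progresses rather than supplied up front.

 import random

 def evolve_with_pdh(population, num_rounds, sample_size, step_budgets, hurdles):
     # population: list of candidate models exposing mutate(), train(steps) and a
     # fitness attribute (hypothetical interface used only for illustration).
     for _ in range(num_rounds):
         # Tournament selection: sample a subset and take its fittest member as parent.
         parent = max(random.sample(population, sample_size), key=lambda m: m.fitness)
         child = parent.mutate()
         # Progressive Dynamic Hurdles: train in small increments and keep training
         # only while the child clears each hurdle, so weak candidates stay cheap.
         for steps, hurdle in zip(step_budgets, hurdles):
             child.train(steps)
             if child.fitness < hurdle:
                 break
         # Regularized evolution: add the child and retire the oldest member.
         population.append(child)
         population.pop(0)
     return max(population, key=lambda m: m.fitness)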

Comparison of the encoder blocks between the Vanilla Transformer and the Evolved Transformer (Source)

Comparison of the decoder blocks between the Vanilla Transformer and the Evolved Transformer (Source)

PyTorch Implementation of the Evolved Transformer

The pre-built, pre-trained architecture of the Evolved Transformer runs best on GPU or TPU devices. The model and the necessary files can be downloaded from the source repository using the following command.

!git clone https://github.com/Shikhar-S/EvolvedTransformer.git 

A successful download can be verified by listing the repository contents.

!ls EvolvedTransformer/

The environment with dependencies can be created using the following commands.

 %%bash
 cd EvolvedTransformer/
 pip3 install -r requirements.txt
 # download the spaCy English language model
 python3 -m spacy download en

Once the environment is created, the model can be retrained or evaluated with the built-in dataset or a custom dataset. The following commands run text classification on the built-in AG_NEWS dataset.

 %%bash
 cd EvolvedTransformer/
 python3 main.py 
 # run on Evolved Transformer’s encoder
 python3 main.py --evolved true  

A generalized Evolved Transformer block is also provided as a PyTorch class. Any custom architecture can incorporate this class to build a new model on top of the Evolved Transformer. The following code defines the EvolvedTransformerBlock class.

 from models.embedder import Embedder, PositionalEncoder
 import math
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
 from models.gated_linear_unit import GLU

 class EvolvedTransformerBlock(nn.Module):
     def __init__(self, d_model, num_heads=8, ff_hidden=4):
         super(EvolvedTransformerBlock, self).__init__()
         self.attention = nn.MultiheadAttention(d_model, num_heads)
         self.layer_norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])
         self.feed_forward = nn.Sequential(
             nn.Linear(d_model, ff_hidden*d_model),
             nn.ReLU(),
             nn.Linear(ff_hidden*d_model, d_model),
         )
         # Gated linear unit branch
         self.glu = GLU(d_model, 1)
         # Left branch: position-wise linear projection with ReLU
         self.left_net = nn.Sequential(
             nn.Linear(d_model, ff_hidden*d_model),
             nn.ReLU()
         )
         # Right branch: 3x1 convolution over the sequence dimension
         self.right_net = nn.Sequential(
             nn.Conv1d(in_channels=d_model, out_channels=d_model//2, kernel_size=3, padding=1),
             nn.ReLU()
         )
         self.mid_layer_norm = nn.LayerNorm(d_model*ff_hidden)
         # Stand-in for the Evolved Transformer's separable 9x1 convolution
         self.sep_conv = nn.Sequential(
             nn.Conv1d(in_channels=d_model*ff_hidden, out_channels=1, kernel_size=9, padding=4),
             nn.Conv1d(in_channels=1, out_channels=d_model, kernel_size=1)
         )

     def forward(self, x):
         # GLU branch with a residual connection
         glued = self.glu(self.layer_norms[0](x)) + x
         glu_normed = self.layer_norms[1](glued)
         # Two parallel branches; the narrower right branch is zero-padded to match the left
         left_branch = self.left_net(glu_normed)
         right_branch = self.right_net(glu_normed.transpose(1, 2)).transpose(1, 2)
         right_branch = F.pad(input=right_branch, pad=(0, left_branch.shape[2]-right_branch.shape[2], 0, 0, 0, 0), mode='constant', value=0)
         mid_result = left_branch + right_branch
         mid_result = self.mid_layer_norm(mid_result)
         mid_result = self.sep_conv(mid_result.transpose(1, 2)).transpose(1, 2)
         mid_result = mid_result + glued
         # Standard self-attention and feed-forward sub-layers, each with residuals
         normed = self.layer_norms[2](mid_result)
         normed = normed.transpose(0, 1)  # nn.MultiheadAttention expects (seq, batch, dim)
         attended = self.attention(normed, normed, normed, need_weights=False)[0].transpose(0, 1) + mid_result
         normed = self.layer_norms[3](attended)
         forwarded = self.feed_forward(normed) + attended
         return forwarded
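
As a quick sanity check, the block can be instantiated on its own and applied to a random batch. The snippet below is a minimal usage sketch: it assumes the repository's models package is on the Python path (for the GLU import above), and the batch size, sequence length and model dimension are arbitrary illustrative values.

 import torch

 d_model, seq_len, batch_size = 32, 20, 4
 block = EvolvedTransformerBlock(d_model=d_model, num_heads=8, ff_hidden=4)

 x = torch.randn(batch_size, seq_len, d_model)  # (batch, sequence, features)
 y = block(x)
 print(y.shape)  # the block preserves the input shape: torch.Size([4, 20, 32])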

Custom configurations for the pre-built model can be set through the following argument parser.

 import argparse
 import logging
 import utils

 logger = utils.get_logger()

 def str2bool(v):
     # interpret the string 'true' (case-insensitive) as True, anything else as False
     return v.lower() in ('true',)

 parser = argparse.ArgumentParser()
 parser.add_argument("--batch", type=int, default=16)
 parser.add_argument("--evolved", type=str2bool, default=False)
 parser.add_argument("--epochs", type=int, default=10)
 parser.add_argument("--model_dim", type=int, default=32)
 parser.add_argument("--max_seq_len", type=int, default=200)
 parser.add_argument("--backend", type=str, default='auto', choices=['cpu', 'gpu', 'auto'])
 parser.add_argument('--ngrams', type=int, default=2)
 parser.add_argument('--train_split', type=float, default=0.95)

 def get_args():
     logger.info('Parsing arguments')
     args, unparsed = parser.parse_known_args()
     return args, unparsed
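
For a quick check of the parser, the command line can be simulated in Python before handing control to the training script; the flag values below are arbitrary examples.

 import sys

 # Simulate a command line and parse it with the parser defined above
 sys.argv = ['main.py', '--evolved', 'true', '--batch', '32', '--model_dim', '64']
 args, unparsed = get_args()
 print(args.evolved, args.batch, args.model_dim)  # True 32 64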

TensorFlow Implementation of the Evolved Transformer

The TensorFlow implementation of the Evolved Transformer is available through the Tensor2Tensor (T2T) framework. It provides a highly efficient pre-built model that can be set up in minimal time, even on a CPU device. The following command installs Tensor2Tensor and its dependencies on the local machine or in a cloud environment.

!pip install tensor2tensor

For custom applications of the Evolved Transformer, or to build an architecture on top of it, a module is provided in the models package of the Tensor2Tensor framework; it can be imported using the following command.

from tensor2tensor.models import evolved_transformer
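
The module exposes the registered model class along with its hyperparameter sets. As a minimal sketch (assuming the hparams function evolved_transformer_base is exposed under that name, matching the registry entry used later), the base configuration can be inspected directly in Python:

 from tensor2tensor.models import evolved_transformer

 # Inspect the base hyperparameter set of the registered Evolved Transformer model
 hparams = evolved_transformer.evolved_transformer_base()
 print(evolved_transformer.EvolvedTransformer)
 print(hparams.hidden_size, hparams.num_heads, hparams.num_hidden_layers)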

Tensor2Tensor integrates many well-known models and datasets in one place. The following command lists the registered models, problems (datasets) and hyperparameter sets. Users can choose any model and problem of interest and run them either from a terminal or from Python code.

!t2t-trainer --registry_help

The Evolved Transformer can be invoked on an example translation problem using the following commands. Here the base version of the model is run on a CPU for the sake of simplicity, with the WMT 2014 English-to-German translation task as our problem. Note that training may take hours depending on the device configuration. Also note that each %%bash cell runs in its own shell, so if the steps are executed as separate notebook cells, the environment variables defined below must be repeated in each cell (or the whole sequence run in a single terminal session).

Setting the initial parameters, defining the problem and the model,

 %%bash
 PROBLEM=translate_ende_wmt32k
 MODEL=evolved_transformer
 HPARAMS=evolved_transformer_base
 DATA_DIR=$HOME/t2t_data
 TMP_DIR=/tmp/t2t_datagen
 TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
 mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR
 # Generate data
 t2t-datagen \
   --data_dir=$DATA_DIR \
   --tmp_dir=$TMP_DIR \
   --problem=$PROBLEM 

Training the model based on our requirements,

 %%bash
 # Train
 # *  If you run out of memory, add --hparams='batch_size=1024'.
 t2t-trainer \
   --data_dir=$DATA_DIR \
   --problem=$PROBLEM \
   --model=$MODEL \
   --hparams_set=$HPARAMS \
   --output_dir=$TRAIN_DIR 

Running the decoder on sample English inputs, along with a German reference file for later scoring,

 %%bash
 # Decode
 DECODE_FILE=$DATA_DIR/decode_this.txt
 echo "Hello world" >> $DECODE_FILE
 echo "Goodbye world" >> $DECODE_FILE
 echo -e 'Hallo Welt\nAuf Wiedersehen Welt' > ref-translation.de
 BEAM_SIZE=4
 ALPHA=0.6
 t2t-decoder \
   --data_dir=$DATA_DIR \
   --problem=$PROBLEM \
   --model=$MODEL \
   --hparams_set=$HPARAMS \
   --output_dir=$TRAIN_DIR \
   --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
   --decode_from_file=$DECODE_FILE \
   --decode_to_file=translation.en 

Viewing the generated translations for the test text,

 %%bash
 # See the translations
 cat translation.en 

Evaluating the BLEU score against the reference translation,

 %%bash
 # Evaluate the BLEU score
 t2t-bleu --translation=translation.en --reference=ref-translation.de 

Wrapping up

The Evolved Transformer outperforms the vanilla Transformer, achieving a BLEU score of 29.8 on the WMT 2014 English-German translation task at big model size, while at smaller sizes it matches the quality of the big Transformer with 37.6% fewer parameters. It thus uses relatively less memory and demonstrates greater computational efficiency.

Comparison of performance between the Vanilla Transformer and the Evolved Transformer (Source)
