Last updated March 15, 2021
Hands-on Guide to The Evolved Transformer on Neural Machine Translation

Published on March 15, 2021

by Rajkumar Lakshmanamoorthy

The Evolved Transformer (ET) is an architecture discovered through neural architecture search (NAS) for sequence-to-sequence tasks such as neural machine translation (NMT). It outperforms the vanilla Transformer on translation tasks, achieving higher BLEU scores with fewer model parameters and better computational efficiency.

Recurrent neural networks long delivered strong performance on sequence-to-sequence tasks. However, their large number of parameters created demand for a more computationally efficient architecture. Convolutional and feed-forward networks outperformed traditional recurrent networks on specific tasks, but neither yielded a general architecture for arbitrary sequence problems.

Transformers, built on self-attention, have emerged as a better and more general alternative. They now solve problems across many modalities, including text, audio, image and video. Many variants and extensions of the vanilla Transformer have been developed for specific tasks. Notable examples include the Vision Transformer and the ResNet-backed Vision Transformer for computer vision, and TransUNet for medical image segmentation. Similarly, Google Brain researchers David R. So, Chen Liang and Quoc V. Le developed the Evolved Transformer through neural architecture search (NAS), targeting neural machine translation.

Models discovered by neural architecture search have begun to outperform human-designed models in many applications. Evolution-based architecture search has been reported to be more efficient than reinforcement-learning-based search, especially when training resources are limited. In this work, tournament selection, a well-known evolutionary algorithm, drives the search. The vanilla Transformer seeds the initial population to warm-start the search. The search is performed directly on the WMT 2014 English-German translation task, made feasible by the newly developed Progressive Dynamic Hurdles (PDH) algorithm, which allocates more training resources to the more promising candidates. The search produced a new model that outperforms the vanilla Transformer on four well-established language tasks:

WMT 2014 English-German (En-De) translation task,
WMT 2014 English-French (En-Fr) translation task,
WMT 2014 English-Czech (En-Cs) translation task and
the 1 Billion Word Language Model Benchmark (LM1B).

This model is named the Evolved Transformer, or ET for short.
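The tournament selection loop mentioned above can be illustrated with a toy sketch. This is not the authors' search code: the genome (a list of integers), `toy_fitness`, `mutate` and every parameter value are hypothetical stand-ins, whereas in the real search a genome encodes a Transformer architecture and fitness comes from training on WMT 2014 En-De. Progressive Dynamic Hurdles is not modelled here.

```python
import random


def toy_fitness(genome, target):
    # Hypothetical fitness: negative squared distance to a target vector.
    return -sum((g - t) ** 2 for g, t in zip(genome, target))


def mutate(genome, rng):
    # Change one randomly chosen gene by +1 or -1.
    child = list(genome)
    index = rng.randrange(len(child))
    child[index] += rng.choice([-1, 1])
    return child


def tournament_search(seed_genome, target, population_size=20,
                      sample_size=5, steps=200, seed=0):
    rng = random.Random(seed)
    # Warm start: every initial individual is a copy of the seed genome.
    population = [list(seed_genome) for _ in range(population_size)]

    def score(genome):
        return toy_fitness(genome, target)

    for _ in range(steps):
        # The fittest member of a random sample becomes the parent.
        parent = max(rng.sample(population, sample_size), key=score)
        population.append(mutate(parent, rng))
        # The least fit member of a second random sample is removed.
        loser = min(rng.sample(population, sample_size), key=score)
        population.remove(loser)
    return max(population, key=score)


target = [3, -2, 1]
best = tournament_search([0, 0, 0], target)
print(best, toy_fitness(best, target))
```

Because only the weakest member of a sample is ever removed, the best fitness in the population cannot decrease from one step to the next in this sketch.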

PyTorch Implementation of the Evolved Transformer

The pre-built Evolved Transformer architecture runs best on GPU or TPU devices. The model code and supporting files can be downloaded by cloning the source repository with the following command.

!git clone https://github.com/Shikhar-S/EvolvedTransformer.git

The download can be verified by listing the contents of the cloned directory.

!ls EvolvedTransformer/

The dependencies can be installed using the following commands.

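 %%bash
 cd EvolvedTransformer/
 pip3 install -r requirements.txt
 # install the spaCy English model
 # (the `en` shortcut works on spaCy 2.x; spaCy 3 removed shortcuts
 #  and expects the full model name, en_core_web_sm)
 python3 -m spacy download en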

Once the environment is set up, the model can be trained or evaluated on the built-in dataset or on a custom dataset. The following commands run text classification on the built-in AG_NEWS dataset.

 %%bash
 cd EvolvedTransformer/
 # run with the default encoder (--evolved defaults to false)
 python3 main.py
 # run on Evolved Transformer’s encoder
 python3 main.py --evolved true

A general-purpose Evolved Transformer block is implemented as a PyTorch nn.Module class. Any custom architecture can use this class to build a new model on top of the Evolved Transformer. The following code defines the EvolvedTransformerBlock class.

 import math
 import torch
 import torch.nn as nn
 import torch.nn.functional as F

 # Modules that ship with the cloned repository
 from models.embedder import Embedder, PositionalEncoder
 from models.gated_linear_unit import GLU


 class EvolvedTransformerBlock(nn.Module):
     def __init__(self, d_model, num_heads=8, ff_hidden=4):
         super().__init__()
         self.attention = nn.MultiheadAttention(d_model, num_heads)
         # One LayerNorm in front of each of the four sub-layers
         self.layer_norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])
         # Position-wise feed-forward network, applied last
         self.feed_forward = nn.Sequential(
             nn.Linear(d_model, ff_hidden * d_model),
             nn.ReLU(),
             nn.Linear(ff_hidden * d_model, d_model),
         )
         self.glu = GLU(d_model, 1)
         # Left branch: dense layer that widens to ff_hidden * d_model
         self.left_net = nn.Sequential(
             nn.Linear(d_model, ff_hidden * d_model),
             nn.ReLU()
         )
         # Right branch: kernel-3 convolution that narrows to d_model // 2
         self.right_net = nn.Sequential(
             nn.Conv1d(in_channels=d_model, out_channels=d_model // 2, kernel_size=3, padding=1),
             nn.ReLU()
         )
         self.mid_layer_norm = nn.LayerNorm(d_model * ff_hidden)
         # Kernel-9 convolution down to one channel, then a kernel-1
         # convolution back up to d_model channels
         self.sep_conv = nn.Sequential(
             nn.Conv1d(in_channels=d_model * ff_hidden, out_channels=1, kernel_size=9, padding=4),
             nn.Conv1d(in_channels=1, out_channels=d_model, kernel_size=1)
         )

     def forward(self, x):
         # x has shape (batch, seq_len, d_model)
         glued = self.glu(self.layer_norms[0](x)) + x
         glu_normed = self.layer_norms[1](glued)
         left_branch = self.left_net(glu_normed)
         # Conv1d expects (batch, channels, seq_len), hence the transposes
         right_branch = self.right_net(glu_normed.transpose(1, 2)).transpose(1, 2)
         # Zero-pad the feature dimension of the right branch to the left branch's width
         right_branch = F.pad(input=right_branch, pad=(0, left_branch.shape[2] - right_branch.shape[2], 0, 0, 0, 0), mode='constant', value=0)
         mid_result = left_branch + right_branch
         mid_result = self.mid_layer_norm(mid_result)
         mid_result = self.sep_conv(mid_result.transpose(1, 2)).transpose(1, 2)
         mid_result = mid_result + glued
         normed = self.layer_norms[2](mid_result)
         # nn.MultiheadAttention expects (seq_len, batch, d_model) by default
         normed = normed.transpose(0, 1)
         attended = self.attention(normed, normed, normed, need_weights=False)[0].transpose(0, 1) + mid_result
         normed = self.layer_norms[3](attended)
         forwarded = self.feed_forward(normed) + attended
         return forwarded
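The residual additions in the block only work because both convolutions keep the sequence length unchanged, and the zero-padding step has to make up the width difference between the two branches. The sketch below checks that arithmetic in plain Python, using the output-length formula documented for `torch.nn.Conv1d`. The helper name `conv1d_out_len` is hypothetical, and the values 32 and 200 are simply the `--model_dim` and `--max_seq_len` defaults of the repository's argument parser.

```python
def conv1d_out_len(seq_len, kernel_size, padding, stride=1, dilation=1):
    # Output length of a 1-D convolution, per the torch.nn.Conv1d documentation.
    return (seq_len + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


d_model, ff_hidden, seq_len = 32, 4, 200

# right_net: kernel_size=3, padding=1
right_len = conv1d_out_len(seq_len, kernel_size=3, padding=1)
# sep_conv: kernel_size=9, padding=4 (the kernel-1 convolution never changes length)
sep_len = conv1d_out_len(seq_len, kernel_size=9, padding=4)
# Zero columns that F.pad appends so the right branch matches the left branch
pad_width = ff_hidden * d_model - d_model // 2

print(right_len, sep_len, pad_width)
```

With these settings both convolutions return 200 positions, and 112 zero columns are appended to the 16-channel right branch to match the 128-wide left branch.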

The model's configuration is exposed through command-line arguments, which are defined with the following code.

 import argparse
 import logging
 import utils

 logger = utils.get_logger()


 def str2bool(v):
     # ('true') is a plain string, so `in` would match any substring
     # such as 't' or ''; compare for equality instead
     return v.lower() == 'true'


 parser = argparse.ArgumentParser()
 parser.add_argument("--batch", type=int, default=16)
 parser.add_argument("--evolved", type=str2bool, default=False)
 parser.add_argument("--epochs", type=int, default=10)
 parser.add_argument("--model_dim", type=int, default=32)
 parser.add_argument("--max_seq_len", type=int, default=200)
 parser.add_argument("--backend", type=str, default='auto', choices=['cpu', 'gpu', 'auto'])
 parser.add_argument('--ngrams', type=int, default=2)
 parser.add_argument('--train_split', type=float, default=0.95)


 def get_args():
     logger.info('Parsing arguments')
     # parse_known_args returns unrecognised flags instead of raising an error
     args, unparsed = parser.parse_known_args()
     return args, unparsed
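To see how these options behave, the following minimal sketch rebuilds a trimmed copy of the parser (three of the flags above, without the repository's `utils` logger) and parses an explicit argument list. The `build_parser` helper and the `--dropout` flag are hypothetical; `--dropout` is included only to show what `parse_known_args` does with a flag the parser does not define. The sketch compares the boolean string with equality.

```python
import argparse


def str2bool(value):
    # Only the literal string "true" (any casing) enables the flag.
    return value.lower() == "true"


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch", type=int, default=16)
    parser.add_argument("--evolved", type=str2bool, default=False)
    parser.add_argument("--model_dim", type=int, default=32)
    return parser


# Equivalent to: python3 main.py --evolved true --model_dim 64 --dropout 0.1
args, unparsed = build_parser().parse_known_args(
    ["--evolved", "true", "--model_dim", "64", "--dropout", "0.1"]
)
print(args.evolved, args.model_dim, args.batch, unparsed)
```

Flags that are omitted fall back to their defaults, and unrecognised flags are returned in the second value rather than causing an error.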

TensorFlow Implementation of the Evolved Transformer

The TensorFlow implementation of the Evolved Transformer is available through the Tensor2Tensor (T2T) framework. It provides a ready-made model that can be set up in minimal time, even on a CPU device. The following command installs Tensor2Tensor and its dependencies on the local machine or in a cloud environment.

!pip install tensor2tensor

To use the Evolved Transformer in a custom application or to build an architecture on top of it, the model is provided as a module in the models package of Tensor2Tensor and can be imported using the following command.

from tensor2tensor.models import evolved_transformer

Tensor2Tensor brings many well-known models and datasets together in one place. The following command lists the registered models, problems (datasets) and hyperparameter sets. Users can choose any model and problem of interest and run them either from a terminal or from Python code.

!t2t-trainer --registry_help

The Evolved Transformer can be run on an example translation problem using the following commands. For simplicity, the base hyperparameter set of the model is used and run on a CPU. The WMT 2014 English-to-German translation task is chosen as the problem. Note that training may take hours, depending on the device configuration.

Setting the initial parameters, defining the problem and the model, and generating the data:

 %%bash
 PROBLEM=translate_ende_wmt32k
 MODEL=evolved_transformer
 HPARAMS=evolved_transformer_base
 DATA_DIR=$HOME/t2t_data
 TMP_DIR=/tmp/t2t_datagen
 TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
 mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR
 # Generate data
 t2t-datagen \
   --data_dir=$DATA_DIR \
   --tmp_dir=$TMP_DIR \
   --problem=$PROBLEM

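Training the model:

 %%bash
 # Each %%bash cell runs in a fresh shell, so the variables are defined again
 PROBLEM=translate_ende_wmt32k
 MODEL=evolved_transformer
 HPARAMS=evolved_transformer_base
 DATA_DIR=$HOME/t2t_data
 TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
 # Train
 # *  If you run out of memory, add --hparams='batch_size=1024'.
 t2t-trainer \
   --data_dir=$DATA_DIR \
   --problem=$PROBLEM \
   --model=$MODEL \
   --hparams_set=$HPARAMS \
   --output_dir=$TRAIN_DIR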

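Running the decoder on two sample English sentences and writing their German reference translations:

 %%bash
 # Each %%bash cell runs in a fresh shell, so the variables are defined again
 PROBLEM=translate_ende_wmt32k
 MODEL=evolved_transformer
 HPARAMS=evolved_transformer_base
 DATA_DIR=$HOME/t2t_data
 TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
 # Decode
 DECODE_FILE=$DATA_DIR/decode_this.txt
 # ">" starts a fresh file so that re-running the cell does not duplicate lines
 echo "Hello world" > $DECODE_FILE
 echo "Goodbye world" >> $DECODE_FILE
 echo -e 'Hallo Welt\nAuf Wiedersehen Welt' > ref-translation.de
 BEAM_SIZE=4
 ALPHA=0.6
 t2t-decoder \
   --data_dir=$DATA_DIR \
   --problem=$PROBLEM \
   --model=$MODEL \
   --hparams_set=$HPARAMS \
   --output_dir=$TRAIN_DIR \
   --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
   --decode_from_file=$DECODE_FILE \
   --decode_to_file=translation.en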

Viewing the translations produced for the sample sentences:

 %%bash
 # See the translations
 cat translation.en

Evaluating the BLEU score against the reference translation:

 %%bash
 # Evaluate the BLEU score
 t2t-bleu --translation=translation.en --reference=ref-translation.de
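For readers unfamiliar with the metric, the sketch below shows the idea behind BLEU: clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. It is a simplified illustration with whitespace tokenization and no smoothing, not the `t2t-bleu` implementation, so its numbers should not be expected to match that tool. The function names and the shortened hypothesis "Auf Welt" are hypothetical, and `max_n=2` is used only because the sample sentences are too short for 4-grams.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    # Count every contiguous n-gram in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clipped matches: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


references = ["Hallo Welt", "Auf Wiedersehen Welt"]
perfect = corpus_bleu(references, references, max_n=2)
partial = corpus_bleu(["Hallo Welt", "Auf Welt"], references, max_n=2)
print(perfect, partial)
```

A hypothesis identical to the reference scores 1.0, while the shortened second sentence loses credit both through the missing bigram and through the brevity penalty.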

Wrapping up

The Evolved Transformer outperforms the vanilla Transformer on the WMT 2014 English-German translation task, reaching a BLEU score of 29.8 at big model size. At smaller sizes, it matches the quality of the original big Transformer with 37.6% fewer parameters. It therefore needs less memory and offers greater computational efficiency.
