How ELECTRA outperforms RoBERTa, ALBERT and XLNet

ELECTRA achieves state-of-the-art performance in language representation learning, outperforming the previous leaders RoBERTa, ALBERT and XLNet. At the same time, ELECTRA is efficient, requiring considerably less compute than these other language representation learning methods.

Representation learning methods in language modeling such as BERT and its variants follow the Masked Language Modeling (MLM) pre-training approach. In this approach, a subset of around 15% of the input tokens is masked before the sequence is fed into the model. In attention-based models such as XLNet, attention to those tokens is masked to conceal their identity. Denoising-autoencoder-like networks are then trained on these inputs to recover the original tokens. These approaches behave like generators whose objective is to reconstruct the original tokens from the noisy masked input. Because these models learn from only around 15% of the tokens in each example, they require enormous computing power and time.
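The masking step described above can be sketched in plain Python. This is a minimal illustration, not the actual BERT/ELECTRA preprocessing code; the function name and the `[MASK]` token string follow common convention, and the 15% rate follows the description above.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace a random ~15% subset of tokens with a mask token.

    Returns the corrupted sequence and the masked positions, which a
    masked language model would then be trained to recover.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = mask_token
    return masked, positions

tokens = "the chef cooked the meal in the kitchen".split()
masked, positions = mask_tokens(tokens)
# Only the masked ~15% of positions contribute to the MLM training signal,
# which is why MLM pre-training makes inefficient use of each example.
```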

The newly introduced ELECTRA follows a different approach from all of its predecessors in language representation learning. Rather than acting as a generator that reproduces the original tokens, ELECTRA behaves like a discriminator. In this pre-training method, selected tokens are replaced with plausible alternatives synthesized by a small masked language model. The objective of ELECTRA is then to identify the corrupted tokens among all of the input tokens. Thus ELECTRA behaves like a discriminator while almost every other masked language model behaves like a generator. The key advantage of this model is that it learns from all of the input tokens, requiring far less compute power and time.

ELECTRA is short for ‘Efficiently Learning an Encoder that Classifies Token Replacements Accurately’, introduced by Kevin Clark and Christopher D. Manning of Stanford University and Minh-Thang Luong and Quoc V. Le of Google Brain. Pre-training ELECTRA requires a generator that receives masked inputs, as in BERT, and generates replacement tokens. Once pre-training is complete, the generator is discarded, and the discriminator alone is employed in downstream natural language applications by fine-tuning it per task.
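The discriminator's training target can be sketched as follows. This is a simplified illustration of replaced-token detection on pre-tokenized text; in the real model a transformer predicts these per-token labels, and the function name here is hypothetical.

```python
def replaced_token_labels(original, corrupted):
    """ELECTRA's per-token discriminator target: 1 if the token at a
    position was replaced by the generator, 0 if it is the original.

    If the generator happens to sample the original token, the position
    is labelled as real (0), not replaced.
    """
    return [int(orig != corr) for orig, corr in zip(original, corrupted)]

original = "the chef cooked the meal".split()
corrupted = "the chef ate the meal".split()  # generator replaced "cooked"
labels = replaced_token_labels(original, corrupted)
# labels == [0, 0, 1, 0, 0] -- every position yields a training signal,
# unlike MLM, where only the masked ~15% of positions do.
```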

An example demonstrating pre-training of ELECTRA

Python Implementation of ELECTRA

Step-1: Create Environment

Requirements are Python 3+, TensorFlow 1.15, NumPy, SciPy and Scikit-Learn. Pre-training and fine-tuning for downstream applications require a GPU runtime environment. 

 pip install tensorflow==1.15
 pip install numpy
 pip install scipy
 pip install scikit-learn 

Step-2: Download Source Code

The following command downloads the ELECTRA source code to the local environment.

!git clone


Step-3: Create a data directory

Create a new directory to store vocabulary and other data.

 cd electra/
 mkdir DATA_DIR 

Verify that the necessary files were downloaded and the new directory DATA_DIR was created:

!ls electra -p


Step-4: Download Vocabulary

Download the vocabulary file used by the ELECTRA model to the data directory using the following commands.

 cd electra/
 wget -O DATA_DIR/vocab.txt 


Step-5: Download Corpus for training

Download OpenWebTextCorpus (12GB) to the data directory as a zipped binary file.

 wget --load-cookies /tmp/cookies.txt "$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate '' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1EA5V0oetDCOke7afsktL_JDQ-ETtNOvx" -O openwebtext.tar.xz && rm -rf /tmp/cookies.txt
 mv openwebtext.tar.xz electra/DATA_DIR/ 

Then extract the downloaded archive using the following command.

 cd electra/DATA_DIR
 tar xf openwebtext.tar.xz 

Step-6: Preprocess the data and tokenize

The following command preprocesses the downloaded data, tokenizes it, converts it into tfrecords and saves them in a sub-directory named ‘pretrain_tfrecords’.

!python3 --data-dir $DATA_DIR --num-processes 5
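As a rough illustration of what this preprocessing step does, the sketch below shows only the chunking of a tokenized corpus into fixed-length examples. The function name and sequence length are illustrative assumptions; the repository's script additionally handles tokenization and TFRecord serialization.

```python
def chunk_corpus(token_ids, max_seq_length=128):
    """Group a flat stream of token ids into fixed-length pre-training
    examples; a short remainder at the end is dropped."""
    return [
        token_ids[i:i + max_seq_length]
        for i in range(0, len(token_ids) - max_seq_length + 1, max_seq_length)
    ]

examples = chunk_corpus(list(range(300)), max_seq_length=128)
# Two full examples of 128 tokens each; the trailing 44 tokens are discarded.
```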

Step-7: Pre-training ELECTRA’s small model

Pre-training of the small model is performed on the downloaded data using the following command. Note that the process may take more than four days on a Tesla V100 GPU.

!python3 --data-dir $DATA_DIR --model-name electra_small_owt

Step-8: Evaluation of model

Evaluation is performed by rerunning the same pre-training script with the options changed as shown in the following command.

!python3 --data-dir $DATA_DIR --model-name electra_small_owt --hparams '{"do_train": false, "do_eval": true}'

Step-9: Fine-tuning for downstream applications

Fine-tuning can be done on specific tasks. The data corresponding to the task must be downloaded to the data directory in the prescribed tokenized format; the repository provides a script that downloads all GLUE tasks to the data directory at once. The following command fine-tunes the model for the MNLI task. Users can opt for any available task by changing the task_names option.

!python3 --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["mnli"]}'

Performance of ELECTRA

The developers evaluated pre-trained ELECTRA on the GLUE benchmark (General Language Understanding Evaluation) and the SQuAD benchmark (Stanford Question Answering Dataset). A variety of tasks covering textual entailment, question-answer entailment, paraphrase, question paraphrase, textual similarity, sentiment and linguistic acceptability were performed with ELECTRA. The base ELECTRA generator-discriminator pair was pre-trained on the same data as BERT, which consists of 3.3 billion tokens from Wikipedia and BookCorpus. Large ELECTRA was trained on the same data as XLNet, which combines BERT’s dataset with ClueWeb, CommonCrawl and Gigaword. Different generator and discriminator sizes as well as different algorithms were tried to arrive at the best version of ELECTRA.

Selection of best generator and discriminator sizes
Selection of best performing Algorithm

On small downstream GLUE tasks with identical device configurations, the ELECTRA base model performs well even when trained on a single GPU, scoring 5 GLUE points higher than a comparable BERT model and even outscoring the much larger GPT and ELMo models.

Comparison of performance of ELECTRA-Base with ELMo, GPT and BERT models 

On the big GLUE dataset tasks, ELECTRA’s large model outperforms RoBERTa, ALBERT and XLNet in most tasks, while consuming around a quarter of the compute that the latter models used during their respective pre-training.

Comparison of performance of ELECTRA-Large with RoBERTa, ALBERT and XLNet models on various downstream tasks (in GLUE score)

Similarly, ELECTRA’s large model outperforms all of its predecessors on SQuAD dataset versions 1.1 and 2.0. Increasing the number of training FLOPs of ELECTRA increases its performance on both the GLUE and SQuAD datasets.

Comparison of performance different models on SQuAD dataset

Wrapping Up

ELECTRA is the present state-of-the-art on the GLUE and SQuAD benchmarks. It is a self-supervised language representation learning model that can pre-train transformer networks using relatively little compute. It performs replaced-token detection with the help of a generator composed of a small masked language model. ELECTRA is compute-efficient and works even on a small-memory device (a single GPU), and it yields strong performance in almost all downstream natural language applications. The generator and discriminator networks of the present ELECTRA could be fine-tuned further to arrive at better models in the future.

Note: Images and illustrations other than code outputs are obtained from original research paper.


Copyright Analytics India Magazine Pvt Ltd
