Practical applications of natural language processing have been revolutionised by the advent of pre-trained models. Not only have they democratised the development of ML applications, allowing amateurs to build working systems, but they also help experts achieve better results without training a model from scratch.
Pre-trained models have also proved a useful way for newer practitioners to learn from existing architectures, which can then be fine-tuned to create innovative applications. Pre-trained models are simple to incorporate and don’t require much labelled data to work with, which makes them versatile across many business problems, from prediction and transfer learning to feature extraction.
Here are the top eight pre-trained language models that have accelerated natural language processing applications in the real world.

1| OpenAI’s GPT-3
The successor to GPT and GPT-2, GPT-3 is one of the most controversial pre-trained models from OpenAI. This large-scale transformer-based language model was trained with 175 billion parameters, ten times more than any previous non-sparse language model. It achieves strong performance on many NLP datasets, including translation and question answering, as well as tasks that require on-the-fly reasoning, such as unscrambling words. With its recent advancements, it has even been used to write news articles and generate code, helping developers build ML applications. GPT-3 is the largest model so far, and its impressive capabilities have positioned it ahead of other text-prediction models. In June this year, the company released an API that allows users to access its new AI models remotely.
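For a sense of how developers reach the model, here is a minimal sketch of a completion request using the openai Python client as it looked around the API’s launch; the engine name, prompt and sampling settings are illustrative assumptions, and a valid API key from OpenAI is required.

```python
# Minimal sketch of a GPT-3 completion request via the original OpenAI API
# (assumes the early `openai` Python client and a valid API key; the prompt
# and engine choice here are illustrative, not prescribed by the article).
import openai

openai.api_key = "YOUR_API_KEY"  # issued by OpenAI once API access is granted

response = openai.Completion.create(
    engine="davinci",  # the largest GPT-3 engine exposed at launch
    prompt="Translate to French: 'The weather is nice today.'",
    max_tokens=32,
    temperature=0.7,
)
print(response["choices"][0]["text"])
```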
2| Google’s BERT
Bidirectional Encoder Representations from Transformers, or BERT, is a pre-trained NLP model developed by Google in 2018. With it, anyone in the world can train their own question-answering model in about 30 minutes on a single Cloud TPU, or in a few hours using a single GPU. With the release, the company showcased its performance on 11 NLP tasks, including the very competitive Stanford Question Answering Dataset (SQuAD). BERT was pre-trained on 2,500 million words of Wikipedia and 800 million words of BooksCorpus, and this unlabelled text was used to pre-train a deep bidirectional network. According to the researchers, BERT achieved a 93.2% F1 score (a measure of accuracy) on SQuAD v1.1, surpassing the previous state-of-the-art results.
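As a rough illustration of that question-answering use case, here is a minimal sketch using the Hugging Face transformers library and a BERT checkpoint already fine-tuned on SQuAD; the library choice and the example question are assumptions on our part, not part of Google’s original TensorFlow release.

```python
# Minimal sketch: question answering with a BERT checkpoint fine-tuned on SQuAD.
# Uses the Hugging Face `transformers` library, which is an assumption on our
# part; Google's original release shipped TensorFlow code instead.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

result = qa(
    question="Who developed BERT?",
    context="BERT is a pre-trained NLP model developed by Google in 2018.",
)
print(result["answer"], result["score"])
```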
3| Microsoft’s CodeBERT
Microsoft’s CodeBERT, with the ‘BERT’ suffix referring to Google’s BERT framework for NLP, is built on a bidirectional multi-layer Transformer architecture. By learning the connection between natural language and programming languages, the model can support tasks such as code search and code documentation generation. CodeBERT has been evaluated on NL-PL tasks by fine-tuning its parameters, achieving strong performance on both natural-language code search and code documentation generation. The model was trained on a large dataset drawn from GitHub code repositories in six programming languages, comprising 2.1 million bimodal data points (code paired with documentation) and 6.4 million unimodal code snippets.
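A minimal sketch of the natural-language code search idea is shown below, assuming the microsoft/codebert-base checkpoint on the Hugging Face hub; the mean-pooling and cosine-similarity scoring are simplifications for illustration rather than the fine-tuned pipeline from the paper.

```python
# Minimal sketch of using CodeBERT embeddings for natural-language code search.
# The `microsoft/codebert-base` checkpoint is on the Hugging Face hub; the
# pooling and similarity scoring below are illustrative simplifications.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

query = embed("reverse a list in python")
snippet = embed("def reverse(xs):\n    return xs[::-1]")
score = torch.cosine_similarity(query, snippet, dim=0)
print(f"similarity: {score.item():.3f}")
```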
4| ELMo
ELMo, short for Embeddings from Language Models, is a deep contextualised word representation that models both the syntax and semantics of words as well as how their usage varies across linguistic contexts. The model, developed by AllenNLP, was pre-trained on a huge text corpus and learns its representations from a deep bidirectional language model (biLM). ELMo can easily be added to existing models, which drastically improves performance across a wide range of NLP problems, including question answering, textual entailment and sentiment analysis.
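Here is a minimal sketch of producing ELMo embeddings with the AllenNLP library; the toolkit choice, the example sentence and the local file paths for the released options and weights files are assumptions for illustration.

```python
# Minimal sketch: contextual ELMo embeddings via the AllenNLP library (a
# toolkit choice of ours; the article does not prescribe one). The options
# and weights files are the ones distributed with the AllenNLP ELMo release;
# download them and point the placeholder paths below at them.
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "elmo_options.json"  # placeholder path to the released options file
weight_file = "elmo_weights.hdf5"   # placeholder path to the released weights file

# num_output_representations=1 yields one mixed representation per token.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["Pre-trained", "models", "accelerate", "NLP"]]
character_ids = batch_to_ids(sentences)
embeddings = elmo(character_ids)["elmo_representations"][0]
print(embeddings.shape)  # (batch_size, num_tokens, 1024) for the original biLM
```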
5| XLNet
XLNet by Google is an extension of the Transformer-XL model, pre-trained using a generalised autoregressive method that learns from bidirectional contexts. Not only can it perform NLP tasks such as text classification, sentiment analysis and question answering, along with the essential GLUE benchmark for English, it has also outperformed BERT on many of them. According to the researchers, XLNet surpassed BERT on 20 tasks, including SQuAD, GLUE and RACE. In addition, the model avoids the pretrain-finetune discrepancy that BERT suffers from and does away with BERT’s independence assumption over masked tokens.
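As an illustration of the text-classification use case, here is a minimal sketch that prepares XLNet for sentiment fine-tuning with the Hugging Face transformers library; the library, label count and example sentence are assumptions, and the freshly initialised classification head still needs training on labelled data before its predictions mean anything.

```python
# Minimal sketch: preparing XLNet for a sentiment-classification fine-tune
# with Hugging Face `transformers` (a toolkit choice of ours, not Google's
# release). The classification head is newly initialised and untrained.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlnet-base-cased", num_labels=2  # e.g. positive / negative
)

inputs = tokenizer("The film was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # roughly uniform until the head is fine-tuned
```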
6| Google’s ALBERT
Google’s ALBERT is a deep-learning NLP model, an upgrade of BERT, which has advanced the state of the art on 12 NLP tasks, including the competitive SQuAD v2.0 and the SAT-style reading comprehension benchmark RACE. The model has been released as an open-source implementation on the TensorFlow framework and includes many ready-to-use pre-trained language representation models. ALBERT uses 89% fewer parameters than the BERT model, only 12M, with comparatively little loss of accuracy at evaluation, averaging 80.1%. The model relies on two optimisations to reduce its size: factorisation of the embedding layer and parameter sharing across the hidden layers of the network.
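The parameter savings are easy to verify. Below is a minimal sketch, assuming the Hugging Face transformers library and the albert-base-v2 and bert-base-uncased checkpoints, that simply counts the parameters in both models.

```python
# Minimal sketch: comparing ALBERT's and BERT's parameter counts via the
# Hugging Face `transformers` library (our toolkit choice for illustration).
from transformers import AutoModel

albert = AutoModel.from_pretrained("albert-base-v2")
bert = AutoModel.from_pretrained("bert-base-uncased")

def count_params(model) -> float:
    """Total number of parameters, in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6

print(f"ALBERT-base parameters: {count_params(albert):.1f}M")  # roughly 12M
print(f"BERT-base parameters:   {count_params(bert):.1f}M")    # roughly 110M
```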
7| ULMFiT
ULMFiT, also known as Universal Language Model Fine-tuning, is an effective transfer learning method that can be applied to a wide range of NLP tasks. The model significantly outperforms prior approaches on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labelled examples it matches the performance of training from scratch on 100 times more data. The creators have open-sourced their pre-trained models and code for developers to use.
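A minimal sketch of the ULMFiT recipe with the fastai library, which ships the pre-trained AWD-LSTM backbone, is shown below; the IMDB sentiment dataset is purely an illustrative choice, and the full recipe also fine-tunes the language model on target-domain text before training the classifier.

```python
# Minimal sketch of ULMFiT-style transfer learning with the fastai library.
# IMDB is used purely as an illustration; the article names no specific task.
from fastai.text.all import *

path = untar_data(URLs.IMDB)
dls = TextDataLoaders.from_folder(path, valid="test")

# A classifier on top of the pre-trained AWD-LSTM language model,
# fine-tuned for a few epochs (frozen first, then unfrozen).
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e-2)
```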
8| Facebook’s RoBERTa
Facebook’s RoBERTa is an optimised method for pre-training a self-supervised NLP system, built on BERT’s language-masking strategy. The model learns to predict intentionally hidden sections of text within otherwise unannotated language examples. RoBERTa modifies key hyperparameters of BERT, allowing it to improve on the masked language modelling objective and deliver better downstream task performance. The researchers also trained RoBERTa on more data than BERT and for a longer period, drawing on existing unannotated natural language processing datasets as well as a corpus of public news articles.
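To see the masked-language-modelling objective in action, here is a minimal sketch using the Hugging Face transformers fill-mask pipeline with the roberta-base checkpoint; the library choice and example sentence are assumptions on our part (Facebook’s own release lives in fairseq).

```python
# Minimal sketch: RoBERTa's masked-language-modelling objective in action,
# via the Hugging Face `transformers` fill-mask pipeline (our toolkit choice;
# Facebook's official release is distributed through fairseq).
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")
for prediction in fill("Pre-trained language models have <mask> NLP."):
    print(prediction["token_str"], round(prediction["score"], 3))
```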