Best Transformer-based LLMs on Hugging Face (Part 2)


In Part 1, we discussed how Transformers form the crux of NLP. So let’s take a look at autoregressive and sequence-to-sequence models. 

Autoregressive Model

Autoregressive models are trained on the language modeling task, predicting the next word based on the context. They function as the decoder part of the original transformer model, using a mask to only consider previous words during attention. While these models can be fine-tuned for various tasks, their primary use is text generation. Let’s take a look at some of them. 
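The decoder-style masking can be sketched in a few lines of Python (an illustration only, not any library's actual implementation): each position may attend to itself and to earlier positions, never to future ones.

```python
def causal_mask(seq_len):
    # mask[i][j] is True where position i may attend to position j;
    # the lower-triangular pattern hides all future tokens.
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

for row in causal_mask(4):
    print(["x" if allowed else "." for allowed in row])
# ['x', '.', '.', '.']
# ['x', 'x', '.', '.']
# ['x', 'x', 'x', '.']
# ['x', 'x', 'x', 'x']
```

During training this mask is applied inside every attention layer, which is what lets the model learn next-word prediction over a whole sequence in one pass.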

GPT: Improving Language Understanding by Generative Pre-Training  

In June 2018, OpenAI released GPT, the first pretrained Transformer model, which was then fine-tuned on various NLP tasks and obtained state-of-the-art results. By training a language model on diverse, unlabeled text and applying task-aware input transformations during fine-tuning, GPT achieved significant improvements across tasks. It outperformed task-specific models with gains such as 8.9% on commonsense reasoning, 5.7% on question answering, and 1.5% on textual entailment, showing the potential of unsupervised (pre-)training for discriminative tasks and the effectiveness of Transformers on data with long-range dependencies.


GPT-2: Language Models are Unsupervised Multitask Learners

An upgraded version of GPT, OpenAI's GPT-2 was pretrained on a large web-scraped dataset called WebText. Given a document and questions, it performs well on the CoQA dataset without needing large numbers of training examples. Scaling the language model up improves its performance across a range of tasks: GPT-2 outperformed other models on several language modeling datasets and produces noticeably more coherent text samples.

CTRL: A Conditional Transformer Language Model for Controllable Generation

In 2019, Salesforce introduced CTRL, a powerful 1.63-billion-parameter language model. CTRL uses control codes that dictate style, content, and task-specific behavior. These codes are derived from structure that naturally co-occurs with raw text, enabling controllable generation while retaining the advantages of unsupervised learning. CTRL can also predict which parts of the training data most likely produced a given sequence, offering a way to analyze large corpora through model-based source attribution.
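As a toy sketch of control-code conditioning (not the actual CTRL API; the code names follow examples described in the paper), a prompt is formed by simply prepending a control code, and the same text under different codes steers generation toward different styles:

```python
# Toy illustration: CTRL conditions generation by prepending a control
# code (e.g. "Reviews", "Links", "Wikipedia") to the prompt text.
def ctrl_prompt(control_code, prompt=""):
    return f"{control_code} {prompt}".strip()

ctrl_prompt("Reviews", "Rating: 5.0")    # steers toward review-style text
ctrl_prompt("Wikipedia", "Transformer")  # steers toward encyclopedic text
ctrl_prompt("Links")                     # a code alone can also be a prompt
```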

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Carnegie Mellon University and Google developed Transformer-XL to learn long-range dependencies without disrupting temporal coherence. It combines segment-level recurrence with a novel relative positional encoding scheme, enabling it to capture dependencies 80% longer than RNNs and 450% longer than vanilla Transformers. This improves performance on both short and long sequences and yields evaluation speedups of up to 1,800+ times over vanilla Transformers.
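Segment-level recurrence can be sketched as follows (a toy in which raw tokens stand in for cached hidden states; in the real model the cache holds hidden states from previous segments): each segment attends over its own positions plus a sliding window of memory carried over from earlier segments.

```python
def segment_recurrence(token_stream, seg_len, mem_len):
    # Process a long stream in fixed-size segments, letting each
    # segment attend over a rolling memory of past states.
    contexts, mems = [], []
    for i in range(0, len(token_stream), seg_len):
        segment = token_stream[i:i + seg_len]
        contexts.append(mems + segment)      # attention context: memory + segment
        mems = (mems + segment)[-mem_len:]   # slide the memory window forward
    return contexts

segment_recurrence(list("abcdef"), seg_len=2, mem_len=3)
# [['a', 'b'], ['a', 'b', 'c', 'd'], ['b', 'c', 'd', 'e', 'f']]
```

The later segments see context reaching beyond their own boundary, which is how the model captures dependencies longer than any single segment.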

Reformer: The Efficient Transformer

Also from Google Research, the Reformer is a Transformer variant designed with new methods to reduce memory usage and computation time. These include axial positional encoding, which handles long sequences without a huge positional-encoding matrix; LSH (locality-sensitive hashing) attention, which saves computation in the attention layers; reversible layers, so intermediate results are not stored for every layer but recomputed during the backward pass, saving memory; and chunked feed-forward computation, which processes the feed-forward layers in smaller chunks instead of over the whole batch at once. As a result, the model can handle much longer sequences than traditional autoregressive Transformers.
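The chunking trick works because the feed-forward block acts on each position independently, so applying it chunk by chunk trades peak memory for a little loop overhead without changing the result. A minimal sketch (illustrative only, with a stand-in function in place of a real FFN):

```python
def chunked_feed_forward(ff, hidden_states, chunk_size):
    # Apply a position-wise function to the sequence in small chunks
    # instead of all at once; the output is identical.
    out = []
    for i in range(0, len(hidden_states), chunk_size):
        out.extend(ff(hidden_states[i:i + chunk_size]))
    return out

position_wise_ff = lambda chunk: [2 * x + 1 for x in chunk]  # stand-in FFN
full = position_wise_ff([1, 2, 3, 4, 5])
chunked = chunked_feed_forward(position_wise_ff, [1, 2, 3, 4, 5], chunk_size=2)
# full == chunked
```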

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Google’s XLNet predicts each token of a sentence from the tokens that precede it in a randomly sampled factorization order. Rather than actually reordering the input, this is implemented with attention masks that hide tokens according to the sampled permutation, so the model learns to predict any token from any subset of the others. XLNet also employs Transformer-XL’s recurrence mechanism to connect distant tokens and capture long-term dependencies. The library includes versions of the model for language modeling, token classification, sentence classification, multiple-choice classification, and question answering.
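The permutation objective can be sketched in plain Python (illustrative only): for a given factorization order, each position's visible context is the set of positions predicted earlier in that order.

```python
def permutation_contexts(order):
    # For each target position, record which positions were already
    # predicted earlier in the factorization order.
    seen, contexts = [], {}
    for pos in order:
        contexts[pos] = sorted(seen)
        seen.append(pos)
    return contexts

# With factorization order [2, 0, 3, 1]: position 2 is predicted with
# no context, while position 1 sees positions 0, 2 and 3.
contexts = permutation_contexts([2, 0, 3, 1])
# contexts == {2: [], 0: [2], 3: [0, 2], 1: [0, 2, 3]}
```

Averaged over many sampled orders, every token is eventually predicted from every possible subset of the others, which is what distinguishes XLNet from strict left-to-right models.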

Sequence-to-Sequence Models

Sequence-to-sequence models combine the Transformer’s encoder and decoder. They can be adapted to many tasks, but are most commonly used for translation, summarization, and question answering.

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

BART by Meta is a sequence-to-sequence model for language tasks such as generation, translation, and comprehension. It consists of an encoder and a decoder: the encoder processes the corrupted input tokens, while the decoder reconstructs the original tokens, masking future words. During pretraining, the input text is corrupted with a mix of transformations: masking random tokens, deleting tokens, replacing spans of tokens with a single mask token (text infilling), permuting sentences, and rotating the document to start at a specific token.
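A few of these corruptions are easy to sketch on a token list (a toy illustration, not BART's actual preprocessing code):

```python
import random

def token_masking(tokens, p, rng):
    # BERT-style corruption: replace random tokens with <mask>
    return ["<mask>" if rng.random() < p else t for t in tokens]

def text_infilling(tokens, start, length):
    # Replace a whole span with a *single* <mask> token, so the model
    # must also learn how many tokens are missing.
    return tokens[:start] + ["<mask>"] + tokens[start + length:]

def document_rotation(tokens, start):
    # Rotate the document so it begins at a chosen token.
    return tokens[start:] + tokens[:start]

tokens = "the quick brown fox jumps".split()
text_infilling(tokens, 1, 3)   # ['the', '<mask>', 'jumps']
document_rotation(tokens, 2)   # ['brown', 'fox', 'jumps', 'the', 'quick']
```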

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

Following the same encoder-decoder model architecture as BART, PEGASUS employs two self-supervised objectives for pre-training: Masked Language Modeling (MLM) and Gap Sentence Generation (GSG) for summarization. In MLM, random tokens are masked and predicted by the encoder, akin to BERT. In GSG, entire encoder input sentences are masked and fed to the decoder, which has a causal mask to predict future words. Unlike BART, Pegasus’ pretraining task closely resembles summarization, where significant sentences are masked and generated together as one output sequence from the remaining sentences, resembling an extractive summary.
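The GSG objective can be sketched as follows (a toy illustration; real PEGASUS selects important sentences by a scoring heuristic rather than taking fixed indices):

```python
def gap_sentence_generation(sentences, gap_ids):
    # Mask selected sentences in the source document; the decoder's
    # target is those sentences joined as one pseudo-summary.
    source = " ".join("<mask_1>" if i in gap_ids else s
                      for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(gap_ids))
    return source, target

doc = ["Pegasus is a winged horse.",
       "It appears in Greek mythology.",
       "Many tales mention it."]
gap_sentence_generation(doc, {1})
# ('Pegasus is a winged horse. <mask_1> Many tales mention it.',
#  'It appears in Greek mythology.')
```

Because the model must generate whole missing sentences from the rest of the document, the pretraining task already looks like abstractive summarization.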

Marian: Fast Neural Machine Translation in C++

Microsoft developed Marian, a powerful and self-contained Neural Machine Translation system. It includes an integrated automatic differentiation engine based on dynamic computation graphs. Marian is entirely written in C++. The system’s encoder-decoder framework is designed to achieve both high training efficiency and fast translation speeds, making it a research-friendly toolkit.

T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

In T5 by Google, the traditional Transformer is modified to use learned relative positional embeddings in its attention layers. It handles various NLP tasks by casting them as text-to-text problems using task prefixes such as “summarize:”, “question:”, or “translate English to German:”. Pretraining combines supervised and self-supervised objectives. Supervised training uses the GLUE and SuperGLUE benchmarks as downstream tasks, converted into text-to-text format. Self-supervised training corrupts the input sentence by randomly removing about 15% of the tokens and replacing each dropped-out span with a single sentinel token. The encoder takes the corrupted sentence, the decoder takes the original sentence, and the target consists of the dropped-out spans delimited by their sentinel tokens.
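The span-corruption scheme can be sketched as follows (an illustration with hand-picked spans; the real preprocessing samples span positions and lengths randomly):

```python
def t5_span_corruption(tokens, spans):
    # spans: ordered, non-overlapping (start, end) index pairs to drop.
    # Each span is replaced in the input by one sentinel token; the
    # target lists the spans behind the same sentinels, closed by a
    # final sentinel.
    corrupted, target, last = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted += tokens[last:start] + [sentinel]
        target += [sentinel] + tokens[start:end]
        last = end
    corrupted += tokens[last:]
    target.append(f"<extra_id_{len(spans)}>")
    return corrupted, target

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = t5_span_corruption(tokens, [(1, 2), (6, 8)])
# inp: Thank <extra_id_0> for inviting me to <extra_id_1> last week
# tgt: <extra_id_0> you <extra_id_1> your party <extra_id_2>
```

Note how consecutive dropped tokens (“your party”) share a single sentinel, so the model must also infer span lengths.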

MBart: Multilingual Denoising Pre-training for Neural Machine Translation

Meta’s MBart shares BART’s architecture and training objective, but stands out by being trained on text in 25 languages. Its main purpose is to excel at both supervised and unsupervised machine translation, and it pioneers pre-training a complete sequence-to-sequence model across many languages by denoising full texts.

Read more: Best Transformer-based LLMs on Hugging Face (Part 1)

Shritama Saha
Shritama (she/her) is a technology journalist at AIM who is passionate about exploring the influence of AI on domains including fashion, healthcare, and banking.
