With ChatGPT and GPT-3 reigning supreme, there’s a good chance you have become curious about the technology behind them: Large Language Models (LLMs). Google is now amping things up to compete with Microsoft-backed OpenAI’s phenomenal chatbot by releasing a rival, Bard, which is powered by LaMDA, the LLM it announced two years ago.
If you’re eager to delve into the realm of LLMs and chatbots, exploring their origins and architecture, look no further. Here’s a compilation of the top papers to begin with:
Attention is All You Need
In 2017, Google researchers introduced the Transformer, a neural network architecture designed for sequence tasks in NLP such as machine translation. It replaces the recurrent connections of earlier models with self-attention mechanisms, demonstrating that attention alone is sufficient for capturing long-range dependencies. This architecture is the basis of the current SOTA models, including ChatGPT.
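To make the self-attention idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention. It is an illustration under simplifying assumptions, not the paper's full model: the toy token vectors are used directly as queries, keys and values, whereas a real Transformer adds learned projections, multiple heads and positional encodings. The function and variable names are hypothetical.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors, so every
        # position can draw on every other position in a single step.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three 2-dimensional token vectors (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Because each row of `weights` spans the whole sequence, distant tokens interact directly rather than through a chain of recurrent steps.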
A Neural Probabilistic Language Model
In 2003, Yoshua Bengio and his colleagues at the University of Montreal proposed learning a distributed representation (a feature vector) for each word jointly with the probability function for word sequences. This neural network approach improved on the existing n-gram models, and the paper identified speed-up techniques for training as a much-needed area of further research.
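For context on the n-gram baseline, the sketch below builds a tiny count-based bigram model. The corpus and function names are made up for illustration. It shows the weakness that motivates learned word representations: a word pair never seen in training gets probability zero, however plausible it is.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def bigram_prob(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev), with no smoothing."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

# Hypothetical toy corpus.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
counts = train_bigram(corpus)

p_seen = bigram_prob(counts, "the", "cat")    # pair seen in training
p_unseen = bigram_prob(counts, "dog", "ran")  # plausible, but never seen
```

A neural model that places "cat" and "dog" close together in feature space can instead share what it learned about one word with the other.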
Training language models to follow instructions with human feedback
Published in March 2022 by OpenAI, this is one of the most important papers for understanding how ChatGPT is trained. It argues that making language models larger does not by itself make them better at following a user’s intent, and that fine-tuning them with human feedback aligns them with that intent. The result was InstructGPT which, just like ChatGPT, is based on the paradigm of Reinforcement Learning from Human Feedback (RLHF), proposed by Christiano et al. in 2017.
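One ingredient of RLHF is a reward model trained on human comparisons between pairs of responses. The sketch below shows the pairwise preference loss commonly used for this step, -log(sigmoid(r_chosen - r_rejected)). The function name and the scalar rewards are illustrative assumptions, and the rest of the pipeline (supervised fine-tuning and the reinforcement learning stage) is not shown.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(diff)), where diff = reward_chosen - reward_rejected.

    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one, and large otherwise.
    """
    diff = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Hypothetical reward scores for two candidate responses.
loss_agrees = pairwise_preference_loss(2.0, 0.0)     # model agrees with humans
loss_disagrees = pairwise_preference_loss(0.0, 2.0)  # model disagrees
```

Minimising this loss over many labelled comparisons pushes the reward model to rank responses the way the human labellers did.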
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM, a decoder-only transformer model, was introduced by the BigScience Workshop in December 2022. To democratise LLMs, the collaboration decided to open-source it. It is trained on the ROOTS corpus, a dataset that covers 46 natural and 13 programming languages.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
In 2018, about two years before GPT-3 was released by OpenAI, the Google AI Language team released BERT, a pre-trained model that can be fine-tuned with just one additional output layer. The fine-tuned models reach SOTA results on tasks such as language inference and question answering. Where this paper stands out is its deep bidirectional architecture, which conditions on both left and right context and improves on the transfer learning demonstrated by earlier unsupervised pre-training.
BlenderBot 3: A deployed conversational agent that continually learns to responsibly engage
In August 2022, Meta AI released BlenderBot 3, a 175 billion parameter dialogue model. This ChatGPT alternative can scour the internet, which makes it stand out from the current conversational bots. The bot is only available in the US at the moment. The paper also details the continual learning plan using the data collected from deployment.
Improving alignment of dialogue agents via targeted human judgements
In September 2022, DeepMind presented Sparrow, its own ChatGPT-style dialogue agent. Since there is no paper for ChatGPT, this is a good alternative for understanding how the popular chatbot works. The paper traces out the RLHF method used to train Sparrow and shows how human feedback, including targeted judgements on specific rules, can help steer generative models toward complex goals.
Improving Language Understanding by Generative Pre-Training
Published by OpenAI in 2018, the paper introduces GPT, a decoder-style LLM for generative modelling. It is one of the papers that paved OpenAI’s way into NLP tasks: a language model is first generatively pre-trained on a huge corpus of unlabeled text, and then discriminatively fine-tuned for each specific task.
Scaling Laws for Neural Language Models
In 2020, OpenAI released a paper studying the empirical laws for scaling language model performance. It investigates how a language model’s performance depends on its size, along with the size of its dataset and the compute used for training. The paper argues that power-law scaling laws govern this relationship, and provides empirical evidence to support the claim.
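A scaling law of this kind can be written as a power law in model size, L(N) = (N_c / N)^alpha, where N is the number of parameters. The sketch below evaluates that form. The values of `n_c` and `alpha` are placeholders chosen for illustration, not the constants fitted in the paper.

```python
def power_law_loss(n_params, n_c, alpha):
    """Loss predicted by a power law in parameter count: (n_c / n) ** alpha."""
    return (n_c / n_params) ** alpha

# Placeholder constants (assumptions, not fitted values from the paper).
N_C = 1.0e13
ALPHA = 0.08

small = power_law_loss(1.0e8, N_C, ALPHA)
doubled = power_law_loss(2.0e8, N_C, ALPHA)

# Under a power law, doubling the model size multiplies the loss by the
# same factor, 2 ** -alpha, no matter how large the model already is.
improvement_factor = doubled / small
```

On a log-log plot this relationship is a straight line with slope -alpha, which is how such laws are usually read off from experiments.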
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
In 2019, Facebook AI (now Meta AI) presented a new pre-training method for sequence-to-sequence models based on denoising autoencoders. The paper argues that this approach can improve the performance of models on a range of NLP tasks, including text generation, translation, and comprehension. Loosely similar to diffusion models, the process involves corrupting text with noise and then training the model to reconstruct the original text.
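The corrupt-then-reconstruct idea can be sketched with two simple noising functions: replacing a span of tokens with a single mask token, and deleting tokens outright. This is a simplified illustration with hypothetical function names. Span positions and lengths are fixed here so the result is deterministic, whereas in practice they would be sampled at random.

```python
MASK = "<mask>"

def text_infilling(tokens, start, length):
    """Replace tokens[start:start + length] with one mask token.

    The model must work out both what is missing and how many tokens
    the single mask stands for.
    """
    return tokens[:start] + [MASK] + tokens[start + length:]

def token_deletion(tokens, positions):
    """Remove the tokens at the given positions, leaving no marker."""
    drop = set(positions)
    return [t for i, t in enumerate(tokens) if i not in drop]

original = "the quick brown fox jumps".split()
corrupted = text_infilling(original, 1, 2)

# A training example pairs the noisy input with the clean target.
training_pair = (corrupted, original)
```

The encoder reads the corrupted sequence and the decoder is trained to generate the original one, which is what makes the setup a denoising autoencoder.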
Deep learning based text classification: A comprehensive review
A survey paper published in January 2021 traces how deep learning techniques have been outperforming traditional machine learning approaches in day-to-day NLP tasks. It reviews more than 150 DL models, categorised by their neural network architecture, including transformer-based models.
Cramming: Training a Language Model on a Single GPU in One Day
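Researchers from the University of Maryland, College Park took a different approach to studying the performance of language models. Instead of scaling up the size of the model, they asked how far a transformer-based language model can get when trained from scratch on a single GPU in just one day. They report that even in this constrained setting, performance closely follows the scaling laws observed in large-compute settings.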
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Researchers from Stanford University figured out a way to improve the efficiency of transformers, which are slow and memory hungry on long sequences. In the paper, they describe an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads and writes between the GPU’s high bandwidth memory (HBM) and its on-chip SRAM.
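The reason attention can be computed block by block and still be exact is a running, or "online", softmax. The pure-Python sketch below illustrates that idea for a single query vector. It is not the paper's GPU kernel: there is no real memory hierarchy here, the usual 1/sqrt(d) scaling is omitted for brevity, and all names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_full(q, keys, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / denom
            for j in range(len(values[0]))]

def attention_tiled(q, keys, values, block_size=2):
    """Same result, but keys and values are visited one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (the factor is 0.0 on the first block, when nothing is stored).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data (assumed values).
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = attention_full(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=2)
```

Only one block of scores exists at any time, so the full attention matrix never has to be stored, yet the output matches the standard computation up to floating-point rounding.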