How Transformers Are Making Headway In Reinforcement Learning

In 2017, Google researchers announced Transformers to the world in a NIPS paper titled “Attention is All You Need”. The novel neural network architecture, built around the self-attention mechanism, has since been used in several NLP applications, GPT-3 being a good case in point.

Thanks to their ability to integrate information over long time horizons and scale to large amounts of data, Transformers have achieved success in domains such as language modelling and machine translation.

Reinforcement learning using Transformers

Reinforcement learning is one of the three machine learning paradigms, along with supervised and unsupervised learning. It is a goal-oriented approach in which an agent works towards a complex goal and every correct step is rewarded; in other words, the right decisions are reinforced through incentives. Long Short-Term Memory (LSTM) networks have long been a common architectural choice for reinforcement learning agents.
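The reward-driven loop described above can be sketched in a few lines. The following is a minimal, illustrative example (not from the article) of tabular Q-learning on a tiny chain environment, where moving towards the goal state is the behaviour that gets reinforced:

```python
import random

# Minimal illustrative sketch (not from the article): tabular Q-learning
# on a five-state chain. Reaching state 4 gives reward +1; every other
# step gives 0, so "move right" is the behaviour that gets reinforced.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                      # move left / move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.1       # learning rate, discount, exploration

random.seed(0)
for episode in range(200):
    s = 0
    while s != GOAL:
        # epsilon-greedy: explore occasionally, otherwise exploit Q
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0
        # reinforce: move Q(s, a) towards reward + discounted future value
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# after training, the greedy action in every non-goal state is "right"
greedy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(GOAL)]
print(greedy)   # [1, 1, 1, 1]
```

The epsilon-greedy rule is what balances trying new actions against repeating the ones whose rewards have already been reinforced into the Q-table.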

The applications of Transformers, on the other hand, have been primarily limited to NLP tasks (think GPT and BERT), where they solve sequence-to-sequence problems while handling long-range dependencies with ease. They are now becoming a popular choice for other tasks, reinforcement learning among them.

Conventional reinforcement learning relies on the Markov property to estimate values one step at a time. However, it is also possible to formulate the problem as sequence modelling: predicting the sequence of actions that leads to a sequence of high rewards. Researchers have therefore probed whether the powerful, high-capacity sequence prediction models that work so well in domains such as NLP can also provide effective solutions to the reinforcement learning problem.
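As a hypothetical illustration of that reframing, a trajectory of (state, action, reward) steps can be flattened into one token sequence that a language-model-style predictor could be trained on; the trajectory values and bin sizes below are invented for illustration:

```python
# Hypothetical sketch of the reframing: flatten a trajectory of
# (state, action, reward) steps into a single flat token sequence.
# The trajectory values below are invented for illustration.
trajectory = [
    ((0.1, 0.7), 1, 0.0),
    ((0.4, 0.6), 0, 0.0),
    ((0.9, 0.2), 1, 1.0),
]

def discretize(x, n_bins=100, lo=0.0, hi=1.0):
    """Map a continuous value in [lo, hi] to one of n_bins integer tokens."""
    return min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)

tokens = []
for state, action, reward in trajectory:
    tokens += [discretize(dim) for dim in state]   # state dims as tokens
    tokens.append(action)                          # discrete action token
    tokens.append(discretize(reward))              # reward as a token

print(tokens)   # one flat sequence: 4 tokens per step, 12 in total
```

Once a trajectory is a token stream like this, next-token prediction over it plays the role that one-step value estimation plays in conventional reinforcement learning.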

Recently, researchers from the University of California, Berkeley, explored how state-of-the-art Transformer architectures can reframe reinforcement learning as ‘one big sequence modelling’ problem by modelling joint distributions over sequences of states, actions, and rewards. In their paper, “Reinforcement Learning as One Big Sequence Modeling Problem”, the researchers observed that the reframing significantly simplified a range of design decisions, eliminating the need for separate behaviour policy constraints and epistemic uncertainty estimators. The approach is applicable across domains, including long-horizon dynamics prediction, imitation learning, offline reinforcement learning, and goal-conditioned reinforcement learning.

Decision Transformers

Transformer architectures model sequential data efficiently, and their self-attention mechanism lets a layer assign credit directly: by maximising the dot product between query and key vectors, it forms associations between states and the returns that follow them. Transformers can therefore operate effectively even when rewards are sparse or distracting. Studies have also shown that Transformers enable better generalisation and transfer, owing to their ability to model a wide distribution of behaviours.
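The query-key mechanism referred to above can be sketched as plain scaled dot-product attention; the vectors below are made-up toy values, not anything from the paper:

```python
import numpy as np

# Illustrative sketch of scaled dot-product attention (toy values): a
# query that aligns with a key receives a large dot product, so the
# associated value dominates the output -- the same mechanism that lets
# a layer tie states to the returns that follow them.
def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights

Q = np.array([[0.0, 1.0]])          # one query, aligned with the 2nd key
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 0.0]])
V = np.array([[10.0], [20.0], [30.0]])
out, w = attention(Q, K, V)
print(w.round(3))                   # weight concentrates on the second key
```

Because the second key has the largest dot product with the query, its value receives the largest attention weight, and the output is pulled towards it.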

Recently, a team of researchers from UC Berkeley, Facebook AI and Google introduced a framework that abstracts reinforcement learning as a sequence modelling problem. The team presented the Decision Transformer, an architecture that casts reinforcement learning as conditional sequence modelling. Unlike approaches that fit value functions or compute policy gradients, the Decision Transformer outputs actions with a causally masked Transformer; for this study, the authors used a GPT architecture to model trajectories autoregressively.
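The causal masking mentioned above can be illustrated with a lower-triangular mask: position i attends only to positions up to i, so autoregressive generation never conditions on future tokens. This is a generic sketch, not the paper's code:

```python
import numpy as np

# Generic sketch of the causal mask a GPT-style model uses: position i
# may only attend to positions <= i, so action prediction never peeks
# at future states, actions, or returns.
seq_len = 4
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# masked-out (future) positions get -inf scores, so the softmax assigns
# them zero attention weight
scores = np.random.randn(seq_len, seq_len)
scores = np.where(mask, scores, -np.inf)
print(mask.astype(int))
```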

Decision Transformer architecture

The study illustrates the approach with a familiar reinforcement learning task: finding the shortest path on a graph, where the reward is 0 when the agent is at the goal node and -1 otherwise. The Transformer is trained to predict the next token, and near-optimal trajectories are obtained by generating a sequence of actions conditioned on a high desired return. The model achieves this policy improvement without dynamic programming.
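A hypothetical sketch of the return bookkeeping in such a task: each step of a walk that costs -1 per move and 0 at the goal is labelled with its return-to-go (the suffix sum of rewards), so conditioning generation on a less negative return amounts to asking the model for a shorter path:

```python
# Hypothetical sketch: a walk costs -1 per step and 0 at the goal, and
# each step is labelled with its return-to-go (the suffix sum of the
# rewards). Conditioning generation on a less negative return then
# amounts to asking for a shorter path.
def returns_to_go(rewards):
    """Suffix sums: the return still achievable from each step onward."""
    rtg, total = [], 0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return rtg[::-1]

rewards = [-1, -1, -1, 0]        # a 3-step walk that ends at the goal
print(returns_to_go(rewards))    # [-3, -2, -1, 0]
```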

Trajectory representation that allows the model to learn meaningful patterns is one of the most important components of the Decision Transformer. Instead of raw rewards, the model is fed returns-to-go, the sum of future rewards, which yields trajectory representations suited to autoregressive training and generation. Token embeddings are obtained with a separate linear layer for each modality, projecting the raw inputs to the embedding dimension.
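A minimal sketch of that embedding step, with invented dimensions: each modality (return-to-go, state, action) gets its own linear projection into a shared embedding space:

```python
import numpy as np

# Minimal sketch with invented dimensions: each modality (return-to-go,
# state, action) gets its own linear projection into a shared embedding
# space, mirroring the per-modality token-embedding step described above.
rng = np.random.default_rng(0)
state_dim, act_dim, embed_dim = 4, 2, 8

W_rtg = rng.normal(size=(1, embed_dim))          # return-to-go is a scalar
W_state = rng.normal(size=(state_dim, embed_dim))
W_act = rng.normal(size=(act_dim, embed_dim))

rtg = np.array([[-3.0]])                         # target return-to-go
state = rng.normal(size=(1, state_dim))
action = np.array([[0.0, 1.0]])                  # one-hot discrete action

# one embedding token per modality, all of width embed_dim
tokens = np.stack([rtg @ W_rtg, state @ W_state, action @ W_act])
print(tokens.shape)                              # (3, 1, 8)
```

Keeping the projections separate lets each modality learn its own mapping into the shared space while the downstream Transformer treats all three as ordinary tokens.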

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies.
