How Transformers Are Making Headway In Reinforcement Learning

In 2017, Google researchers introduced Transformers to the world in a NIPS paper titled “Attention is All You Need”. The architecture, built on the self-attention mechanism, has since been used in many NLP applications, with GPT-3 being a good case in point.

Thanks to their ability to integrate information over long time horizons and to scale to large amounts of data, Transformers have achieved success in domains such as language modelling and machine translation.

Reinforcement learning using Transformers

Reinforcement learning is one of the three main machine learning paradigms, along with supervised and unsupervised learning. It is a goal-oriented approach in which an agent works towards a complex goal and is rewarded for each correct step. In other words, the right decisions are reinforced through incentives. Long Short-Term Memory (LSTM) networks are a common choice of sequence model in reinforcement learning agents.

Applications of Transformers, by contrast, have been largely limited to NLP tasks (think GPT and BERT), where they solve sequence-to-sequence problems and handle long-range dependencies with ease. They are now becoming an increasingly popular choice for reinforcement learning as well.

Reinforcement learning methods typically rely on the Markov property and estimate one step at a time as a task unfolds. However, the problem can also be formulated as sequence modelling: predicting a sequence of actions that leads to a sequence of high rewards. Researchers have therefore probed whether powerful, high-capacity sequence prediction models that work well in other domains, such as NLP, can also provide effective solutions to the reinforcement learning problem.

Recently, researchers from the University of California, Berkeley, explored how state-of-the-art Transformer architectures can be used to reframe reinforcement learning as ‘one big sequence modelling’ problem by modelling distributions over sequences of states, actions, and rewards. In their paper, “Reinforcement Learning as One Big Sequence Modeling Problem”, the researchers observed that the reframing significantly simplified a range of design decisions, for example by eliminating the need for separate behaviour policy constraints and epistemic uncertainty estimators. The approach is applicable across domains, including long-horizon dynamics prediction, imitation learning, offline reinforcement learning, and goal-conditioned reinforcement learning.
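
As a rough illustration of what modelling a distribution over such sequences means, assume a trajectory is flattened into a single token sequence. The symbols below and the flattening itself are assumptions made for illustration, not notation taken from this article. An autoregressive model then factorises the trajectory probability by the chain rule:

```latex
\tau = (s_1, a_1, r_1, \dots, s_T, a_T, r_T) \equiv (x_1, \dots, x_N),
\qquad
p_\theta(\tau) = \prod_{i=1}^{N} p_\theta\left(x_i \mid x_1, \dots, x_{i-1}\right)
```

Each factor is a next-token prediction, which is the same training objective used for language models.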

Decision Transformers

Transformer architectures can model sequential data efficiently, and their self-attention mechanism lets a layer assign credit by maximising the dot product of the query and key vectors, implicitly forming state-return associations. As a result, Transformers can still operate effectively when rewards are sparse or distracting. Studies have also shown that Transformers enable better generalisation and transfer because they can model a wide distribution of behaviours.
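
To make the query-key dot product concrete, here is a minimal sketch of causally masked attention weights in plain Python. The function names and the toy vectors are hypothetical, and a real Transformer would use learned projections and a tensor library; this only shows how a larger dot product between a query and an earlier key translates into a larger attention weight.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(queries, keys):
    """Row t holds the attention weights of position t over all positions.

    Position t may only attend to positions 0..t (causal mask), so the
    weights for later positions are exactly zero.
    """
    d = len(keys[0])
    weights = []
    for t, q in enumerate(queries):
        # Scaled dot product of this query with every key up to position t.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys[: t + 1]
        ]
        row = softmax(scores)
        # Pad masked (future) positions with zero weight.
        weights.append(row + [0.0] * (len(keys) - t - 1))
    return weights


# Toy example with hypothetical 2-dimensional queries and keys.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
weights = causal_attention_weights(queries, keys)
```

In this toy example the second query aligns with the second key, so position 1 attends mostly to itself, and no position places any weight on a later one.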

Recently, a team of researchers from Facebook AI and Google introduced a framework that abstracts reinforcement learning as a sequence modelling problem. The team presented the Decision Transformer, an architecture that casts reinforcement learning as conditional sequence modelling. Unlike other approaches, which fit value functions or compute policy gradients, the Decision Transformer outputs actions using a causally masked Transformer. For this particular study, the authors used a GPT architecture to model trajectories autoregressively.

Decision Transformer architecture

The study illustrates the approach with a typical reinforcement learning task in which the ultimate goal is to find the shortest path on a graph; the reward is 0 when the agent is at the goal node and -1 in all other cases. The Transformer is trained to predict the next token in the trajectory sequence. Optimal trajectories are then obtained by generating a sequence of actions conditioned on a high desired return. In this way, the proposed model achieves policy improvement without dynamic programming.
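
The reward scheme in this shortest-path example can be sketched in a few lines of Python. The node names and helper functions are hypothetical, and summing rewards without discounting is an assumption; the point is only that a shorter path ends up with a higher total return, which is why conditioning on a high return asks the model for a short path.

```python
def rewards_for_path(path, goal):
    """Reward per visited node: 0 at the goal node, -1 everywhere else."""
    return [0 if node == goal else -1 for node in path]


def returns_to_go(rewards):
    """Undiscounted sum of rewards from each step to the end of the path."""
    result, running = [], 0
    for r in reversed(rewards):
        running += r
        result.append(running)
    return result[::-1]


# Two hypothetical paths from node "A" to goal node "G".
short_path = ["A", "C", "G"]
long_path = ["A", "B", "D", "C", "G"]

short_rtg = returns_to_go(rewards_for_path(short_path, goal="G"))
long_rtg = returns_to_go(rewards_for_path(long_path, goal="G"))

print(short_rtg)  # [-2, -1, 0]
print(long_rtg)   # [-4, -3, -2, -1, 0]
```

The first entry of each list is the return of the whole path: -2 for the short one and -4 for the long one.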

The trajectory representation, which must allow the model to learn meaningful patterns, is one of the most important components of the Decision Transformer. Instead of raw rewards, the model is fed the sum of future rewards (the return-to-go), which yields a trajectory representation suited to autoregressive training and generation. Token embeddings are obtained by learning a linear layer for each modality, which projects the raw inputs to the embedding dimension.
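
A minimal sketch of the per-modality embedding step, in plain Python, might look like the following. The dimensions, the random initialisation, and all names are hypothetical; in practice the linear layers are learned and a tensor library would be used. The interleaving order (return-to-go, state, action) is stated here as an assumption based on the Decision Transformer paper, and positional information is left out for brevity.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """A linear layer as (weights, bias); random here, learned in practice."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    return w, b


def apply_linear(layer, x):
    """Project input vector x to the layer's output dimension."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def embed_trajectory(returns_to_go, states, actions, layers):
    """Embed each modality with its own layer and interleave the tokens."""
    tokens = []
    for rtg, s, a in zip(returns_to_go, states, actions):
        tokens.append(apply_linear(layers["return"], [rtg]))
        tokens.append(apply_linear(layers["state"], s))
        tokens.append(apply_linear(layers["action"], a))
    return tokens


rng = random.Random(0)
EMBED_DIM = 4  # hypothetical embedding dimension
layers = {
    "return": make_linear(1, EMBED_DIM, rng),  # scalar return-to-go
    "state": make_linear(3, EMBED_DIM, rng),   # 3-dimensional toy state
    "action": make_linear(2, EMBED_DIM, rng),  # 2-dimensional toy action
}

# Two timesteps of a toy trajectory.
returns_to_go = [-2.0, -1.0]
states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
actions = [[1.0, 0.0], [0.0, 1.0]]

tokens = embed_trajectory(returns_to_go, states, actions, layers)
print(len(tokens), len(tokens[0]))  # 6 4
```

Each timestep contributes three tokens of the same width, so inputs of different raw sizes can be fed to a single Transformer as one sequence.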

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies.
