How Transformers Are Making Headway In Reinforcement Learning

In 2017, Google researchers introduced the Transformer to the world in a NIPS paper titled “Attention Is All You Need”. The novel neural network architecture, built on the self-attention mechanism, has since been used in several NLP applications, GPT-3 being a good case in point.

Transformers, thanks to their ability to integrate information over long time horizons and scale to large amounts of data, have achieved success in domains such as language modelling and machine translation.

Reinforcement learning using Transformers

Reinforcement learning is one of the three machine learning paradigms, along with supervised and unsupervised learning. In reinforcement learning, a goal-oriented agent works towards attaining a complex goal, and every correct step is rewarded. In other words, the right decisions are reinforced via incentivisation. Recurrent networks such as Long Short-Term Memory (LSTM) have long been a common architectural choice in reinforcement learning.
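
To make the incentivisation idea concrete, here is a minimal, self-contained sketch of that loop in Python, using tabular Q-learning on a hypothetical one-dimensional chain environment. The environment, rewards and hyperparameters are illustrative assumptions, not taken from any of the papers discussed here.

import random

# A minimal sketch of the reinforcement learning loop: an agent takes steps
# towards a goal, and steps that reach the goal are rewarded.

GOAL = 5  # goal position on a chain of states 0..5

def step(state, action):
    """Move left (-1) or right (+1); reward 1 for reaching the goal, else 0."""
    next_state = max(0, min(GOAL, state + action))
    return next_state, (1.0 if next_state == GOAL else 0.0)

q = {(s, a): 0.0 for s in range(GOAL + 1) for a in (-1, 1)}  # action values
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(200):
    state = 0
    for _ in range(100):  # cap episode length
        # Explore randomly with probability epsilon (or on ties), else exploit.
        if random.random() < epsilon or q[(state, -1)] == q[(state, 1)]:
            action = random.choice((-1, 1))
        else:
            action = max((-1, 1), key=lambda a: q[(state, a)])
        next_state, reward = step(state, action)
        # Q-learning update: reinforce actions that led towards reward.
        best_next = max(q[(next_state, a)] for a in (-1, 1))
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state
        if state == GOAL:
            break

# After training, the greedy policy moves right, towards the goal.
print([max((-1, 1), key=lambda a: q[(s, a)]) for s in range(GOAL)])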

The applications of Transformers, on the other hand, have primarily been limited to NLP-related tasks (think GPT and BERT), where they solve sequence-to-sequence problems while handling long-range dependencies with ease. Transformers are now becoming an increasingly popular choice for other tasks, including reinforcement learning.

Conventional reinforcement learning relies on the Markov property, estimating one step at a time as the agent works through a task. However, it is also possible to formulate reinforcement learning as a sequence modelling problem: predict the sequence of actions that leads to a sequence of high rewards. Researchers have probed whether powerful, high-capacity sequence prediction models that work well in other domains, such as NLP, can also provide effective solutions to the reinforcement learning problem.
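
As a concrete illustration of this sequence modelling view, the sketch below flattens a trajectory of (state, action, reward) triples into a single token stream that an autoregressive model could be trained on. The trajectory values and the discretisation scheme are assumptions made for illustration only.

# Flatten a trajectory into one token stream a sequence model could learn.

trajectory = [
    # (state, action, reward) at each timestep; values are illustrative
    (0.12, 1, 0.0),
    (0.48, 0, 0.0),
    (0.93, 1, 1.0),
]

def discretise(x, num_bins=100):
    """Map a continuous value in [0, 1] to an integer token."""
    return min(int(x * num_bins), num_bins - 1)

# Flatten into one stream: s_0, a_0, r_0, s_1, a_1, r_1, ...
tokens = []
for state, action, reward in trajectory:
    tokens.extend([discretise(state), action, discretise(reward)])

print(tokens)  # the token sequence an autoregressive model would be trained on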

Recently, researchers from the University of California, Berkeley, explored how state-of-the-art Transformer architectures can be used to reframe reinforcement learning as ‘one big sequence modelling’ problem by modelling distributions over sequences of states, actions, and rewards. In their paper, “Reinforcement Learning as One Big Sequence Modeling Problem”, the researchers observed that the reframing significantly simplified a range of design decisions, such as eliminating the need for separate behaviour policy constraints and epistemic uncertainty estimators. The approach is applicable across domains, including long-horizon dynamics prediction, imitation learning, offline reinforcement learning, and goal-conditioned reinforcement learning.

Decision Transformers

While Transformer architectures can model sequential data efficiently, their self-attention mechanism also lets a layer perform credit assignment directly: by maximising the dot product of query and key vectors, it forms associations between states and the returns they lead to, however far apart they sit in the sequence. Transformers can therefore operate effectively even in the presence of sparse or distracting rewards. Studies have also shown that Transformers enable better generalisation and transfer capabilities, thanks to their ability to model a wide distribution of behaviours.
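
The sketch below shows the scaled dot-product self-attention operation this paragraph refers to: attention weights come from query-key dot products, which is what lets a layer associate a state token with a distant return token. The shapes and values here are illustrative assumptions.

import numpy as np

# Scaled dot-product self-attention over a sequence of token vectors.

def self_attention(x, w_q, w_k, w_v):
    """Attend over a (seq_len, d_model) sequence of token vectors."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # query-key dot products
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ v                             # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                            # illustrative sizes
x = rng.normal(size=(seq_len, d_model))            # e.g. return/state/action tokens
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)      # (6, 8)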

Recently, a team of researchers from UC Berkeley, Facebook AI and Google introduced a framework that abstracts reinforcement learning as a sequence modelling problem. The team presented the Decision Transformer, an architecture that casts reinforcement learning as conditional sequence modelling. Unlike other approaches, which fit value functions or compute policy gradients, the Decision Transformer simply outputs actions using a causally masked Transformer. For this particular study, the authors used a GPT architecture to model trajectories autoregressively.
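
A condensed, self-contained sketch of the rollout loop this describes follows: condition on a target return and autoregressively generate actions, decrementing the desired return by the rewards actually received. ToyPolicy and ToyEnv are hypothetical stand-ins for the GPT backbone and the environment; neither implements the paper’s actual model.

import random

class ToyPolicy:
    """Stand-in for a causally masked GPT; here it picks actions at random."""
    def predict_action(self, returns_to_go, states, actions):
        return random.choice([0, 1])

class ToyEnv:
    """Stand-in environment: reach state 3 within the time limit."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        self.state += 1 if action == 1 else 0
        done = self.state >= 3
        return self.state, (1.0 if done else 0.0), done

def rollout(model, env, target_return, max_timesteps=10):
    state = env.reset()
    returns_to_go, states, actions = [target_return], [state], []
    for _ in range(max_timesteps):
        # Each prediction conditions on the (return-to-go, state, action)
        # history; causal masking means it sees only past tokens.
        action = model.predict_action(returns_to_go, states, actions)
        state, reward, done = env.step(action)
        actions.append(action)
        states.append(state)
        # Decrement the desired return by the reward actually received.
        returns_to_go.append(returns_to_go[-1] - reward)
        if done:
            break
    return actions

print(rollout(ToyPolicy(), ToyEnv(), target_return=1.0))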

Decision Transformer architecture

The study illustrates the approach with a typical reinforcement learning task: finding the shortest path on a graph, where the reward is 0 when the agent is at the goal node and -1 in all other cases. The Transformer is trained to predict the next token; optimal trajectories are then obtained by conditioning generation on a high return and letting the model produce the corresponding sequence of actions. The proposed model achieves policy improvement without dynamic programming.
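
The worked example below reproduces this reward convention on a small, assumed graph: since every non-goal step costs -1, the highest-return trajectory is a shortest path, so selecting (conditioning) on maximum return recovers it without any dynamic programming.

# Shortest path via return maximisation; the graph is an assumption.

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
GOAL = "D"

def trajectory_return(path):
    """Sum of per-step rewards: -1 for every non-goal node visited."""
    return sum(0 if node == GOAL else -1 for node in path)

def all_paths(node, path=()):
    """Enumerate every path from `node` to the goal."""
    path = path + (node,)
    if node == GOAL:
        yield path
    for nxt in graph[node]:
        yield from all_paths(nxt, path)

# "Conditioning" on the highest achievable return selects a shortest path.
best = max(all_paths("A"), key=trajectory_return)
print(best, trajectory_return(best))  # ('A', 'B', 'D') -2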

Trajectory representation that allows the model to learn meaningful patterns is one of the most important components of the Decision Transformer. Rather than raw rewards, the model is fed returns-to-go, the sum of future rewards from each timestep, giving a trajectory representation that suits autoregressive training and generation. Token embeddings are obtained by projecting the raw inputs of each modality (return-to-go, state, and action) through a separate linear layer into the embedding dimension.
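
The sketch below shows both ingredients under assumed shapes: returns-to-go computed as a reverse cumulative sum of rewards, and a separate linear projection per modality into a shared embedding dimension, interleaved as (return-to-go, state, action) tokens.

import numpy as np

rewards = np.array([0.0, 0.0, 1.0, 0.0, 1.0])  # illustrative reward sequence

# Returns-to-go: the sum of future rewards from each timestep,
# i.e. a reverse cumulative sum.
returns_to_go = np.cumsum(rewards[::-1])[::-1]
print(returns_to_go)  # [2. 2. 2. 1. 1.]

state_dim, act_dim, embed_dim = 4, 2, 8        # assumed dimensions
rng = np.random.default_rng(0)

# One linear layer (weight matrix) per modality.
w_return = rng.normal(size=(1, embed_dim))
w_state = rng.normal(size=(state_dim, embed_dim))
w_action = rng.normal(size=(act_dim, embed_dim))

states = rng.normal(size=(len(rewards), state_dim))
actions = rng.normal(size=(len(rewards), act_dim))

return_tokens = returns_to_go[:, None] @ w_return   # (T, embed_dim)
state_tokens = states @ w_state                     # (T, embed_dim)
action_tokens = actions @ w_action                  # (T, embed_dim)

# Interleave as (return-to-go, state, action) per timestep for the GPT input.
tokens = np.stack([return_tokens, state_tokens, action_tokens], axis=1)
print(tokens.reshape(-1, embed_dim).shape)          # (3 * T, embed_dim)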

Shraddha Goled

I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.