In 2017, Google researchers announced Transformers to the world in a NIPS paper titled “Attention is All You Need”. The self-attention mechanism-based novel neural network architecture has since been used in several NLP applications, GPT-3 being a good case in point.
Transformers–thanks to their ability to integrate information over long time horizons and scale to a large amount of data–have achieved success in domains such as language modelling and machine translation.
Reinforcement learning using Transformers
Reinforcement learning is one of the three machine learning paradigms, along with supervised and unsupervised learning. Reinforcement learning is a goal-oriented algorithm that works towards attaining a complex goal where every correct step is rewarded. In other words, the right decisions are reinforced via incentivisation. Long Short Term Memory (LSTM) is a common technique for reinforcement learning.
On the other hand, the applications of Transformers have been primarily limited to NLP-related tasks (think GPT and BERT). Transformers in NLP aim to solve sequence-to-sequence tasks while handling long-range dependencies with ease. Transformers are increasingly becoming a popular choice for tasks such as reinforcement learning.
Reinforcement learning estimates single steps using the Markov property to work on a task in time. However, it is also possible to formulate it as a sequence modelling problem to predict a sequence of actions that leads to a sequence of high rewards. Researchers have probed whether powerful and high capacity sequence prediction models that work well with other domains such as NLP can also provide effective solutions to the reinforcement learning problem.
Recently, researchers from the University of California, Berkeley, explored how state-of-the-art Transformer architectures can be used to reframe reinforcement learning as a ‘one big sequence modelling’ problem by modelling distributions over sequences of states’ actions and rewards. In their paper, “Reinforcement Learning as One Big Sequence Modeling Problem”, the researchers observed the reframing significantly simplified a range of design decisions, such as–eliminating the requirement for separate behaviour policy constraints and other epistemic uncertainty estimators. The approach is applicable across domains, including dynamics prediction, long-horizon dynamics prediction, imitation learning, offline reinforcement learning, and goal conditioned reinforcement learning.
While Transformer architectures can model sequential data efficiently, their self-attention mechanism allows the layer to assign a reward by maximising the dot product of the query and key vectors and forming state-return associations. Therefore, Transformers can operate effectively with a distracting reward. Studies have shown Transformers enable better generalisation and transfer capabilities due to their ability to model a wide distribution of behaviours.
Recently, a team of researchers from Facebook AI and Google introduced a framework that abstracts reinforcement learning as a sequence modelling problem. The team presented a Decision Transformer, an architecture that presents reinforcement learning problems as conditional sequence modelling. Decision Transformer gives output by using a causally masked Transformer, unlike other approaches which fit value functions or compute policy gradients. For this particular study, the authors have used a GPT architecture to model trajectories autoregressively.
The approach suggested by this study is similar to a typical reinforcement learning task where the ultimate goal is to find the shortest path on a graph; the reward is 0 when the agent is at the goal node and -1 in other cases. The Transformer predicts the next token. Optimal trajectories are obtained by generating a sequence of actions via conditioning. The proposed model has achieved policy improvements without dynamic programming.
Trajectory representation for learning meaningful patterns is one of the most important components of the Decision Transformer. The model here is fed with the sum of future rewards that give trajectory representations that fit further autoregressive training and generation. The token embeddings are obtained by projecting raw inputs to the embedding dimension to obtain linear layers for each modality.