Hugging Face has integrated the Decision Transformer, an offline reinforcement learning method, into the 🤗 transformers library and the Hugging Face Hub. This is part of a broader effort to make deep RL more accessible, with more tools to be shared over the coming weeks.
The Decision Transformer model abstracts reinforcement learning as a conditional sequence modelling problem. The main idea is that, instead of training a policy with conventional RL methods (such as fitting a value function that tells us which action to take to maximise the return, i.e. the cumulative reward), the Decision Transformer uses a sequence modelling algorithm (a Transformer) that, given the desired return, past states, and past actions, generates the future actions needed to achieve that return. It is an autoregressive model: each action is predicted conditioned on the desired return and the trajectory so far.
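As a hedged sketch of what the integration looks like in practice, here is how the `DecisionTransformerConfig` and `DecisionTransformerModel` classes shipped in 🤗 transformers can be instantiated and queried for action predictions. The dimensions and sequence length below are illustrative (17-dim states and 6-dim actions, as in Gym's HalfCheetah), and the model here has random weights rather than a trained checkpoint:

```python
import torch
from transformers import DecisionTransformerConfig, DecisionTransformerModel

# Illustrative dimensions: 17-dim states / 6-dim actions (HalfCheetah-like).
config = DecisionTransformerConfig(state_dim=17, act_dim=6)
model = DecisionTransformerModel(config)
model.eval()

batch, seq_len = 1, 20
states = torch.randn(batch, seq_len, config.state_dim)
actions = torch.zeros(batch, seq_len, config.act_dim)
rewards = torch.zeros(batch, seq_len, 1)
returns_to_go = torch.full((batch, seq_len, 1), 100.0)  # the desired return we condition on
timesteps = torch.arange(seq_len).unsqueeze(0)
attention_mask = torch.ones(batch, seq_len)

with torch.no_grad():
    outputs = model(
        states=states,
        actions=actions,
        rewards=rewards,
        returns_to_go=returns_to_go,
        timesteps=timesteps,
        attention_mask=attention_mask,
    )

# The model predicts an action for every position in the sequence.
print(outputs.action_preds.shape)  # torch.Size([1, 20, 6])
```

At inference time, one would feed in the trajectory observed so far and take the action prediction at the last position.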
This is a complete shift in the reinforcement learning paradigm: generative trajectory modelling (modelling the joint distribution of the sequence of states, actions, and rewards) replaces conventional RL algorithms.
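Concretely, the Decision Transformer paper represents each trajectory as a sequence of (return-to-go, state, action) triples, where the return-to-go at step t is the sum of rewards from t to the end of the episode. A minimal sketch of that quantity in plain Python (the function name is ours):

```python
def returns_to_go(rewards):
    """Return-to-go at step t = sum of rewards from t to the end of the episode."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

# Each position conditions on how much reward is still to come.
print(returns_to_go([1.0, 0.0, 2.0]))  # [3.0, 2.0, 2.0]
```

This is the quantity the model is conditioned on at generation time: asking for a high return-to-go steers the autoregressive model toward actions consistent with high-reward trajectories in the training data.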