Reinforcement learning is one of the most important techniques in the pursuit of artificial general intelligence. However, it has several disadvantages that prevent researchers from achieving true AI. Since AI agents learn by trial and error, exposing them to every possible real-world circumstance is a huge challenge. These problems are yet to be addressed effectively, but many real-world RL applications already have a colossal amount of previously collected interaction data. Such existing information can be leveraged to make RL feasible and enable better generalisation by incorporating diverse prior experience through offline reinforcement learning.
Offline Reinforcement Learning
Unlike online RL, where the AI agent incrementally improves its policy as new experience becomes available, offline RL works on a fixed set of logged experiences without any further interaction with the environment (it can, however, be adapted to a growing-batch setting if needed). Offline RL eliminates the need for costly repeated environment interaction, allows models to be evaluated against the existing dataset of interactions, and enables developers to deliver real-world impact quickly.
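The contrast with online RL can be made concrete with a minimal sketch: the learner iterates over a fixed, pre-collected set of transitions and never queries the environment. The toy dataset, tabular `q_update` rule, and all numbers below are illustrative assumptions, not part of the researchers' actual Atari setup.

```python
import random

def q_update(q, transition, alpha=0.1, gamma=0.99, n_actions=2):
    """One tabular Q-learning update from a single logged transition."""
    s, a, r, s_next = transition
    best_next = max(q.get((s_next, b), 0.0) for b in range(n_actions))
    td_target = r + gamma * best_next
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (td_target - q.get((s, a), 0.0))
    return q

# Offline RL: train only on a FIXED dataset of logged
# (state, action, reward, next_state) tuples.
logged_transitions = [
    (0, 1, 1.0, 1),
    (1, 0, 0.0, 0),
    (0, 1, 1.0, 1),
]

q = {}
random.seed(0)
for _ in range(50):  # multiple passes over the same logged data
    random.shuffle(logged_transitions)
    for t in logged_transitions:
        q = q_update(q, t)
```

Note that only state-action pairs present in the logs ever receive an update; pairs the data collection agent never visited stay untouched, which foreshadows the mismatch problem discussed next.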
Although offline RL can reduce computational and interaction costs, it comes with challenges that restrict its use. The central difficulty is the mismatch between online interactions and a fixed dataset of logged interactions: if the model being trained on the existing dataset takes an action different from the one the data collection agent took, the logs contain no record of the reward that action would have produced, so its value cannot be determined from the dataset.
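A tiny illustration of why this matters: a value estimate for an action absent from the logs is just its arbitrary initialisation, and if bootstrapping error inflates it, the greedy offline agent will prefer an action the dataset says nothing about. The numbers here are hypothetical, chosen only to show the failure mode.

```python
# Only action 1 in state 0 was ever logged; its learned value is 0.9.
logged = {(0, 1): 0.9}
q = dict(logged)
q[(0, 0)] = 0.0   # never observed: this is just an initial guess

# Simulated overestimation of the unseen action (e.g. from
# bootstrapping error). No logged reward supports this value.
q[(0, 0)] += 1.0

# Greedy selection now picks the action with zero evidence behind it.
greedy_action = max((0, 1), key=lambda a: q[(0, a)])
```

This is the distribution-mismatch problem in miniature: nothing in the fixed dataset can correct the inflated estimate, because the environment is never queried again.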
Overcoming Challenges In Offline Reinforcement Learning
Researchers first trained a DQN agent on Atari 2600 games and logged its experience. They then proposed random ensemble mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates, to enhance the model's generalisation. Using the collected data, a new RL model was trained that outperformed AI agents trained within the environment. The researchers noted that superior results could be achieved even without explicitly correcting for any distribution mismatch.
However, on various occasions, the off-policy RL agent either diverges or yields poor performance. Consequently, to fix such problems, researchers used regularisation techniques to keep the policy close to the dataset of offline interactions.
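One generic way to express such regularisation is to add a penalty that keeps the learned policy's action distribution close to the empirical action distribution in the logged data. The sketch below uses a KL-divergence penalty as an illustrative stand-in; it is a generic formulation, not the specific regulariser any particular paper used.

```python
import math

def regularised_loss(td_error, policy_probs, data_probs, lam=1.0):
    """Hypothetical behaviour-regularised objective for discrete actions:
    TD error plus lam * KL(data || policy), which penalises the policy
    for drifting away from actions seen in the offline dataset."""
    kl = sum(p * math.log(p / q)
             for p, q in zip(data_probs, policy_probs) if p > 0)
    return td_error + lam * kl

# When the policy matches the data distribution, no penalty is added.
loss_matched = regularised_loss(0.5, [0.5, 0.5], [0.5, 0.5])

# A policy that concentrates on actions the data rarely took pays extra.
loss_drifted = regularised_loss(0.5, [0.9, 0.1], [0.5, 0.5])
```

The coefficient `lam` trades off optimising the reward signal against staying within the support of the logged interactions.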
Therefore, DQN was again used to collect datasets from 60 Atari 2600 games, 200 million frames each, using sticky actions to make the problems more challenging. Five DQN agents were trained on each of the 60 games with different random initialisations, and every state, action, reward, and next state was stored, resulting in 300 datasets. Similarly, data from QR-DQN agents was gathered for training and comparing offline models' performance.
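The logging procedure can be sketched as follows. "Sticky actions" means that with some probability the environment repeats the agent's previous action instead of the intended one, injecting stochasticity. The toy environment, policy, and function names here are illustrative assumptions, not the actual Atari interface.

```python
import random

def toy_env_step(state, action):
    """Hypothetical 4-state toy environment standing in for a game."""
    reward = 1.0 if action == state % 2 else 0.0
    return (state + 1) % 4, reward

def log_episode(policy, n_steps=100, stickiness=0.25, seed=0):
    """Roll out a policy and store every (s, a, r, s') transition."""
    rng = random.Random(seed)
    dataset, state, prev_action = [], 0, 0
    for _ in range(n_steps):
        intended = policy(state)
        # Sticky actions: occasionally repeat the previous action.
        action = prev_action if rng.random() < stickiness else intended
        next_state, reward = toy_env_step(state, action)
        dataset.append((state, action, reward, next_state))
        state, prev_action = next_state, action
    return dataset

data = log_episode(policy=lambda s: s % 2)
```

The resulting list of tuples is exactly the kind of fixed dataset an offline agent is later trained on, with no further calls to the environment.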
On training models with the data collected from online DQN and QR-DQN, the results were mixed but encouraging. Offline DQN underperforms the fully-trained online DQN on all except a few games, where it achieves higher scores with the same amount of data. Offline QR-DQN, on the other hand, outperforms offline DQN and the fully-trained online DQN on most of the games. These results demonstrate that it is possible to optimise reliable agents offline using standard deep RL algorithms. Furthermore, the disparity between the performance of offline QR-DQN and offline DQN indicates a difference in their ability to exploit offline data.
Comparison Of New Offline RL Models
Although offline QR-DQN demonstrated superior performance, generalisation in offline RL still needed improvement. Consequently, researchers leveraged a technique from supervised learning, where ensembles of models enhance generalisation, to propose two new offline RL agents: Ensemble-DQN and Random Ensemble Mixture (REM).
While the former is a simple extension of DQN that trains multiple Q-value estimates and averages them for evaluation, the latter combines various Q-value estimates and uses this random combination for robust training.
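The two combination rules can be sketched side by side. Assume `K` scalar Q-value estimates (one per ensemble head) for a single state-action pair; the function names and numbers below are illustrative, not from the paper's implementation.

```python
import random

def ensemble_q(estimates):
    """Ensemble-DQN: plain average of the Q-value heads."""
    return sum(estimates) / len(estimates)

def rem_q(estimates, rng):
    """REM: a random convex combination of the heads. Weights are
    drawn, normalised to sum to 1, and resampled at every training
    step, so each step enforces Bellman consistency on a different
    mixture of the estimates."""
    raw = [rng.random() for _ in estimates]
    total = sum(raw)
    weights = [w / total for w in raw]
    return sum(w * q for w, q in zip(weights, estimates))

rng = random.Random(0)
heads = [1.0, 2.0, 3.0]
avg = ensemble_q(heads)    # deterministic average
mix = rem_q(heads, rng)    # random point inside the convex hull of heads
```

Because the convex weights are resampled every step, REM effectively trains against a large family of mixed targets rather than one fixed average, which is the source of its robustness.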
Further, the offline REM agent's performance was evaluated against offline DQN and offline QR-DQN. Offline REM outperformed the other two offline models, as it generalises better.
Researchers attributed the failure of previous work to the size and diversity of the data collected by the online RL agent. As in supervised learning, the performance of offline agents increases as the size of the data increases. Moreover, researchers used the first 200 million frames per game in the DQN dataset, which delivered exceptional results.
Thus, standard online RL algorithms work well in the offline setting given sufficiently large and diverse datasets. Although offline RL outperformed the online agents here, it requires off-policy evaluation for hyperparameter tuning and early stopping to manage training effectively. Nevertheless, offline RL has shown promising results, which could lead to its broader adoption over online RL.