Self-driving cars, industrial automation, machine translation, language models, game-playing agents, data processing and recommendation systems are among the major disruptive technologies in artificial intelligence, and reinforcement learning (RL) is at the forefront of many of them. In fact, it is one of the most exciting areas of machine learning today.
RL has become a critical part of AI development, and of the quest for AGI, because it does not depend on historical data sets; instead, it learns through trial and error, much as humans do. Accordingly, the last few years have seen an accelerated pace in understanding and improving RL. Facebook, Google, DeepMind, Amazon, Microsoft and other big tech companies invest significant time, money and effort in RL innovation. The trial-and-error learning approach enables a computer to make a series of decisions that maximise a reward metric for a task, without human intervention and without being explicitly programmed to achieve the task.
As popular as it may be, RL does not come without its challenges. Analytics India Magazine has noted some common RL challenges and ways to overcome them.
Sample efficiency

One of the major challenges in RL is learning efficiently from limited samples. Sample efficiency describes how much an algorithm can extract from each sample or, equivalently, how much experience it must generate during training to reach good performance. The challenge is that RL systems typically need enormous amounts of experience before they become effective: DeepMind's AlphaGo Zero, for instance, played millions of games of self-play before it surpassed the version of AlphaGo that had beaten the human world champion.
A research paper by Gen Li (Princeton) et al. described this as, "Given that the state space and the action space could both be unprecedentedly enormous, it is often infeasible to request a sample size exceeding the fundamental limit set forth by the ambient dimension in the tabular setting. As a result, the quest for sample efficiency cannot be achieved in general without exploiting proper low-complexity structures underlying the problem of interest." An IEEE paper introduced the 'safe set algorithm' as a solution: it monitors and modifies control and evaluates RL in a clustered dynamic environment of the kind that challenges existing RL methods.
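To make the sample-efficiency metric concrete, here is a minimal sketch (all names and the toy environment are ours, not from any library or the papers above): tabular Q-learning on a five-state chain, counting every environment step the agent consumes before its greedy policy first solves the task. That step count is precisely the "amount of experience the algorithm has to generate" that the metric captures.

```python
import random

# Toy 5-state chain: move right to reach the goal at state 4 (+1 reward).
N_STATES, GOAL, ACTIONS = 5, 4, (0, 1)   # actions: 0 = left, 1 = right

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def greedy_action(Q, s):
    # Epsilon-greedy behaviour policy uses random tie-breaking.
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

def greedy_solves(Q):
    # Does the purely greedy policy reach the goal within N_STATES hops?
    s, hops = 0, 0
    while s != GOAL and hops < N_STATES:
        s, _, _ = step(s, max(ACTIONS, key=lambda a: Q[(s, a)]))
        hops += 1
    return s == GOAL

random.seed(0)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
env_steps = 0                      # the sample-efficiency metric

for episode in range(200):
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS) if random.random() < 0.2 else greedy_action(Q, s)
        nxt, r, done = step(s, a)
        env_steps += 1             # every interaction with the environment counts
        target = r + 0.9 * (0.0 if done else max(Q[(nxt, b)] for b in ACTIONS))
        Q[(s, a)] += 0.5 * (target - Q[(s, a)])
        s = nxt
    if greedy_solves(Q):
        break

print("environment steps before solving:", env_steps)
```

On this trivial chain the count stays small; the point of the sketch is that in large state and action spaces the same counter explodes, which is the regime the quoted paper is describing.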
Reproducibility

"When combined with the unavailability of code and models, the result is that the approach is very difficult, if not impossible, to reproduce, study, improve upon, and extend," wrote Facebook AI researchers of their effort to reproduce DeepMind's AlphaZero (the team ultimately succeeded).
Neural networks are opaque black boxes whose inner workings are a mystery even to their creators. They are also growing in size and complexity, backed by huge datasets, massive computing power and long training runs. These factors make RL models very difficult to replicate.
In recent years, there’s been a growing movement in AI to counteract the so-called reproducibility crisis, a high-stakes version of the classic it-worked-on-my-machine coding problem. The crisis manifests in problems ranging from AI research that selectively reports algorithm runs to idealised results courtesy of heavy GPU firepower.
A paper from the Leiden Institute of Advanced Computer Science suggests leveraging the concept of 'minimal traces'. The idea supports "re-simulation of action sequences in deterministic RL environments", allowing reviewers to verify, re-use, and manually inspect experimental results without needing large compute clusters. Other solutions include tracking and logging experiments, submitting code and creating a metadata repository.
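The minimal-trace idea can be sketched in a few lines (the environment and names here are illustrative, not from the Leiden paper): instead of shipping a trained model, an experiment stores only the RNG seed and the action sequence, and in a deterministic environment replaying them reproduces the run exactly.

```python
import random

def run_env(seed, actions):
    """Deterministic toy environment: returns the visited states."""
    rng = random.Random(seed)           # all stochasticity comes from the seed
    state, trajectory = 0, [0]
    for a in actions:
        state = state + a + rng.choice([0, 1])  # seeded 'environment noise'
        trajectory.append(state)
    return trajectory

# The whole 'experiment' reduces to this trace: tiny to store and share.
trace = {"seed": 42, "actions": [1, 0, 1, 1]}
original = run_env(trace["seed"], trace["actions"])

# A reviewer re-simulates from the trace and verifies the result bit-for-bit,
# with no model weights and no compute cluster required.
replayed = run_env(trace["seed"], trace["actions"])
print("replay matches:", replayed == original)
```

The design choice worth noting is that determinism is what makes the trace "minimal": once the environment consumes all randomness from the stored seed, the action list alone pins down the entire trajectory.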
Some competition organisers go further: rather than having entrants submit trained agents, which could conceivably be trained with research-lab levels of GPU wattage, they require entrants to submit code that is then trained on the organisers' own machines. They also introduce randomising elements to make sure results hold across different game versions.
Performing in real-life scenarios
RL agents learn by exploring artificial environments. Of AlphaZero, DeepMind said, "Through reinforcement learning (RL), this single system learnt by playing round after round of games through a repetitive process of trial and error." Agents can afford to fail in manufactured environments, but they do not have the opportunity to fail and learn in real-life scenarios. In real environments, the agent usually lacks the room to observe its surroundings well enough to apply its training and settle on a winning strategy. This is compounded by the reality gap: the mismatch between the learning simulation and the real world, which the agent cannot gauge.
General techniques researchers use include imitating desired behaviour (learning from demonstrations), training in more accurate simulations, better algorithm design and, most popular of all, training agents with a reward-and-punishment mechanism. Since the agent is rewarded for correct actions and punished for incorrect ones, it learns to favour the former.
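The reward-and-punishment mechanism can be sketched as a tiny two-action bandit (purely illustrative, not any specific library's API): the "correct" action earns +1, the "incorrect" one is punished with -1, and the agent's value estimates shift toward the rewarded behaviour.

```python
import random

random.seed(1)
value = [0.0, 0.0]          # estimated value of each action
CORRECT = 1                 # assume action 1 is the desired behaviour

for t in range(500):
    # Epsilon-greedy: mostly exploit the current best estimate.
    a = random.randrange(2) if random.random() < 0.1 \
        else max((0, 1), key=lambda i: value[i])
    reward = 1.0 if a == CORRECT else -1.0   # reward vs punishment
    value[a] += 0.1 * (reward - value[a])    # incremental value update

print("learned values:", value)   # value[1] ends up dominating
```

After a few punished attempts at the wrong action, its estimated value turns negative and the greedy choice locks onto the rewarded one, which is the maximisation behaviour the paragraph describes.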
Sparse rewards

The reward technique discussed above is not foolproof. When rewards are sparsely distributed in the environment, the agent may not observe enough of it to notice the reward signals and learn which actions to maximise. The problem also arises when the environment cannot provide reward signals in time; in many tasks, for instance, the agent receives a positive signal only once it is already close to the target.
Curiosity-driven methods are widely used to encourage the agent to explore the environment and learn to tackle the tasks in it. The authors of the paper 'Curiosity-driven exploration by self-supervised prediction' proposed an Intrinsic Curiosity Module (ICM) that supports exploration by rewarding the agent for transitions its internal model predicts poorly. Another approach is curriculum learning, where the agent is presented with tasks in ascending order of complexity, imitating the order in which humans learn.
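The curiosity bonus can be sketched as follows (a heavily simplified, scalar stand-in for the ICM's learned forward model; all names are ours): a forward model predicts the next state, and its prediction error is handed to the agent as an intrinsic reward, so novel, poorly modelled transitions look rewarding even when the environment itself gives no signal.

```python
forward_model = {}   # (state, action) -> predicted next state (a scalar here)

def intrinsic_reward(state, action, next_state, lr=0.5):
    """Return the forward model's prediction error, then improve the model."""
    pred = forward_model.get((state, action), 0.0)
    error = abs(next_state - pred)                      # the curiosity signal
    forward_model[(state, action)] = pred + lr * (next_state - pred)
    return error

# Revisiting the same transition shrinks its prediction error, so the bonus
# decays and exploration naturally moves on to less familiar transitions.
bonuses = [intrinsic_reward(3, 1, 4.0) for _ in range(5)]
print(bonuses)   # -> [4.0, 2.0, 1.0, 0.5, 0.25]
```

The decaying sequence is the key property: curiosity rewards are self-extinguishing for familiar states, which is what keeps the agent moving through a sparse-reward environment instead of circling known territory.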
Taking RL offline

While in standard, online RL the agent incrementally improves its policy with new experiences, offline RL works on a fixed set of logged experiences, with minimal interaction with the environment. This removes the need to repeatedly train agents in the environment in order to scale. Still, it poses a challenge: if the model being trained on an existing dataset takes an action different from the one the data-collecting agent took, one cannot determine what reward the learning model should receive.
Another issue, as Google AI notes, is distributional shift: to improve over the historical data, RL algorithms must learn to make decisions that differ from the decisions recorded in the dataset.
The team developed a solution for this, an offline RL algorithm called 'conservative Q-learning' (CQL), that guards "against overestimation while avoiding explicit construction of a separate behaviour model and without using importance weights." Researchers have also found that online RL agents work well in the offline setting given sufficiently diverse datasets.
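The conservative idea behind CQL can be illustrated with a toy loss (this is a sketch in the spirit of the method, not the authors' code; the function name and numbers are ours): alongside the usual squared TD error, the loss adds a penalty of the form logsumexp over Q-values minus the Q-value of the action actually present in the dataset, which pushes down Q-values for out-of-distribution actions.

```python
import math

def cql_loss(q_values, data_action, td_error, alpha=1.0):
    """Squared TD error plus a conservative penalty on inflated Q-values."""
    logsumexp = math.log(sum(math.exp(q) for q in q_values))
    penalty = logsumexp - q_values[data_action]   # grows with off-dataset Q
    return td_error ** 2 + alpha * penalty

# If the Q-function inflates an action never seen in the dataset (action 2),
# the penalty grows, discouraging exactly the overestimation that makes
# offline RL fail under distributional shift.
modest = cql_loss([1.0, 0.5, 0.2], data_action=0, td_error=0.1)
inflated = cql_loss([1.0, 0.5, 5.0], data_action=0, td_error=0.1)
print(inflated > modest)   # -> True
```

The design choice to note is that the penalty needs no separate behaviour model and no importance weights, matching the quoted property: it only compares the learned Q-values against the logged action.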