Reinforcement learning (RL) algorithms power the brains of walking robots and AI chess grandmasters. These algorithms learn policies that pursue a target by rewarding the agent for actions that nudge it toward the destination.
Reinforcement learning systems rely on the framework of Markov decision processes (MDPs). However, the idealised assumptions of an MDP are not easily available to a learning algorithm in a real-world environment. In practical, scalable real-world scenarios, RL systems usually run into the following challenges:
- the absence of reset mechanisms,
- state estimation, and
- reward specification.
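The MDP framework the article refers to can be sketched as a tuple of states, actions, transition probabilities and rewards. The following toy two-state MDP is purely illustrative (the class, state names and reward values are assumptions for this sketch, not from the paper):

```python
import random

class MDP:
    """A minimal Markov decision process sketch."""

    def __init__(self, transitions, rewards, gamma=0.99):
        # transitions[state][action] -> list of (next_state, probability)
        self.transitions = transitions
        self.rewards = rewards  # rewards[(state, action)] -> float
        self.gamma = gamma      # discount factor

    def step(self, state, action):
        # Sample the next state according to the transition probabilities.
        next_states, probs = zip(*self.transitions[state][action])
        next_state = random.choices(next_states, weights=probs)[0]
        return next_state, self.rewards[(state, action)]

# A toy two-state MDP: "go" switches states, "stay" does not.
mdp = MDP(
    transitions={
        "s0": {"go": [("s1", 1.0)], "stay": [("s0", 1.0)]},
        "s1": {"go": [("s0", 1.0)], "stay": [("s1", 1.0)]},
    },
    rewards={("s0", "go"): 1.0, ("s0", "stay"): 0.0,
             ("s1", "go"): 0.0, ("s1", "stay"): 0.5},
)
state, reward = mdp.step("s0", "go")
print(state, reward)  # s1 1.0
```

The challenges listed above amount to this tuple not being cleanly available in the real world: there is no clean reset to an initial state, the true state must be estimated, and the reward function must be specified by hand.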
For example, in robotics, collecting high-quality data for a task is very challenging. Unlike computer vision, where humans can label the data, achieving generalisation (what ML is all about) in robotics may require smarter reinforcement learning algorithms that take advantage of vast amounts of prior data.
Learning to learn was first popularised by Juergen Schmidhuber in his 1987 thesis on meta-learning with genetic programming. As defined by Prof. Schmidhuber, “metalearning means learning the credit assignment method itself through self-modifying code. Meta Learning may be the most ambitious but also the most rewarding goal of machine learning. There are few limits to what a good meta learner will learn. Where appropriate it will learn to learn by analogy, by chunking, by planning, by subgoal generation, by combinations thereof – you name it.”
While RL is used for AutoML, automating RL itself hasn’t had much success. Unlike in supervised learning, the authors explained, the RL design decisions that affect learning and performance are usually chosen through trial and error. AutoRL bridges this gap by applying the AutoML framework from supervised learning to the MDP setting in RL.
Now, to make reinforcement learning agents smarter, researchers at Google have proposed a new method. In a paper titled “Evolving Reinforcement Learning Algorithms”, they introduce a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs that compute the loss function for a value-based, model-free RL agent to optimise. The learned algorithms perform independently of the domain in which they operate and can generalise to new environments not seen during training.
Algorithms That Evolve
(Source: Paper by Co-Reyes et al.)
Previous work on learning RL algorithms applied meta-gradients, evolutionary strategies, and RNNs. The Google researchers instead represented the update rule as a computation graph that includes both neural network modules and symbolic operators. The resulting graph can be interpreted analytically and can optionally be initialised from known existing algorithms.
The researchers describe RL algorithms as general programs expressed in a domain-specific language. “We target updates to the policy rather than reward bonuses for exploration,” they explained. The agent state, policy parameters and other inputs are mapped to a scalar loss, which is then optimised with gradient descent. The computational graph here is a directed acyclic graph (DAG) of nodes with typed inputs and outputs.
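The idea of a loss function as a DAG of typed nodes can be sketched as follows. This is a hypothetical representation written for illustration; the paper's actual DSL, operator set and node types differ:

```python
class Node:
    """One node of a computation-graph program."""

    def __init__(self, name, op, inputs=()):
        self.name = name      # identifier for the node
        self.op = op          # callable applied to the input values
        self.inputs = inputs  # upstream nodes (the edges of the DAG)

    def evaluate(self, env):
        # Leaf nodes read their value from env; interior nodes
        # apply their operator to recursively evaluated inputs.
        if not self.inputs:
            return env[self.name]
        return self.op(*(n.evaluate(env) for n in self.inputs))

# Leaves: the Q-value of the taken action and the bootstrapped target.
q_sa = Node("q_sa", None)
target = Node("target", None)

# Interior nodes compose a squared TD-error loss: (target - q_sa) ** 2
delta = Node("delta", lambda t, q: t - q, (target, q_sa))
loss = Node("loss", lambda d: d * d, (delta,))

print(loss.evaluate({"q_sa": 1.5, "target": 2.0}))  # 0.25
```

Because the graph is built from named, typed nodes rather than opaque network weights, a discovered loss can be read off analytically, which is what makes the learned algorithms interpretable.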
- Search is carried out over programs with a maximum of 20 nodes, not including input or parameter nodes.
- Mutations occur with probability 0.95; otherwise, a new random program is sampled.
- The search runs on 300 CPUs for roughly 72 hours, by which point around 20,000 programs have been evaluated.
As shown in the illustration above, the mutator component produces a new algorithm by modifying one of the top-performing algorithms. The new algorithm’s performance is then evaluated over a set of training environments, and the population is updated. This setup also allows existing knowledge to be incorporated by initialising the population from known RL algorithms instead of starting purely from scratch.
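The outer evolutionary loop described above can be sketched schematically. Everything here is a stand-in: `random_program`, `mutate` and `evaluate` are placeholder functions (real evaluation would train an agent on the training environments), and only the 0.95 mutation probability comes from the article:

```python
import random

MUTATE_PROB = 0.95  # from the article: mutate with probability 0.95

def random_program():
    # Stand-in: a "program" is just a short list of opcode ids here.
    return [random.randrange(10) for _ in range(5)]

def mutate(program):
    # Replace one randomly chosen node of the program.
    child = list(program)
    child[random.randrange(len(child))] = random.randrange(10)
    return child

def evaluate(program):
    # Stand-in fitness; real evaluation trains an RL agent and
    # measures its performance on the training environments.
    return sum(program)

population = [random_program() for _ in range(20)]
for _ in range(100):
    if random.random() < MUTATE_PROB:
        # Pick a top performer from a small tournament and mutate it.
        parent = max(random.sample(population, 5), key=evaluate)
        child = mutate(parent)
    else:
        # Otherwise sample a fresh random program.
        child = random_program()
    # Update the population: add the child, retire the oldest member.
    population.append(child)
    population.pop(0)

best = max(population, key=evaluate)
print(best, evaluate(best))
```

Seeding `population` with encodings of known algorithms such as DQN, rather than random programs, is how existing knowledge can be incorporated into the search.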
To evaluate the learnability of the RL algorithms, the researchers used the popular CartPole and LunarLander challenges. If an algorithm succeeds on CartPole, it then proceeds to more challenging training environments. “For learning from scratch we also compare the effect of the number of training environments on the learned algorithm by comparing training on just CartPole versus training on CartPole and LunarLander,” they added.
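This "succeed on CartPole first" gating can be sketched as a hurdle evaluation. The threshold value, the `score_on` helper and the score dictionaries below are illustrative assumptions, not figures from the paper:

```python
CARTPOLE_HURDLE = 0.6  # illustrative threshold, not the paper's value

def score_on(env_name, algorithm):
    # Stand-in: pretend the algorithm dict records per-env performance.
    # A real evaluation would train an agent with this algorithm's loss.
    return algorithm.get(env_name, 0.0)

def evaluate_with_hurdle(algorithm):
    envs = ["CartPole", "LunarLander"]
    total = score_on(envs[0], algorithm)
    if total < CARTPOLE_HURDLE:
        return total  # fails the hurdle: skip the expensive environments
    for env in envs[1:]:
        total += score_on(env, algorithm)
    return total

weak = {"CartPole": 0.25, "LunarLander": 0.9}
strong = {"CartPole": 0.75, "LunarLander": 0.5}
print(evaluate_with_hurdle(weak))    # 0.25 (never reaches LunarLander)
print(evaluate_with_hurdle(strong))  # 1.25
```

The point of the hurdle is efficiency: most of the roughly 20,000 candidate programs fail cheaply on CartPole, so compute is spent only on candidates worth testing in harder environments.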
The results show that this method can automatically discover algorithms on par with recently proposed RL research, and that the discovered algorithms empirically attain better performance than deep Q-learning baselines.
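The deep Q-learning baseline referred to here optimises a squared TD-error loss. A minimal scalar sketch of that loss, written for illustration (real DQN applies this over mini-batches of transitions with neural-network Q-functions and a target network):

```python
def dqn_loss(q_sa, reward, q_next_max, gamma=0.99, done=False):
    """Squared TD-error loss of standard deep Q-learning, for one transition.

    q_sa:       Q(s, a) for the action actually taken
    q_next_max: max over a' of Q(s', a') at the next state
    """
    # Bootstrapped target: r + gamma * max_a' Q(s', a'),
    # with no bootstrap term at terminal states.
    target = reward + (0.0 if done else gamma * q_next_max)
    delta = target - q_sa  # TD error
    return delta * delta

print(dqn_loss(q_sa=1.0, reward=1.0, q_next_max=2.0))  # (1 + 0.99*2 - 1)^2
```

The evolved algorithms are modifications of exactly this kind of loss expression, which is why they can be compared term by term against the DQN baseline.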
The paper focuses on task-agnostic RL update rules in the value-based RL setting that are both interpretable and generalisable. This work takes the best of reinforcement learning and AutoML techniques to bolster the domain of AutoRL. The contributions can be summarised as follows:
- Introduction of a new method that improves the “learning to learn” ability in algorithms.
- Introduction of a general language for representing algorithms that compute the loss function for value-based, model-free RL agents to optimise.
- The two newly learned RL algorithms achieve good generalisation performance over a wide range of environments.
Find the original paper here.