“We want AIs to make decisions, and reinforcement learning is the study of how to make decisions.” — Akshay Krishnamurthy, Principal Researcher at Microsoft
The reinforcement learning group at Microsoft Research has been working toward a collective goal: reinforcement learning for the real world. The team's work centres on three main areas: batch reinforcement learning, strategic exploration and representation learning.
Let’s take a look at the top reinforcement learning research papers by Microsoft Research accepted at the 34th Annual Conference on Neural Information Processing Systems (NeurIPS 2020).
Note: The list is in no particular order.
MOReL: Model-Based Offline Reinforcement Learning
About: In this work, the researchers presented MOReL, an algorithmic framework for model-based offline RL. The framework consists of two steps: learning a pessimistic MDP (P-MDP) from the offline dataset, and then learning a near-optimal policy within that P-MDP. The P-MDP serves as a useful surrogate for policy evaluation and learning, while helping to overcome the common pitfalls of model-based reinforcement learning.
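MOReL's exact P-MDP construction relies on ensemble disagreement of learned dynamics models; as a rough sketch (the `known` mask and penalty value are illustrative assumptions, not the paper's construction), a pessimistic MDP can redirect poorly supported transitions to a low-reward absorbing state:

```python
import numpy as np

def build_pmdp(P, R, known, penalty=-100.0):
    """Construct a pessimistic MDP from a learned model.

    P:     (S, A, S) learned transition probabilities
    R:     (S, A) learned rewards
    known: (S, A) boolean mask, True where the offline data
           supports the model (e.g. low ensemble disagreement)

    Unknown state-action pairs are redirected to an extra
    absorbing HALT state carrying a large negative reward,
    so the planner avoids leaving the data's support.
    """
    S, A, _ = P.shape
    halt = S  # index of the extra absorbing state
    P_pess = np.zeros((S + 1, A, S + 1))
    R_pess = np.full((S + 1, A), penalty)
    P_pess[:, :, halt] = 1.0  # default: every transition goes to HALT
    for s in range(S):
        for a in range(A):
            if known[s, a]:
                P_pess[s, a, :S] = P[s, a]
                P_pess[s, a, halt] = 0.0
                R_pess[s, a] = R[s, a]
    return P_pess, R_pess
```

Any standard planner (e.g. value iteration) run on the returned model then yields a conservative policy that stays within the regions the batch data covers.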
Know more here.
Multi-task Batch Reinforcement Learning with Metric Learning
About: In this paper, the researchers tackled the Multi-task Batch Reinforcement Learning problem. They proposed a novel application of the triplet loss and trained a policy from multiple datasets, each generated by interaction with a different task. They also measured the performance of the trained policy on unseen tasks sampled from the same task distributions as the training tasks.
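The paper's contribution lies in how triplets are formed from transitions belonging to different tasks; the underlying margin-based triplet loss itself is standard, and a minimal sketch (embedding network and triplet sampling omitted) looks like:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss on batches of embedding vectors.

    Pulls the anchor toward the positive (same task) and pushes it
    away from the negative (different task) until their squared
    Euclidean distances differ by at least `margin`.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()
```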
Know more here.
Provably Good Batch Reinforcement Learning Without Great Exploration
About: Batch reinforcement learning (RL) is important for applying RL algorithms to high-stakes tasks. The researchers showed that a small adjustment to the Bellman optimality and evaluation back-ups, making the update more conservative, can yield much stronger guarantees. They focused on algorithm families based on Approximate Policy Iteration (API) and Approximate Value Iteration (AVI), which form the prototype of several model-free online and offline reinforcement learning algorithms.
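To give a flavour of a conservative backup in tabular form (this is a generic pessimism sketch with an assumed visitation-based penalty, not the paper's exact update), one can penalise state-action pairs that the batch rarely covers:

```python
import numpy as np

def conservative_backup(Q, P, R, mu, gamma=0.99, beta=1.0):
    """One conservative approximate value iteration (AVI) backup.

    Q:  (S, A) current value estimates
    P:  (S, A, S) empirical transition model from the batch
    R:  (S, A) empirical rewards
    mu: (S, A) empirical state-action visitation frequencies

    Subtracting a penalty that grows as coverage shrinks keeps
    the update from trusting values at poorly covered pairs.
    """
    V = Q.max(axis=1)                              # (S,) greedy values
    bonus = beta / np.sqrt(np.maximum(mu, 1e-8))   # uncertainty penalty
    return R + gamma * P @ V - bonus               # (S, A) backup target
```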
Know more here.
Empirical Likelihood for Contextual Bandits
About: In this paper, the researchers proposed an estimator and confidence interval for computing the value of a policy from off-policy data in the contextual bandit setting. They also proposed an off-policy optimisation algorithm that searches for policies with a large lower bound on reward. The results show that the proposed policy optimisation algorithm outperformed a strong baseline system for learning from off-policy data.
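The paper derives a tighter empirical-likelihood interval; as a simplified illustration of the "optimise a reward lower bound" idea only, here is a crude normal-approximation lower bound on an inverse-propensity-scored value estimate:

```python
import numpy as np

def ips_lower_bound(rewards, target_probs, logging_probs, delta=0.05):
    """Crude one-sided lower confidence bound on a policy's value
    from logged contextual bandit data, via inverse propensity
    scoring (IPS) and a normal approximation.

    This is NOT the paper's empirical-likelihood interval; it is a
    baseline-style sketch of the same quantity.
    """
    w = target_probs / logging_probs   # importance weights
    est = w * rewards                  # per-sample IPS value estimates
    mean = est.mean()
    z = 1.645 if delta == 0.05 else 2.326   # one-sided normal quantile
    se = est.std(ddof=1) / np.sqrt(len(est))
    return mean - z * se
```

An off-policy optimiser in this style would search over candidate policies for the one maximising this lower bound rather than the point estimate.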
Know more here.
Safe Reinforcement Learning via Curriculum Induction
About: This paper introduces an alternative approach in which an agent learns under the supervision of an automatic instructor that keeps it from violating constraints during the learning process. The instructor neither needs to know how to do well at the task the agent is learning nor how the environment works. Instead, it has a library of reset controllers that activate when the agent starts behaving dangerously, preventing it from doing damage.
Know more here.
Constrained Episodic Reinforcement Learning in Concave-Convex and Knapsack Settings
About: In this paper, the researchers proposed an algorithm for tabular episodic reinforcement learning with constraints. The experiments demonstrated that the proposed algorithm significantly outperforms existing approaches in constrained episodic environments.
Know more here.
FLAMBE: Structural Complexity and Representation Learning of Low-Rank MDPs
About: In this paper, the researchers developed a new algorithm called FLAMBE, for “Feature Learning And Model-Based Exploration”, that learns a representation for low-rank MDPs. FLAMBE interleaves exploration and representation learning for provably efficient RL in low-rank transition models. The work mainly focuses on the representation learning question: how can such features be learned?
Know more here.
Sample-Efficient Reinforcement Learning of Undercomplete POMDPs
About: This paper presented a sample-efficient algorithm, OOM-UCB, for episodic finite under-complete POMDPs, where the number of observations is greater than the number of latent states and where exploration is essential for learning, thus distinguishing the results from prior works.
Know more here.
Deep Reinforcement and InfoMax Learning
About: In this paper, the researchers introduced an objective based on Deep InfoMax (DIM), which trains the agent to predict the future by maximising the mutual information between its internal representations of successive time steps. They tested the approach in several synthetic settings, where it successfully learns representations that are predictive of the future.
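DIM-style objectives are typically maximised via an InfoNCE-type lower bound on mutual information; a minimal sketch (the encoders producing the representations, and the dot-product score function, are assumptions for illustration) over a batch of paired representations at times t and t+1:

```python
import numpy as np

def infonce_loss(z_t, z_next):
    """InfoNCE loss between representations of successive time steps.

    z_t, z_next: (B, D) arrays; matching rows come from the same
    trajectory at times t and t+1. Minimising this loss maximises a
    lower bound on the mutual information between the two variables:
    each row of z_t must identify its true successor among the batch.
    """
    scores = z_t @ z_next.T                       # (B, B) similarity matrix
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # positives (true successors) sit on the diagonal
    return -np.mean(np.diag(log_probs))
```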
Know more here.
Provably Adaptive Reinforcement Learning in Metric Spaces
About: This paper provided a refined analysis of the algorithm of Sinclair et al. and showed that its regret scales with the zooming dimension of the instance. This parameter, which originates in the bandit literature, captures the size of the set of near-optimal actions and is always smaller than the covering dimension used in previous analyses. As such, the results guarantee adaptive reinforcement learning in metric spaces.
Know more here.
The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning
About: Inspired by work from neuroscience on detecting model-based behaviour in humans and animals, in this paper, the researchers introduced an experimental setup to evaluate the model-based behaviour of RL methods. The metric based on this setup, the Local Change Adaptation (LoCA) regret, measures how quickly an RL method adapts to a local change in the environment.
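In simplified form (the averaging window and the use of episodic returns are assumptions here; the paper's setup is more carefully controlled), the regret can be read as accumulated suboptimality in the evaluation phase after a local change:

```python
import numpy as np

def loca_regret(returns_after_change, optimal_return, window):
    """Simplified LoCA-style regret: average suboptimality over the
    first `window` episodes after a local change to the environment.

    A model-based method that propagates the change through its model
    adapts quickly and accrues little regret; a method that keeps
    exploiting the stale solution accrues a lot.
    """
    r = np.asarray(returns_after_change[:window], dtype=float)
    return float(np.mean(optimal_return - r))
```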
Know more here.
Policy Improvement via Imitation of Multiple Oracles
About: In this paper, the researchers proposed the state-wise maximum of the oracle policies’ values as a natural baseline for resolving conflicting advice from multiple oracles. Using a reduction of policy optimisation to online learning, they introduced a novel IL algorithm, MAMBA, which can provably learn a policy competitive with this benchmark.
Know more here.
PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning
About: This paper introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs exploitation tradeoff using an ensemble of learned policies (the policy cover). PC-PG enjoys polynomial sample complexity and run time for both tabular MDPs and, more generally, linear MDPs in an infinite-dimensional RKHS.
Know more here.