Top Reinforcement Learning Papers By Microsoft At NeurIPS 2020


“We want AIs to make decisions, and reinforcement learning is the study of how to make decisions.” — Akshay Krishnamurthy, Principal Researcher at Microsoft

The reinforcement learning group at Microsoft Research has been working toward a collective goal: reinforcement learning for the real world. The team's work focused mainly on three areas: batch reinforcement learning, strategic exploration and representation learning.

Let’s take a look at the top reinforcement learning research papers by Microsoft Research accepted at the 34th Annual Conference on Neural Information Processing Systems (NeurIPS 2020).


Note: The list is in no particular order.

MOReL: Model-Based Offline Reinforcement Learning

About: In this work, the researchers presented MOReL, an algorithmic framework for model-based offline RL. The framework consists of two steps: learning a Pessimistic MDP (P-MDP) from the offline dataset, and then learning a near-optimal policy in that P-MDP. The P-MDP serves as a useful surrogate for policy evaluation and learning, and it helps overcome the common pitfalls of model-based reinforcement learning.
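The idea of a pessimistic MDP can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: it assumes that "unknown" state-action pairs are detected through disagreement within an ensemble of learned dynamics models, and that such pairs lead to an absorbing state with a large penalty. The names `pessimistic_step`, `HALT`, `threshold` and `halt_penalty` are hypothetical.

```python
import itertools
import numpy as np

HALT = None  # hypothetical absorbing state standing in for "unknown" regions


def pessimistic_step(models, state, action, reward_fn,
                     threshold=0.5, halt_penalty=100.0):
    """One transition in a toy pessimistic MDP.

    models: list of callables (state, action) -> predicted next state.
    If the ensemble disagrees by more than `threshold`, the pair is treated
    as unknown and the agent moves to HALT with a large negative reward.
    """
    if state is HALT:
        # HALT is absorbing: the agent stays there and keeps being penalised.
        return HALT, -halt_penalty
    preds = [np.asarray(m(state, action), dtype=float) for m in models]
    # Largest pairwise distance between ensemble predictions (assumed
    # uncertainty measure; 0.0 when there is only one model).
    disagreement = max(
        (np.linalg.norm(p - q) for p, q in itertools.combinations(preds, 2)),
        default=0.0,
    )
    if disagreement > threshold:
        return HALT, -halt_penalty
    return np.mean(preds, axis=0), reward_fn(state, action)


# Two toy models that agree for small actions and diverge for large ones.
toy_models = [lambda s, a: s + a, lambda s, a: s + 1.1 * a]
toy_reward = lambda s, a: 1.0
known = pessimistic_step(toy_models, np.array([0.0]), np.array([1.0]), toy_reward)
unknown = pessimistic_step(toy_models, np.array([0.0]), np.array([10.0]), toy_reward)
```

A policy trained against such a step function is discouraged from drifting into regions the offline data does not cover, which is the intuition behind treating the P-MDP as a safe surrogate.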

Know more here.

Multi-task Batch Reinforcement Learning with Metric Learning

About: In this paper, the researchers tackled the Multi-task Batch Reinforcement Learning problem. They proposed a novel application of the triplet loss and trained a policy from multiple datasets, each generated by interaction with a different task. They also measured the performance of the trained policy on unseen tasks sampled from the same task distributions as the training tasks.
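For readers unfamiliar with the triplet loss, here is a minimal sketch of the standard formulation. How the paper builds its triplets is not described above; the sketch assumes that the anchor and positive embeddings come from the same task and the negative from a different task. The function name and the `margin` default are illustrative choices.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss on squared Euclidean distances.

    Pulls the anchor towards the positive and pushes it away from the
    negative until the gap between the two distances is at least `margin`.
    Inputs are arrays of shape (D,) or (N, D).
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


same_task = np.array([0.0, 0.1])
other_task = np.array([3.0, 0.0])
anchor = np.array([0.0, 0.0])
well_separated = triplet_loss(anchor, same_task, other_task)
badly_separated = triplet_loss(anchor, other_task, same_task)
```

The loss is zero once embeddings from different tasks are separated by more than the margin, so training effort goes only to triplets that are still confusable.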

Know more here.

Provably Good Batch Reinforcement Learning Without Great Exploration

About: Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks. The researchers showed that a small modification to the Bellman optimality and evaluation back-ups, which makes the update more conservative, can yield much stronger guarantees. They focused on algorithm families based on Approximate Policy Iteration (API) and Approximate Value Iteration (AVI), which form the prototype of several model-free online and offline reinforcement learning algorithms.
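One way to picture a conservative back-up is the tabular sketch below. It is an illustration under stated assumptions rather than the paper's algorithm: it assumes rewards are non-negative (so zero is a pessimistic value), that the transition model `P` and rewards `R` are given, and that a state-action pair counts as "supported" when the dataset visited it at least `min_count` times. All names are hypothetical.

```python
import numpy as np


def conservative_q_iteration(P, R, counts, gamma=0.9, min_count=5, iters=200):
    """Value iteration whose back-up ignores poorly covered actions.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    counts: (S, A) number of times each pair appears in the batch data.
    """
    supported = counts >= min_count
    Q = np.zeros_like(R, dtype=float)
    for _ in range(iters):
        # Unsupported pairs contribute value 0 to the max (pessimistic
        # under the non-negative reward assumption).
        V = (supported * Q).max(axis=1)   # shape (S,)
        Q = R + gamma * (P @ V)           # shape (S, A)
    return Q, supported


# Toy problem: action 1 in state 0 looks lucrative but never appears in the data.
P = np.zeros((2, 2, 2))
P[0, :, 0] = 1.0   # both actions keep the agent in state 0
P[1, :, 1] = 1.0   # state 1 is absorbing
R = np.array([[0.1, 1.0], [0.0, 0.0]])
counts = np.array([[10, 0], [10, 10]])
Q, supported = conservative_q_iteration(P, R, counts, gamma=0.5)
conservative_value = (supported * Q).max(axis=1)
```

In this toy problem an unconstrained back-up would bootstrap from the unobserved action, while the masked version values state 0 using only the action the data actually covers.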

Know more here.

Empirical Likelihood for Contextual Bandits

About: In this paper, the researchers proposed an estimator and confidence interval for computing the value of a policy from off-policy data in the contextual bandit setting. They also proposed an off-policy optimisation algorithm that searches for policies with a large reward lower bound. The results show that the proposed policy optimisation algorithm outperformed a strong baseline system for learning from off-policy data.
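To make "computing the value of a policy from off-policy data" concrete, the sketch below shows plain inverse propensity scoring (IPS), a standard and much simpler estimator for the same problem. It is not the empirical likelihood estimator proposed in the paper, and the function and argument names are illustrative.

```python
import numpy as np


def ips_value(rewards, logging_probs, target_probs):
    """Inverse propensity scoring estimate of a target policy's value.

    rewards[i]: reward observed for the logged action in round i.
    logging_probs[i]: probability the logging policy gave that action.
    target_probs[i]: probability the target policy gives the same action.
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(target_probs, dtype=float) / np.asarray(logging_probs, dtype=float)
    return float(np.mean(weights * rewards))


# Logged data from a uniform policy over two actions; the target policy
# prefers the action that happened to be rewarded.
estimate = ips_value(
    rewards=[1, 0, 1, 0],
    logging_probs=[0.5, 0.5, 0.5, 0.5],
    target_probs=[0.9, 0.1, 0.9, 0.1],
)
```

IPS is unbiased when the logging probabilities are correct, but it can have high variance, which is one motivation for estimators that come with tighter confidence intervals.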

Know more here.

Safe Reinforcement Learning via Curriculum Induction 

About: This paper introduces an alternative method, in which an agent learns under the supervision of an automatic instructor that prevents the agent from violating constraints during the learning process. The instructor needs to know neither how to do well at the task the agent is learning nor how the environment works. Instead, it has a library of reset controllers that are activated when the agent starts behaving dangerously, preventing it from doing damage.

Know more here.

Constrained Episodic Reinforcement Learning in Concave-Convex and Knapsack Settings

About: In this paper, the researchers proposed an algorithm for tabular episodic reinforcement learning with constraints. The experiments demonstrated that the proposed algorithm significantly outperforms prior approaches in existing constrained episodic environments.

Know more here.

FLAMBE: Structural Complexity and Representation Learning of Low-Rank MDPs

About: In this paper, the researchers developed a new algorithm called FLAMBE, short for “Feature learning and model-based exploration”, that learns a representation for low-rank MDPs. FLAMBE engages in exploration and representation learning for provably efficient RL in low-rank transition models. The work focuses mainly on the representation learning question: how can such features be learned?

Know more here.

Sample-Efficient Reinforcement Learning of Undercomplete POMDPs

About: This paper presented a sample-efficient algorithm, OOM-UCB, for episodic finite undercomplete POMDPs, where the number of observations is larger than the number of latent states and where exploration is essential for learning, which distinguishes the results from prior work.

Know more here.

Deep Reinforcement and InfoMax Learning

About: In this paper, the researchers introduced an objective based on Deep InfoMax (DIM) that trains the agent to predict the future by maximising the mutual information between its internal representations of successive time steps. They tested the approach in several synthetic settings, where it successfully learns representations that are predictive of the future.
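Mutual information between representations is usually optimised through a tractable bound. The sketch below assumes an InfoNCE-style contrastive loss, a common choice for this purpose; the paper's exact objective and network architecture may differ, and the function name and toy data are illustrative.

```python
import numpy as np


def info_nce_loss(z_now, z_next):
    """InfoNCE loss between representations of successive time steps.

    z_now, z_next: (N, D) arrays. Row i of z_next is the true successor of
    row i of z_now; the remaining rows of the batch act as negatives.
    Minimising this loss maximises a lower bound on mutual information.
    """
    scores = z_now @ z_next.T                               # (N, N) similarities
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# Representations that predict their own successor score a low loss;
# pairing each step with the wrong successor scores a high one.
z = 5.0 * np.eye(4)
aligned_loss = info_nce_loss(z, z)
shuffled_loss = info_nce_loss(z, np.roll(z, 1, axis=0))
```

The loss is low only when each time step's representation identifies its true successor among the batch, which is the sense in which the representation becomes predictive of the future.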

Know more here.

Provably Adaptive Reinforcement Learning in Metric Spaces

About: This paper provided a refined analysis of a variant of the algorithm of Sinclair et al. and showed that its regret scales with the zooming dimension of the instance. This parameter, which originates in the bandit literature, captures the size of the subsets of near-optimal actions and is always smaller than the covering dimension used in previous analyses. As such, the results are the first provably adaptive guarantees for reinforcement learning in metric spaces.

Know more here.

The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning

About: Inspired by work from neuroscience on detecting model-based behaviour in humans and animals, in this paper, the researchers introduced an experimental setup to evaluate the model-based behaviour of RL methods. The metric based on this setup, the Local Change Adaptation (LoCA) regret, measures how quickly an RL method adapts to a local change in the environment. 

Know more here.

Policy Improvement via Imitation of Multiple Oracles

About: In this paper, the researchers proposed the state-wise maximum of the oracle policies’ values as a natural baseline to resolve conflicting advice from multiple oracles. Using a reduction of policy optimisation to online learning, they introduced a novel imitation learning (IL) algorithm, MAMBA, which can provably learn a policy competitive with this benchmark.
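The state-wise maximum baseline is simple to write down for a tabular case. The sketch below assumes each oracle's value function is available as a row of an array; it illustrates the baseline only, not the MAMBA algorithm itself, and the names are hypothetical.

```python
import numpy as np


def max_oracle_baseline(oracle_values):
    """State-wise maximum over the value functions of several oracles.

    oracle_values: (K, S) array with V^k(s) for oracle k and state s.
    Returns the baseline value per state and the index of the best oracle
    in each state (ties go to the lowest index).
    """
    oracle_values = np.asarray(oracle_values, dtype=float)
    return oracle_values.max(axis=0), oracle_values.argmax(axis=0)


# Two oracles that are each better in different states.
baseline, best_oracle = max_oracle_baseline([[1.0, 5.0, 2.0],
                                             [3.0, 0.0, 2.0]])
```

Because the best oracle can change from state to state, the baseline can be stronger than any single oracle, which is what makes it a meaningful target when the oracles give conflicting advice.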

Know more here.

PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning

About: This paper introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs exploitation tradeoff using an ensemble of learned policies (the policy cover). PC-PG enjoys polynomial sample complexity and run time for both tabular MDPs and, more generally, linear MDPs in an infinite-dimensional RKHS. 

Know more here.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
