How Meta’s off-belief learning enhances human-AI collaboration sans direct data sharing

Meta AI has introduced a flexible approach, off-belief learning, to make the AI model’s actions intelligible to humans.

According to an MIT study, reinforcement learning has grown multi-fold in the past decade. The turning point came in 2016, when DeepMind’s AlphaGo defeated Go world champion Lee Sedol.

The reward function is at the heart of reinforcement learning. However, RL agents pick up idiosyncratic behaviours and communication protocols during the learning process, which makes them unfit for real-world human-AI cooperation.


To that end, Meta AI has introduced a flexible approach, off-belief learning, to make the AI model’s actions intelligible to humans. Off-belief learning looks for grounded communication; the goal is to find the optimal way to communicate without making any assumptions. 

Source: MIT Technology Review

Off-belief learning

Off-belief learning (OBL) is a multi-agent reinforcement learning (MARL) algorithm that addresses the gap in earlier methods by controlling the depth of cognitive reasoning while still converging to optimal behaviour. OBL can also be applied iteratively to obtain a hierarchy of policies that converges to a unique solution, which makes it well suited to zero-shot coordination (ZSC) problems, where the goal is to maximise test-time performance between policies from independent training runs of the same algorithm (cross-play).

OBL has many parallels with the Rational Speech Acts (RSA) framework. RSA assumes a speaker-listener setting with grounded hints and starts from a literal listener (LL) that takes the grounded information into account. RSA then introduces a hierarchy of speakers and listeners, each level defined via Bayesian reasoning (i.e., a best response) given the level below. OBL is designed to deal with high-dimensional, complex Dec-POMDPs (decentralised partially observable Markov decision processes), where agents have to both act and communicate through their actions.
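To make the RSA hierarchy concrete, here is a minimal Python sketch of its first three levels. The three-object reference game, the names `STATES` and `UTTERANCES`, and the rationality parameter `alpha` are illustrative assumptions taken from the usual textbook presentation of RSA, not from Meta AI's work.

```python
# Hypothetical reference game: three objects, four one-word hints.
STATES = ["blue_square", "blue_circle", "green_square"]
UTTERANCES = ["blue", "green", "square", "circle"]


def is_true(utterance, state):
    """Grounded meaning: a hint is true if it names the colour or the shape."""
    return utterance in state.split("_")


def normalize(weights):
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def literal_listener(utterance):
    """Level 0: uniform over the states the hint is literally true of."""
    return normalize({s: 1.0 if is_true(utterance, s) else 0.0 for s in STATES})


def pragmatic_speaker(state, alpha=1.0):
    """Level 1: soft best response to the literal listener."""
    return normalize(
        {u: literal_listener(u)[state] ** alpha for u in UTTERANCES}
    )


def pragmatic_listener(utterance, alpha=1.0):
    """Level 2: Bayesian inference over states given the level-1 speaker."""
    return normalize(
        {s: pragmatic_speaker(s, alpha)[utterance] for s in STATES}
    )


if __name__ == "__main__":
    print(literal_listener("blue"))    # 0.5 / 0.5 over the two blue objects
    print(pragmatic_listener("blue"))  # shifts towards the blue square
```

In this toy game the literal listener splits "blue" evenly between the two blue objects, while the pragmatic listener favours the blue square (about 0.6), because a speaker who meant the circle would more likely have said "circle". Each level is a response to the level below, which is the structure the OBL hierarchy parallels.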

Zero-Shot Coordination – Self-play (SP) is one of the most common problem settings for learning in Dec-POMDPs, where a team of agents is trained and tested together. Optimal self-play policies typically rely on arbitrary conventions, which the entire team can jointly coordinate on during training. However, many real-world problems require agents to coordinate with other unknown AI agents and humans at test time. This desideratum was formalised as the Zero-Shot Coordination (ZSC) setting, where the goal is to find algorithms that allow agents to coordinate with independently trained agents at test time, a proxy for the independent reasoning process in humans. ZSC rules out arbitrary conventions as optimal solutions and instead requires learning algorithms that produce robust and, ideally, unique solutions across multiple independent runs.
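The gap between self-play and cross-play can be seen in a toy example. The sketch below assumes a hypothetical one-shot matching game in which two players score 1 only if they pick the same action. `train_run` is a stand-in for an independent training run that settles on an arbitrary convention; it is not a real learning algorithm.

```python
import itertools
import random

NUM_ACTIONS = 3  # hypothetical number of equally good conventions


def reward(action_a, action_b):
    """Team reward: 1 if both players chose the same action, else 0."""
    return 1.0 if action_a == action_b else 0.0


def train_run(seed):
    """Stand-in for one training run: picks an arbitrary convention."""
    return random.Random(seed).randrange(NUM_ACTIONS)


def self_play_score(policies):
    """Average reward when each policy is paired with a copy of itself."""
    return sum(reward(p, p) for p in policies) / len(policies)


def cross_play_score(policies):
    """Average reward over all ordered pairs of policies from different runs."""
    pairs = list(itertools.permutations(range(len(policies)), 2))
    return sum(reward(policies[i], policies[j]) for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    runs = [train_run(seed) for seed in range(10)]
    print("self-play :", self_play_score(runs))
    print("cross-play:", cross_play_score(runs))
```

Every run is perfect in self-play, yet runs that landed on different conventions score zero against each other. ZSC asks for training methods whose independent runs would all land on the same policy, so that the two scores coincide.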

One of the big challenges in ZSC under partially observable settings is to determine how to interpret the actions of other agents and how to select actions that will be interpretable to other agents. OBL addresses this issue by learning a hierarchy of policies, with an optimal grounded policy at the lowest level, which does not interpret other agents’ actions at all.

The researchers introduced the OBL operator, which computes a new policy π1 from any given policy π0. If a common-knowledge policy π0 is played by all agents up to τi, then agent i can compute a belief distribution Bπ0(τ | τi) = P(τ | τi, π0) over trajectories τ, conditional on its own action-observation history (AOH) τi. This belief distribution fully describes the effect of the history on the current state.
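A small Bayes-rule sketch shows how the assumed policy π0 shapes this belief. The setting is a deliberately reduced, hypothetical one: the "trajectory" is a single hidden card colour, and the AOH is just one observed partner action. The names `belief`, `random_policy` and `convention_policy` are illustrative.

```python
def belief(prior, policy, observed_action):
    """Posterior over hidden states given one observed action.

    Computes B_pi0(state | action), proportional to
    P(state) * pi0(action | state).
    """
    unnormalized = {
        state: prior[state] * policy[state][observed_action] for state in prior
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observed action is impossible under the assumed policy")
    return {state: weight / total for state, weight in unnormalized.items()}


# Hidden card colour with a uniform prior (assumption for illustration).
prior = {"red": 0.5, "blue": 0.5}

# pi0 that acts uniformly at random: actions carry no information.
random_policy = {"red": [0.5, 0.5], "blue": [0.5, 0.5]}

# pi0 that follows an arbitrary convention: action 0 means red, 1 means blue.
convention_policy = {"red": [1.0, 0.0], "blue": [0.0, 1.0]}

if __name__ == "__main__":
    print(belief(prior, random_policy, 0))      # stays at the prior
    print(belief(prior, convention_policy, 0))  # collapses onto "red"
```

When π0 is uniformly random, the partner's action leaves the belief at the prior, so a policy trained against that belief has no reason to read meaning into the action. When π0 follows a convention, the same action pins down the hidden state.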

The OBL operator is defined to be the operator that maps an initial policy π0 to a new policy π1 as follows:
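As a hedged reconstruction, assuming the softmax form used in the OBL paper, the operator can be written as below. Here Q^{π0→π1}(a | τ^i) is the counterfactual value of action a when the past is assumed to have been played by π0 and the future by π1, and the temperature T is a symbol not defined in this article.

```latex
\pi_1(a \mid \tau^i)
  = \frac{\exp\!\left( Q^{\pi_0 \to \pi_1}(a \mid \tau^i) / T \right)}
         {\sum_{a'} \exp\!\left( Q^{\pi_0 \to \pi_1}(a' \mid \tau^i) / T \right)}
```

As T approaches zero, π1 approaches the greedy policy with respect to Q^{π0→π1}.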


The above equation suggests a simple algorithm for computing an off-belief learning policy in small tabular environments: compute Bπ0(τi) for each action-observation history (AOH), and then compute Qπ0→π1(τi) for each AOH in ‘backwards’ topological order. However, such approaches are feasible only for small POMDPs. To apply value iteration methods, the Bellman equation for Qπ0→π1 for each agent i is given as follows:
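One way to write this recursion, assuming the notation of the OBL paper, is sketched below. Here k is the number of environment steps until agent i acts again, R is the reward, and \mathcal{T} denotes the environment dynamics. These symbols are assumptions that this article does not define.

```latex
Q^{\pi_0 \to \pi_1}(a_t \mid \tau^i_t)
  = \mathbb{E}_{\tau_t \sim B_{\pi_0}(\tau^i_t),\; \tau_{t+k} \sim (\mathcal{T}, \pi_1)}
    \left[ \sum_{t'=t}^{t+k-1} R(\tau_{t'}, a_{t'})
      + \sum_{a_{t+k}} \pi_1(a_{t+k} \mid \tau^i_{t+k})\,
        Q^{\pi_0 \to \pi_1}(a_{t+k} \mid \tau^i_{t+k}) \right]
```

The current trajectory is sampled from the belief induced by π0, while everything from time t onwards is played according to π1. This split is what separates the recursion from an ordinary Bellman equation.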


The Hanabi experiment

Researchers at Meta AI tested the method in the more complex domain of Hanabi. Hanabi is a popular benchmark for research on MARL, theory of mind, and zero-shot coordination.

Hanabi is a cooperative card game for 2-5 players. The deck consists of 50 cards, split among five colours (suits) and five ranks, with each colour having three 1s, two each of 2s, 3s and 4s, and one 5. In the 2-player setting, each player holds a 5-card hand. Here, the game gets tricky, as each player can see their partner’s hand but not their own. The team strives to play one card of each rank in each colour, in order from 1 to 5. The team shares 8 hint tokens and 3 life tokens. In each turn, a player must play or discard a card from their hand, or spend a hint token to give a hint to their partner. Playing a card succeeds if it is the lowest-rank card in its colour not yet played; otherwise it fails and the team loses a life token. Giving a hint consists of choosing a rank or a colour that the partner’s hand contains and indicating all cards in the partner’s hand sharing that colour or rank. Discarding a card or successfully playing a 5 regains one hint token. The team’s score is zero if all life tokens are lost; otherwise it equals the number of cards successfully played, giving a maximum possible score of 25.
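The rules above can be sketched in a few lines of Python. This is a minimal illustration, not a full game engine. The colour names and the cap of 8 on hint tokens are assumptions, since the article states only the starting count.

```python
COLORS = ["red", "green", "blue", "yellow", "white"]  # hypothetical names
COPIES_PER_RANK = {1: 3, 2: 2, 3: 2, 4: 2, 5: 1}
MAX_HINT_TOKENS = 8  # assumed cap, equal to the starting count


def build_deck():
    """Return the full deck as (colour, rank) tuples."""
    return [
        (color, rank)
        for color in COLORS
        for rank, copies in COPIES_PER_RANK.items()
        for _ in range(copies)
    ]


def is_playable(card, fireworks):
    """A card is playable if it is the next rank for its colour."""
    color, rank = card
    return rank == fireworks[color] + 1


def play_card(card, fireworks, life_tokens, hint_tokens):
    """Apply a play action and return the updated (fireworks, lives, hints)."""
    color, rank = card
    fireworks = dict(fireworks)  # copy so the caller's state is untouched
    if is_playable(card, fireworks):
        fireworks[color] = rank
        if rank == 5:  # completing a colour regains one hint token
            hint_tokens = min(MAX_HINT_TOKENS, hint_tokens + 1)
    else:
        life_tokens -= 1
    return fireworks, life_tokens, hint_tokens


def score(fireworks, life_tokens):
    """Zero if all lives are lost, else the number of cards played."""
    return 0 if life_tokens == 0 else sum(fireworks.values())


if __name__ == "__main__":
    table = {color: 0 for color in COLORS}
    table, lives, hints = play_card(("red", 1), table, life_tokens=3, hint_tokens=8)
    print(len(build_deck()), table["red"], lives, score(table, lives))
```

`fireworks` maps each colour to the highest rank played so far, so a full table sums to 5 × 5 = 25, which matches the maximum score stated above.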


The table shows that OBL, trained at four different levels, performs better than the other models.


Off-belief learning is a new method that can train optimal grounded policies, preventing agents from exchanging information through arbitrary conventions. When used in a hierarchy, each level adds one step of reasoning over beliefs from the previous level, thus providing a controlled means of reintroducing conventions and pragmatic reasoning into the learning process. Crucially, OBL removes the ‘weirdness’ of learning in Dec-POMDPs, since, given the beliefs induced by the level below, each level has a unique optimal policy. Therefore, OBL at convergence can solve instances of the ZSC problem. Importantly, OBL’s performance gains under ZSC directly translate to state-of-the-art ad-hoc teamplay and human-AI coordination results, validating the “ZSC hypothesis”.

Kartik Wali
A writer by passion, Kartik strives to get a deep understanding of AI, Data analytics and its implementation on all walks of life. As a Senior Technology Journalist, Kartik looks forward to writing about the latest technological trends that transform the way of life!
