
What’s Neural Population Learning?

NeuPL is an efficient, general framework that learns and represents policies in symmetric zero-sum games within a single conditional network.


In strategy games such as StarCraft and poker, the need for diverse policies is usually addressed by growing a robust policy population: new policies are iteratively trained against the existing ones. However, the approach has two challenges. First, under a limited compute budget, best-response operators need truncating, leaving good responses under-trained. Second, repeatedly re-learning basic skills at every iteration is wasteful and becomes intractable against ever-stronger opponents.

Now, DeepMind and University College London have developed Neural Population Learning (NeuPL) to address both issues. The researchers show that NeuPL converges to a population of best responses under mild assumptions, and that representing the whole population as a single conditional model enables transfer learning across policies. The research showed NeuPL improves performance across various test domains, and it also helps explain why novel strategies become more accessible as the neural population grows.

RTS and NeuPL

Classical game theory underpins population learning. The study uses rock-paper-scissors as an illustration: a population containing only two of the strategies (say, rock and paper) remains exploitable once it is revealed, so the population must keep growing until it covers the full strategic cycle. This is the idea behind Policy Space Response Oracles (PSRO), in which new policies are trained as best responses to a mixture of existing policies chosen by a meta-strategy solver. A PSRO variant was used to master StarCraft II in 2019.
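
To make the PSRO loop concrete, here is a minimal sketch that runs it on rock-paper-scissors. The fictitious-play meta-strategy solver, the exact best-response step and all names below are illustrative assumptions rather than the implementation used in the work discussed here; in practice, the best-response step is approximated with reinforcement learning.

```python
import numpy as np

# Row player's payoff in rock-paper-scissors: entry [i, j] is the payoff of
# pure strategy i against pure strategy j (0 = rock, 1 = paper, 2 = scissors).
PAYOFF = np.array([[ 0., -1.,  1.],
                   [ 1.,  0., -1.],
                   [-1.,  1.,  0.]])

def meta_strategy_solver(meta_payoffs, iters=2000):
    # Approximate a symmetric Nash mixture of the restricted meta-game with
    # fictitious play (an illustrative choice; PSRO admits many solvers).
    counts = np.ones(meta_payoffs.shape[0])
    for _ in range(iters):
        avg = counts / counts.sum()
        counts[int(np.argmax(meta_payoffs @ avg))] += 1.0
    return counts / counts.sum()

def best_response(population, mixture):
    # Exact best response to the opponent mixture; only possible because this
    # is a tiny matrix game, whereas PSRO normally uses an RL "oracle" here.
    expected = sum(w * PAYOFF[:, p] for w, p in zip(mixture, population))
    return int(np.argmax(expected))

population = [0]                          # start from a single pure strategy: rock
for _ in range(4):                        # grow the population for a few iterations
    restricted = PAYOFF[np.ix_(population, population)]
    mixture = meta_strategy_solver(restricted)
    population.append(best_response(population, mixture))

print(population)  # rock -> paper -> scissors: the strategic cycle gets covered
```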

In two-player zero-sum games, the quality of a strategy ultimately comes down to winning or losing. Relative performance matters, though, when a population of pure strategies is pitted against another: in the meta-game, a player who can respond after seeing the opponent's choice can always match or beat a player who must commit to a strategy first.
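
As a quick numerical check of that second-mover advantage (an illustrative example using the same rock-paper-scissors payoffs, not taken from the paper): whichever strategy the first player commits to, the responding player can secure a non-negative payoff.

```python
import numpy as np

# Symmetric zero-sum meta-game payoffs (rock-paper-scissors, row player's view).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

# If the first player commits to column i, the responder can pick the row
# j = argmax_j A[j, i]; copying the choice already guarantees a tie, so the
# responder's payoff is never negative (here it is always a win).
for i in range(3):
    j = int(np.argmax(A[:, i]))
    print(f"first commits to {i}; responding with {j} earns payoff {A[j, i]}")
```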

Two implications follow when good responses cannot be distinguished from globally optimal best responses. First, approximate best-response operators are often truncated prematurely according to hand-crafted schedules. Second, real-world games call for strategy-agnostic transitive skills as prerequisites to strategic reasoning, yet learning those skills from scratch against skilful opponents is difficult.

Built on the computational infrastructure of simple self-play, NeuPL is an efficient and general framework that learns and represents policies in symmetric zero-sum games within a single conditional network. Most importantly, NeuPL allows transfer learning across policies, which lets it discover ways to overcome strong opponents that were previously inaccessible to comparable baselines.
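
The representational idea can be sketched as a single policy network that receives, alongside the observation, a vector describing the opponent mixture a given population member should best-respond to. The architecture below, its layer sizes and the conditioning-by-concatenation are assumptions made for illustration, not the network described in the paper.

```python
import torch
import torch.nn as nn

class ConditionalPolicy(nn.Module):
    """One network represents the whole population by conditioning on the
    opponent mixture that each member is trained to best-respond to."""

    def __init__(self, obs_dim, num_actions, population_size, hidden=128):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(obs_dim + population_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.action_head = nn.Linear(hidden, num_actions)

    def forward(self, obs, opponent_mixture):
        # All population members share these parameters, so transitive skills
        # learned by one member are immediately available to the others.
        x = torch.cat([obs, opponent_mixture], dim=-1)
        return torch.log_softmax(self.action_head(self.torso(x)), dim=-1)

# Member i of the population is simply the same network evaluated with the
# mixture it best-responds to (dimensions here are arbitrary placeholders).
policy = ConditionalPolicy(obs_dim=16, num_actions=4, population_size=8)
obs = torch.randn(1, 16)
sigma = torch.zeros(1, 8)
sigma[0, :3] = 1.0 / 3.0      # e.g. best-respond to a uniform mix of members 0-2
log_probs = policy(obs, sigma)
```

Because every member shares the same torso, skills acquired while countering one opponent mixture carry over when training against another, which is the transfer the authors emphasise.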

When a game is fully transitive, every policy shares the same best response, and self-play offers a natural curriculum that converges to it. Nevertheless, assuming transitivity is not safe in real-world games, since strategic cycles cannot be ruled out without an exhaustive policy search. In these games, NeuPL retains the ability to capture strategic cycles while falling back to self-play if the game turns out to be transitive.

Previously, similar attempts were made to make population learning scalable. One such study proposed Pipeline PSRO (P2SRO), which learns iterative best responses in parallel, in a staggered, hierarchical fashion. It offered a principled way to use additional computation while retaining PSRO's convergence guarantee.

However, P2SRO does not make learning more efficient per unit of computation, nor does it address the lack of transfer across policies, so basic skills are still re-learned from scratch. Another approach, "Mixed-Oracles", accumulates knowledge acquired over previous iterations via an ensemble of policies; under this scheme, each policy is trained to respond to a pure meta-game strategy rather than the mixture strategy suggested by the meta-strategy solver.

In comparison, NeuPL allows such transfer and optimises the Bayes-optimal objective directly. The team thus offers an efficient, general and principled framework for learning and representing diverse policies in real-world games using a single conditional model. DeepMind and UCL describe the study as a step toward scalable policy-space learning, and point to extending it beyond the symmetric zero-sum setting as a direction for future research.
