DeepMind is known for pushing boundaries and has made a name for itself with path-breaking innovations such as AlphaFold, AlphaFold 2, and Enformer, among many others. One favourite area of research for the Alphabet-owned, London-headquartered firm has been games and the deployment of AI and deep learning in them. After back-to-back innovations in the gaming space, DeepMind has gone a step further and created a system called Player of Games (PoG), whose structure and mechanism it has described in a research paper.
What makes Player of Games stand out is that it performs well at both perfect and imperfect information games. Player of Games reaches strong performance in perfect information games such as Chess and Go; it also outdid the strongest openly available agent in heads-up no-limit Texas hold 'em poker (Slumbot) and defeated the state-of-the-art agent in Scotland Yard.
What exactly is PoG
DeepMind notes that straightforward applications of traditional search suffer from well-known problems in imperfect information games, and that prior evaluations have focused on single domains. This is where Player of Games steps in, using a single algorithm with minimal domain-specific knowledge, with search capabilities that carry across these fundamentally different game types. The paper says the algorithm is guaranteed to find an approximate Nash equilibrium by resolving subgames to remain consistent during online play, and that it yields low exploitability in practice in small games where exploitability is computable.
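In small games, exploitability can be computed directly: it measures how much each player could gain by switching to a best response against the other's strategy, and it is zero exactly at a Nash equilibrium. Here is a minimal sketch for a two-player zero-sum matrix game (an illustrative example, not code from the paper):

```python
# Exploitability of a strategy profile in a tiny two-player zero-sum
# matrix game (a hypothetical illustration, not DeepMind's code).
# A[i][j] is the row player's payoff; the column player receives -A[i][j].

def exploitability(A, x, y):
    """Sum of each player's best-response gain over the current profile."""
    n, m = len(A), len(A[0])
    # Row player's expected payoff for each pure row against strategy y.
    row_values = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
    # Row payoff for each pure column against strategy x (column player minimises it).
    col_values = [sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
    value = sum(x[i] * row_values[i] for i in range(n))
    # Best responses: row player maximises, column player minimises.
    return (max(row_values) - value) + (value - min(col_values))

# Matching pennies: the uniform profile is the Nash equilibrium.
A = [[1, -1], [-1, 1]]
print(exploitability(A, [0.5, 0.5], [0.5, 0.5]))  # 0.0 at equilibrium
print(exploitability(A, [0.8, 0.2], [0.5, 0.5]))  # positive for a biased row player
```

In games like Chess or Poker this quantity is intractable to compute exactly, which is why the paper only reports it for small games.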
Perfect and imperfect information
DeepMind's earlier system AlphaGo could not play Chess, but its successor, AlphaZero, could. AlphaZero, however, could not play Poker, an imperfect information game. DeepMind notes that strong poker play has relied on game-theoretic reasoning to properly conceal private information.
It adds, “Initially, super-human poker agents were based primarily on computing approximate Nash equilibria offline. Search was then added and proved to be a crucial ingredient to achieve super-human success in no-limit variants.”
Whatever advances have been made so far, each has focused on a single game, making clear use of domain-specific knowledge and structure to perform well.
Growing-tree counterfactual regret minimisation mechanism
DeepMind says that PoG uses growing-tree counterfactual regret minimisation (GT-CFR), which builds subgames non-uniformly, expanding the tree toward the most relevant future states while iteratively refining values and policies. PoG is also trained via self-play, which trains value-and-policy networks using both game outcomes and recursive sub-searches applied to situations that arose in previous searches.
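The counterfactual-regret idea underlying GT-CFR can be illustrated on the simplest possible case, a one-shot matrix game, using plain regret matching. This is only a sketch of the core regret-matching update, not DeepMind's GT-CFR, which additionally grows a search tree and trains neural networks on top of this idea:

```python
# Regret-matching self-play on rock-paper-scissors: the basic building
# block of counterfactual regret minimisation (an illustrative sketch,
# not DeepMind's GT-CFR implementation).
import random

random.seed(0)

ACTIONS = 3  # rock, paper, scissors
# PAYOFF[a][b] is the payoff for choosing a against an opponent choosing b.
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]

def get_strategy(regret):
    """Regret matching: mix actions in proportion to positive cumulative regret."""
    positive = [max(r, 0.0) for r in regret]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / ACTIONS] * ACTIONS

def sample(strategy):
    """Sample an action index from a mixed strategy."""
    r, cumulative = random.random(), 0.0
    for action, prob in enumerate(strategy):
        cumulative += prob
        if r < cumulative:
            return action
    return ACTIONS - 1

def train(iterations):
    """Self-play with regret matching; returns both players' average strategies."""
    regrets = [[0.0] * ACTIONS for _ in range(2)]
    strategy_sums = [[0.0] * ACTIONS for _ in range(2)]
    for _ in range(iterations):
        strategies = [get_strategy(r) for r in regrets]
        moves = [sample(s) for s in strategies]
        for p in range(2):
            opponent_move = moves[1 - p]
            value = PAYOFF[moves[p]][opponent_move]
            for a in range(ACTIONS):
                # Regret: how much better action a would have done than the actual play.
                regrets[p][a] += PAYOFF[a][opponent_move] - value
                strategy_sums[p][a] += strategies[p][a]
    # The *average* strategy converges to an equilibrium (uniform for RPS).
    return [[s / iterations for s in sums] for sums in strategy_sums]

average = train(50_000)
print(average[0])  # approaches the uniform equilibrium [1/3, 1/3, 1/3]
```

In full CFR the same regret update is applied at every information state of a sequential game; GT-CFR's contribution is to apply it on a tree that is grown selectively toward the most relevant future states.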
What sets it apart from contemporaries?
DeepMind said that Player of Games is the first algorithm to achieve strong performance in challenging domains with both perfect and imperfect information. A few AI-based game innovations that already exist include:
- AlphaZero – One of the most pioneering innovations of deploying AI in games, AlphaZero taught itself to play Go, Chess, and shogi (a Japanese version of Chess). It also managed to beat state-of-the-art programs specialising in these three games.
- MuZero – It also came from DeepMind itself two years after AlphaZero. MuZero can master games such as Go, Chess, and shogi from scratch (along with Atari) without being told the rules.
- DeepStack – DeepStack has managed to beat professional poker players at a two-player poker variant called heads-up no-limit Texas hold’em. Instead of building its strategy beforehand, DeepStack recalculated it at each step, considering the current state of the game.
- Libratus – An AI program built to play Poker, specifically heads-up no-limit Texas hold ’em. Developed at Carnegie Mellon University, Pittsburgh, its creators intended for it to be generalisable to other, non-Poker-specific applications.
PoG’s performance across the four games
DeepMind said in the research paper that it trained a version of AlphaZero using its original settings in Chess and Go. It used 3500 concurrent actors using one TPUv4 each for a total of 800k training steps. PoG was also trained using a similar amount of the same resources.
In Chess, DeepMind evaluated PoG against Stockfish 8 (level 20) and AlphaZero. PoG was trained for 3 million training steps. Stockfish uses various search controls, such as the number of threads and time per search, during evaluation. DeepMind evaluated AlphaZero and PoG with up to 60,000 simulations. A tournament was also played between all of the agents, at 200 games per pair of agents.
Image: DeepMind Player of Games
In Go, the company evaluated PoG (60000, 10) in a similar tournament to the one in Chess, against AlphaZero and two earlier Go programs: GnuGo (at its highest level, 10) and Pachi v7.0.0 (with 10k and 100k simulations). The PoG network was trained for 1 million training steps.
How did PoG fare
Performing at the level of top human amateur
PoG performed strongly in both Chess and Go. In Chess, PoG (60000, 10) is stronger than Stockfish using four threads and one second of search time. The paper adds, “In Go, PoG (16000, 10) is more than 1100 Elo stronger than Pachi with 100,000 simulations”. PoG (16000, 10) also won 0.5% (2/400) of its games against the strongest AlphaZero (s=8000, t=800k). This suggests that PoG performs at the level of a top human amateur, possibly even at professional level. In both games, however, PoG is weaker than AlphaZero.
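To put the “1100 Elo stronger” figure in perspective, the standard Elo model (a general formula, not something specific to the paper) converts a rating gap into an expected score:

```python
# Expected score under the standard Elo model (a general formula,
# not specific to the Player of Games paper).

def elo_expected_score(diff):
    """Expected score for the stronger player, given an Elo gap `diff`."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

print(round(elo_expected_score(1100), 4))  # ~0.998: a near-certain win per game
```

An 1100-point gap therefore means PoG would be expected to win almost every game against Pachi.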
Image: DeepMind Player of Games
In heads-up no-limit Texas hold ’em, the DeepMind team evaluated PoG against Slumbot2019, an openly available computer poker player. PoG used randomised betting abstractions to reduce the number of actions from 20,000 to 4 or 5. Before evaluation, PoG (10, 0.1) was trained for up to 1.1 million training steps.
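The idea of a betting abstraction can be sketched simply: the thousands of legal bet sizes are grouped into a handful of abstract actions, such as fixed fractions of the pot plus all-in. The fractions and the nearest-action mapping below are hypothetical illustrations, not PoG's actual (randomised) scheme:

```python
# Illustrative sketch of a betting abstraction (not PoG's actual,
# randomised scheme): many concrete bet sizes collapse to a few actions.

ABSTRACT_BETS = [0.5, 1.0, 2.0]  # hypothetical pot fractions

def abstract_action(bet, pot, stack):
    """Map a concrete bet size to the nearest abstract bet size."""
    candidates = [f * pot for f in ABSTRACT_BETS] + [stack]  # plus all-in
    return min(candidates, key=lambda c: abs(c - bet))

print(abstract_action(bet=130, pot=100, stack=1000))  # nearest: the pot-sized bet
```

PoG's randomised variant instead maps a concrete bet probabilistically between neighbouring abstract sizes, which reduces the bias a fixed rounding rule introduces.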
In Scotland Yard, the current state-of-the-art agent is based on Monte Carlo tree search (MCTS) with game-specific heuristic enhancements. This agent is called “PimBot”.
How did PoG fare
The paper says PoG was trained for up to 17 million steps. For evaluation, the team played head-to-head matches of PoG (400, 1) against PimBot at different numbers of simulations per search. PoG won significantly (a 55% win rate) even against PimBot with 10 million search simulations, while PoG searched only a tiny fraction of the game.