After Go and Chess, AI Is Back to Defeat Mere Humans: This Time It's Stratego

DeepNash reached third rank on the Gravon platform, based on 50 ranked matches against top human players over the course of two weeks in April 2022

DeepMind has been a pioneer in building AI models that mimic human cognitive ability in games, which serve as a common testbed for assessing a model's capabilities. After mastering games like Go, chess and shogi, DeepMind has launched DeepNash, an AI model that can play Stratego at an expert level.

Mastering a game like Stratego is a significant achievement for AI research because it presents a challenging benchmark for learning strategic interactions at a massive scale. Stratego's complexity rests on two key aspects. Firstly, there are 10^535 possible states in the game, exponentially more than Texas hold 'em poker (10^164 states) or Go (10^360 states). Secondly, at the start of the game, any given situation in Stratego requires reasoning over 10^66 possible deployments for each player.
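To put those quoted figures in perspective, a couple of lines of arithmetic (using only the exponents cited above) show just how far apart these games sit on the complexity scale:

```python
# log10 of the state-space sizes quoted above
log_states = {"Stratego": 535, "Go": 360, "Texas hold'em": 164}

# Stratego has 10^(535-360) = 10^175 times as many states as Go,
# and 10^371 times as many as Texas hold 'em
gap_vs_go = log_states["Stratego"] - log_states["Go"]
gap_vs_poker = log_states["Stratego"] - log_states["Texas hold'em"]

print(gap_vs_go)     # 175
print(gap_vs_poker)  # 371
```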

DeepNash learns to play Stratego in a model-free, self-play manner, without any human demonstrations. It outperforms previous state-of-the-art AI agents and achieves expert human-level performance in the game's most complex variant, Stratego Classic.

The Nash Equilibrium

DeepNash is built, at its core, on a model-free reinforcement learning algorithm called Regularised Nash Dynamics (R-NaD).


DeepNash combines R-NaD with a deep neural network architecture and converges to an approximate Nash equilibrium by directly modifying the underlying multi-agent learning dynamics. With this technique, DeepNash was able to beat the existing state-of-the-art AI methods at Stratego, even achieving an all-time best ranking of #3 on the Gravon games platform against expert human players.

DeepNash's learning approach

DeepNash learns the deployment phase with an end-to-end approach, coupling deep reinforcement learning with game theory. The model's goal is to approximate a Nash equilibrium through self-play; playing such an equilibrium guarantees that the agent will perform well even against a worst-case opponent.
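That worst-case guarantee can be illustrated on a toy zero-sum game. The sketch below is not DeepMind's code; rock-paper-scissors stands in for Stratego. It measures a policy's exploitability: the payoff a best-responding opponent secures against it, which is exactly zero when the policy is a Nash equilibrium:

```python
import numpy as np

# Antisymmetric payoff matrix for rock-paper-scissors:
# A[i, j] = payoff to a player choosing action i against action j
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

def exploitability(pi):
    """Payoff a best-responding opponent achieves against mixed policy pi."""
    return float(np.max(A @ pi))

uniform = np.array([1/3, 1/3, 1/3])    # the Nash equilibrium of this game
skewed  = np.array([0.5, 0.25, 0.25])  # over-plays rock

print(exploitability(uniform))  # 0.0  -> unexploitable, even in the worst case
print(exploitability(skewed))   # 0.25 -> a paper-heavy opponent profits
```

The Nash policy concedes nothing to any opponent, while any deviation from it opens up a gap that a best responder can exploit, which is why DeepNash targets the equilibrium rather than any particular opponent model.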

Stratego's intractable search space defeats all existing search techniques, so DeepNash takes an orthogonal, search-free route and proposes a new method, R-NaD, which combines model-free reinforcement learning in self-play with a game-theoretic algorithmic idea.

This combined approach does not require modelling private states from public data. The challenge, however, is scaling up model-free reinforcement learning with R-NaD so that self-play becomes competitive against human Stratego experts, a feat no prior system had achieved.

DeepNash learns a Nash equilibrium in Stratego through self-play and model-free reinforcement learning. Combining model-free RL with self-play has been tried before, but such learning algorithms have proved empirically difficult to stabilise when scaled up to complex games.


The idea behind the R-NaD algorithm is to define a learning update rule whose induced dynamical system admits a Lyapunov function. This function decreases during learning, which in turn guarantees convergence to a fixed point that is a Nash equilibrium.
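As a rough illustration of that idea, consider this toy sketch: it is not DeepMind's implementation, substituting rock-paper-scissors for Stratego and a simple mirror-descent update for DeepNash's neural-network learner. The dynamics are regularised by a KL-style term pulling the policy toward a reference policy, and the reference is then repeatedly reset to the fixed point just found. Plain self-play dynamics cycle forever on this game; the regularised iteration instead settles at the uniform Nash equilibrium:

```python
import numpy as np

# Rock-paper-scissors payoffs: A[i, j] = payoff for action i vs action j
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

def regularised_step(pi, ref, eta=0.1, tau=0.5):
    # Reward transformed by a term pulling pi toward the reference policy
    q = A @ pi - tau * (np.log(pi) - np.log(ref))
    new = pi * np.exp(eta * q)       # multiplicative-weights update
    return new / new.sum()           # renormalise onto the simplex

pi = np.array([0.8, 0.1, 0.1])       # start far from equilibrium
ref = pi.copy()
for _ in range(50):                  # outer loop: move the reference policy
    for _ in range(300):             # inner loop: run the regularised dynamics
        pi = regularised_step(pi, ref)
    ref = pi.copy()                  # reset the reference to the fixed point

print(np.round(pi, 3))               # approaches [0.333, 0.333, 0.333]
```

The regularisation acts as the damping that makes a decreasing Lyapunov function possible: each inner loop converges to the regularised game's fixed point, and the sequence of reference resets walks those fixed points toward the original game's Nash equilibrium.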


To test DeepNash's capabilities, it was evaluated against both expert human players and the latest SOTA Stratego bots. The former test was performed on Gravon, a well-known online gaming platform for Stratego players; the latter against known Stratego bots such as Celsius, Asmodeus and PeternLewis.

  • Evaluation against Gravon: DeepNash played 50 ranked matches against top human players over the course of two weeks in April 2022 and won 42 of them, an 84 per cent win rate. Based on the classic Stratego ranking in 2022, this performance corresponds to a rating of 1799, making DeepNash the third-best player among all Gravon Stratego players. The result confirms that DeepNash reached expert human level in Stratego purely via self-play, without any help from existing human data.
  • Evaluation against SOTA Stratego bots: DeepNash went up against several existing Stratego bots, including Probe, Master of the Flag, Demon of Ignorance and Celsius 1.1, among others.


Despite training only through self-play, DeepNash beat all of the bots in an overwhelming majority of games. In the few matches it lost to Celsius 1.1, the latter took the high-risk strategy of grabbing a significant material advantage by capturing pieces with a high-ranking piece at the start of the game.

DeepNash is designed with the sole aim of learning a Nash-equilibrium policy during training, yet it picked up the qualitative behaviour of a top player. It generated a wide range of deployments, making it difficult for human players to find patterns to exploit, and it demonstrated the capability to make non-trivial trade-offs between information and material, execute bluffs, and take risks when needed.

Kartik Wali
A writer by passion, Kartik strives to get a deep understanding of AI, Data analytics and its implementation on all walks of life. As a Senior Technology Journalist, Kartik looks forward to writing about the latest technological trends that transform the way of life!
