Standard multi-agent reinforcement learning (MARL) methods usually concentrate on the self-play (SP) setting, in which strategies are constructed by having agents play the game against copies of themselves repeatedly. However, this approach can break down in settings such as autonomous driving and cooperative game playing, where an agent has to work with partners it has never met.
Applied to a zero-shot coordination problem, self-play can produce agents that establish highly specialised conventions, and these conventions do not carry over to novel partners that the agents were not trained with.
To mitigate this problem, researchers from Facebook AI Research studied zero-shot coordination and constructed AI agents that can coordinate with novel partners they have not seen before. The researchers introduced a learning approach known as Other-Play (OP), which enhances self-play by searching for more robust strategies, exploiting the presence of known symmetries in the underlying problem.
Other-Play is an algorithm for constructing good strategies for the zero-shot coordination setting. Its goal is to find a strategy that is robust to partners who break symmetries in different ways while still playing within the same class of policies. According to the researchers, the algorithm could be used to advance autonomous driving and can be combined with many of the algorithmic innovations that have been developed to improve self-play in various games.
In self-play, or self-training, the agent controls both players during training and iteratively improves both players' strategies. The agent then uses the resulting strategy at test time. If it converges, self-play finds a Nash equilibrium of the game, and it has yielded superhuman AI agents in two-player zero-sum games such as chess, Go and poker.
In complex environments, self-play agents typically construct "inhuman" strategies, which may be a benefit in zero-sum games. The researchers also contrast OP with the abstractions used in self-play for poker: in poker, abstractions are used to find Nash equilibrium strategies of the original game, while OP uses symmetries to select among a set of possible equilibria.
How the OP Algorithm Works
The researchers developed the Other-Play (OP) algorithm to construct good strategies for the zero-shot coordination setting. Consider, for example, constructing a strategy for agent 1 when agent 2 is an unknown, novel partner.
In this case, the Other-Play objective for agent 1 maximises the expected return when it is randomly matched with a symmetry-equivalent policy of agent 2, rather than with one particular policy. The researchers describe this as a version of self-play in which the agents are not assumed to be able to coordinate on exactly how to break symmetries.
In practice, Other-Play uses reinforcement learning (RL) to maximise the reward an agent receives when it is matched with agents that play the same policy under the known symmetries of the game.
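The objective described here can be written compactly. The notation below is an assumption made for illustration: J(π¹, π²) stands for the expected joint return when agent 1 plays π¹ and agent 2 plays π², Φ is the set of known symmetries of the problem, and φ is a symmetry drawn uniformly at random from Φ.

```latex
% Self-play: both agents are assumed to use exactly the same joint policy.
\pi^{*}_{\mathrm{SP}} = \arg\max_{\pi} \; J\left(\pi^{1}, \pi^{2}\right)

% Other-Play: agent 2's policy is replaced by a randomly chosen
% symmetry-equivalent version, so no particular way of breaking
% symmetries can be relied upon.
\pi^{*}_{\mathrm{OP}} = \arg\max_{\pi} \; \mathbb{E}_{\phi \sim \Phi}
    \left[ J\left(\pi^{1}, \phi\left(\pi^{2}\right)\right) \right]
```

Read this way, the only change from self-play is the expectation over φ, which is why the approach can sit on top of an existing self-play method.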
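To make the idea concrete, here is a minimal sketch on a small hypothetical matching game that is not taken from the text above: two agents each pull one of four levers and are rewarded only if they pull the same one. Three levers pay 1.0 and are interchangeable, and one pays 0.9. The payoffs, function names and the exhaustive enumeration of symmetries are all assumptions for illustration, not the authors' implementation, which uses deep RL.

```python
from itertools import permutations

# Hypothetical toy game: reward is paid only when both agents pick the same lever.
PAYOFFS = [1.0, 1.0, 1.0, 0.9]  # levers 0-2 are interchangeable, lever 3 is unique
INTERCHANGEABLE = [0, 1, 2]     # relabelling these levers leaves the game unchanged


def expected_return(pi1, pi2):
    """Expected reward when agent 1 plays distribution pi1 and agent 2 plays pi2."""
    return sum(p1 * p2 * r for p1, p2, r in zip(pi1, pi2, PAYOFFS))


def symmetries():
    """Yield every relabelling phi (lever i -> phi[i]) that preserves the payoffs."""
    for perm in permutations(INTERCHANGEABLE):
        phi = list(range(len(PAYOFFS)))
        for src, dst in zip(INTERCHANGEABLE, perm):
            phi[src] = dst
        yield phi


def apply_symmetry(phi, pi):
    """Return the policy obtained by moving the probability of lever i to lever phi[i]."""
    out = [0.0] * len(pi)
    for lever, prob in enumerate(pi):
        out[phi[lever]] = prob
    return out


def self_play_return(pi):
    """Self-play objective: the partner uses exactly the same policy."""
    return expected_return(pi, pi)


def other_play_return(pi):
    """Other-Play objective: average over partners that are symmetry-equivalent to pi."""
    phis = list(symmetries())
    total = sum(expected_return(pi, apply_symmetry(phi, pi)) for phi in phis)
    return total / len(phis)


def one_hot(lever):
    """Deterministic policy that always pulls the given lever."""
    return [1.0 if i == lever else 0.0 for i in range(len(PAYOFFS))]


if __name__ == "__main__":
    for lever in range(len(PAYOFFS)):
        pi = one_hot(lever)
        print(lever, self_play_return(pi), round(other_play_return(pi), 3))
```

Under these assumptions, always pulling lever 0 scores 1.0 in self-play but only about 0.33 on average under the Other-Play objective, because a partner who settled on lever 1 or lever 2 is equally plausible. The unique 0.9 lever scores 0.9 under both objectives, so it is the choice that Other-Play favours.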
Benefits of Other-Play
When the researchers studied the cooperative card game Hanabi, they showed that OP agents achieve higher scores when paired with independently trained agents. They chose Hanabi because it has been established as a benchmark environment for multi-agent decision making in partially observable settings.
OP can be seen as a simple extension of self-play and can be applied on top of any self-play algorithm. The researchers also showed that OP agents obtained higher average scores when paired with human players than state-of-the-art self-play agents did.
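Pairing independently trained agents with one another is often called cross-play evaluation. The sketch below shows one way such a comparison could be computed. The function names are hypothetical, and the "agents" in the usage example are plain integers standing in for learned conventions in a trivial matching game, not Hanabi agents or results from the paper.

```python
def cross_play_scores(agents, play_episode):
    """Return (mean self-play score, mean cross-play score) for a list of agents.

    play_episode(a, b) is a caller-supplied function (an assumption of this
    sketch) that returns the score obtained when agent a is paired with agent b.
    """
    n = len(agents)
    if n < 2:
        raise ValueError("cross-play needs at least two agents")
    # Each agent paired with itself.
    self_play = [play_episode(a, a) for a in agents]
    # Each agent paired with every *other* agent, in both seat orders.
    cross_play = [
        play_episode(agents[i], agents[j])
        for i in range(n)
        for j in range(n)
        if i != j
    ]
    return sum(self_play) / n, sum(cross_play) / len(cross_play)


def matching_game(a, b):
    """Toy stand-in for an environment: score 1.0 only if both conventions agree."""
    return 1.0 if a == b else 0.0


if __name__ == "__main__":
    # Three agents that each settled on a different, equally good convention.
    print(cross_play_scores([0, 1, 2], matching_game))
```

In this toy setup every agent is perfect with itself and fails with every other agent, which is the gap between self-play and cross-play scores that zero-shot coordination methods aim to close.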
The main contributions of the authors in this research are:
- OP is introduced as a way of solving the zero-shot coordination problem.
- OP is shown to be the highest-payoff meta-equilibrium for the zero-shot coordination problem.
- The researchers show how to implement OP using deep reinforcement learning-based methods.
- OP is evaluated in the popular cooperative card game Hanabi.
Organisations have been using deep reinforcement learning and self-play strategies in their self-driving projects for a few years now. These methods help agents learn from their own experience. However, several shortcomings remain, which is one reason these cars are still not allowed to drive freely on busy roads. The new Other-Play method could eventually help researchers overcome some of these difficulties by addressing the zero-shot coordination problem. Also, at the beginning of this year, tech giant Apple detailed its plan to make its self-driving project more robust and sophisticated using the deep reinforcement learning paradigm with self-play.
Read the full paper here.