Cooperation is one of the biggest challenges in multi-agent models, where complications arise from behavioural sophistication or from the sheer number of agents involved. Effective cooperation among models trained by reinforcement learning depends on the right agents cooperating with one another. When agents have hidden or misaligned motivations and goals, cooperation fails, and so do the multi-agent model dynamics. A recent study by researchers at DeepMind and Harvard proposes a social deduction game, Hidden Agenda, that provides a 2D environment for studying learning agents in scenarios of unknown team alignment and for improving the cooperation dynamics in multi-agent models.
Social deduction games
While multi-agent models are extremely beneficial, they always come with the possibility that the agents hold conflicting motives instead of shared goals. Multi-agent research tends to study this conflict under the assumption of perfect information, but that is rarely the case with social intelligence in real life. Social deduction games, such as Mafia or One Night Werewolf, are simulations used to study intent inference through team-based activities that model cooperation under uncertainty.
In social deduction games, groups of players attempt to decipher each other’s hidden roles. They need to observe the other players’ actions to deduce their roles while keeping their own roles hidden. Essentially, to succeed in the game, a player needs to learn about the other agents through various sources while remaining anonymous. This requires players to work cooperatively against the other team.
DeepMind and Harvard’s Hidden Agenda is a social deduction game that trains multiple players divided into two teams, ‘Crewmates’ and ‘Impostors’. Crewmates have a numerical advantage, and their goal is to refuel their ship using energy cells scattered around the map. Impostors have an informational advantage, and their goal is to halt the Crewmates. This means the Crewmates are unaware of the roles of the other players, while the Impostors know them. At the start of each episode, each player is randomly assigned a role and an avatar colour and is initialised to a location on the game map.
“We show that Hidden Agenda admits a rich strategic space where behaviours like camping, chasing, pairing, and voting emerged during co-training of a population of reinforcement learning agents,” the paper claimed.
The game is played in two alternating phases, the Situation Phase and the Voting Phase. During the Situation Phase, players can move around the environment to collect green fuel cells and add them to their inventory, which can hold up to 2 fuel cells at any given time. Players can then deposit the fuel down the grate on the map; the more fuel deposited, the closer the Crewmates are to winning. Meanwhile, the Impostors can use their special freezing beam to freeze any Crewmate for the remainder of the game.
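The Situation Phase inventory rule can be illustrated with a minimal Python sketch. The `Player` class and its method names are hypothetical and are not taken from the Hidden Agenda code; only the two-cell cap and the permanent freeze come from the description above, and depositing all carried cells at once is an assumption.

```python
from dataclasses import dataclass

MAX_INVENTORY = 2  # from the article: an inventory holds up to 2 fuel cells


@dataclass
class Player:
    """Hypothetical player state; field names are illustrative only."""
    inventory: int = 0
    frozen: bool = False  # set when an Impostor's beam hits a Crewmate

    def collect(self) -> bool:
        """Pick up one fuel cell if the player is active and has room."""
        if self.frozen or self.inventory >= MAX_INVENTORY:
            return False
        self.inventory += 1
        return True

    def deposit(self) -> int:
        """Drop every carried cell down the grate (assumed behaviour).

        Returns the number of cells deposited; frozen players deposit none.
        """
        if self.frozen:
            return 0
        deposited, self.inventory = self.inventory, 0
        return deposited
```

In this sketch a third `collect()` call fails until the player has deposited, which is what forces repeated trips to the grate.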
The Voting Phase is initiated when a Crewmate observes an Impostor’s freezing action, or when 200 timesteps have elapsed since the previous vote or the beginning of the game. During this phase, the agents are teleported to a voting room where, for 25 timesteps, they can cast public votes and see the previous votes made by all players. “By observing the voting of the rest of the populations, agents can start adapting their cooperative behaviour,” explained Jesus Rodriguez, Chief Scientist at Invector Labs. At the end of the phase, a player who receives at least half of the final votes is teleported to jail for the rest of the game.
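The voting trigger and the "at least half of the final votes" rule can be sketched roughly as follows. The function names are hypothetical. How the real environment treats abstentions and exact ties is not stated in the article, so this sketch assumes that only cast votes count and that a two-way tie jails nobody.

```python
from collections import Counter
from typing import Dict, Optional

VOTE_INTERVAL = 200  # timesteps between scheduled votes, per the article


def voting_triggered(steps_since_last_vote: int, freeze_witnessed: bool) -> bool:
    """A vote starts when a freeze is witnessed or the interval has elapsed."""
    return freeze_witnessed or steps_since_last_vote >= VOTE_INTERVAL


def tally_votes(final_votes: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the player sent to jail, or None if nobody is jailed.

    final_votes maps each voter to a target; None means no vote (assumption).
    """
    cast = [target for target in final_votes.values() if target is not None]
    if not cast:
        return None
    counts = Counter(cast)
    # Players who received at least half of the votes that were cast.
    jailed = [player for player, n in counts.items() if 2 * n >= len(cast)]
    # Assumption: if two players each hold exactly half, nobody is jailed.
    return jailed[0] if len(jailed) == 1 else None
```

With four cast votes, two votes against the same player are already enough to jail that player under this reading of the rule.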
The game ends as soon as any one of the following conditions is met: the Crewmates deposit enough fuel cells to power their ship, the Impostor is voted out, or all Crewmates are frozen or voted out.
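The end-of-game conditions can be expressed as a small check. This is a sketch under assumptions: the function and parameter names are hypothetical, the article does not give the fuel threshold (so `fuel_required` is passed in), and mapping each condition to a winning team is an inference from the teams' stated goals.

```python
from typing import Optional


def game_over(
    fuel_deposited: int,
    fuel_required: int,     # threshold is not given in the article
    impostor_jailed: bool,
    active_crewmates: int,  # Crewmates neither frozen nor jailed
) -> Optional[str]:
    """Return the winning team's name, or None if the game continues."""
    if fuel_deposited >= fuel_required:
        return "crewmates"  # the ship is powered
    if impostor_jailed:
        return "crewmates"  # the Impostor was voted out
    if active_crewmates == 0:
        return "impostor"   # every Crewmate is frozen or voted out
    return None
```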
Training in cooperation
The game’s training scheme involves five agents training together in multiple episodes. They are trained to play a single role, and all players play in each match, teaching each agent to adapt to the same fixed set of co-players with the same fixed roles. “However, these co-players learn as they do, making learning a dynamic interaction between agents. The Impostor agent must learn to make its behaviour similar to that of the Crewmate agents while hindering their objective. The Crewmate agents must thwart this by adapting their behaviour to achieve their objective,” the researchers explained.
The environment runs on a reward-punishment system in which agents receive team-based rewards at the end of each episode: the winners get +4 points and the losers get -4. In addition, the agents receive small bonuses throughout the game for specific actions. The game exposes several hyperparameters that can be modified to favour the Crewmates or the Impostors. The researchers chose their values after a hyperparameter sweep that determined the set of parameters eliciting the greatest diversity of agent strategies.
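As a rough illustration of the team-based terminal reward, the sketch below assigns +4 to every member of the winning team and -4 to everyone else. The function name and the role labels are hypothetical, and the small in-game bonuses are left out because the article does not give their values.

```python
from typing import Dict

WIN_REWARD = 4.0  # from the article: winners get +4, losers get -4


def terminal_rewards(roles: Dict[str, str], winning_team: str) -> Dict[str, float]:
    """Map each player to its end-of-episode team reward.

    roles maps a player id to a team label (labels here are illustrative).
    """
    return {
        player: WIN_REWARD if team == winning_team else -WIN_REWARD
        for player, team in roles.items()
    }


# Illustrative five-player episode; the 4-versus-1 split is an assumption.
roles = {
    "red": "crewmates",
    "blue": "crewmates",
    "green": "crewmates",
    "yellow": "crewmates",
    "purple": "impostor",
}
rewards = terminal_rewards(roles, winning_team="crewmates")
```

Because the reward is shared by the whole team, a Crewmate who was frozen or jailed still receives +4 if the remaining Crewmates win.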
Through Hidden Agenda, the researchers showcased that reinforcement learning agents can be trained to learn behaviours like partnering and voting without communicating in natural language.
Reinforcement learning architecture
DeepMind uses two reinforcement learning architectures to train and evaluate the agents in the environment. The first is a standard asynchronous advantage actor-critic (A3C) architecture consisting of a two-layer CNN with output channels, followed by a feed-forward MLP whose output is passed on to an LSTM layer. The architecture also includes a layer that estimates the policies used by the agents.
By bringing social deduction to 2D worlds, the researchers concluded that, despite hidden motivations and unreliable sources of information, the agents learnt how to play the game and exhibited distinct behaviours. Moreover, even traditional reinforcement learning agents can play this game after small modifications.