In its latest step towards general-purpose AI systems, DeepMind has proposed XLand, a virtual environment designed to support new learning algorithms that control both how an agent trains and the games on which it trains. XLand was introduced in a paper titled “Open-Ended Learning Leads to Generally Capable Agents”, in which DeepMind researchers demonstrated a technique for training an agent capable of playing many different games without requiring human interaction data.
Challenges with traditional reinforcement learning
The repetitive process of trial and error has proven effective in teaching computer systems to play many games, including chess, shogi, Go, and StarCraft II. However, one of the main challenges with reinforcement learning-trained systems is a lack of task diversity in training: because such systems are not trained on a broad enough set of tasks, they are unable to adapt their learned behaviours to new ones.
For instance, AlphaZero performed well against some of the world’s best chess, shogi, and Go programmes even though it knew only each game’s basic rules. The hitch was that since AlphaZero trained on each game through repetition, it could not learn a different game or task without starting again from scratch. The same was true of other reinforcement learning systems.
Determined to create AI agents that could overcome these limitations, DeepMind built XLand. In addition to having a far bigger range of possible games to work on, these agents can now deal with entirely new situations and take on games and tasks they have never seen before.
The DeepMind AI agents are represented by 3D virtual avatars that inhabit a multiplayer online environment meant to mimic the physical world. The agents perceive their surroundings through RGB images and learn how to interact with many different games and genres.
XLand lets a user programmatically specify the game space, so training tasks can be generated algorithmically and automatically. Within this space, the behaviour of each player significantly shapes the challenges faced by the other AI agents. This complex, non-linear relationship between environment and behaviour yields rich training data, because even minute alterations to the components of the environment can radically change the challenges facing the virtual agents.
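The idea of a programmatically specified game space can be illustrated with a minimal sketch: a task is a sampled combination of world layout and goal, so tiny changes in the sampling parameters produce different challenges. All names and parameters below (`generate_task`, the object lists, the arena settings) are illustrative assumptions, not DeepMind's actual API.

```python
import random

# Illustrative vocabularies for building worlds and goals.
OBJECTS = ["cube", "pyramid", "sphere"]
COLOURS = ["black", "purple", "yellow"]
RELATIONS = ["near", "on", "hold"]

def generate_task(seed: int) -> dict:
    """Deterministically sample a task (world + goal) from a seed."""
    rng = random.Random(seed)
    world = {
        "floor_size": rng.randint(4, 10),     # side length of the arena
        "num_obstacles": rng.randint(0, 6),   # ramps/walls placed at random
        "objects": rng.sample(
            [f"{c} {o}" for c in COLOURS for o in OBJECTS], k=3
        ),
    }
    # Goal: e.g. "hold the purple cube" or "put the sphere near the pyramid".
    relation = rng.choice(RELATIONS)
    target = rng.choice(world["objects"])
    return {"world": world, "goal": f"{relation} {target}"}

# Even adjacent seeds yield different worlds and goals, giving an
# effectively unbounded stream of training tasks.
for seed in range(3):
    task = generate_task(seed)
    print(task["goal"], "| objects:", task["world"]["objects"])
```

Because the generator is a pure function of its seed and parameters, the task distribution itself can be tuned during training, which is the property the generational setup below relies on.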
As each generation of agents becomes more capable and robust, the task-generating functions improve to reflect that growth, and in turn each new generation adds its improved self to the multiplayer environment.
The team used a neural network architecture that incorporates an attention mechanism. To improve the agents’ overall capabilities, DeepMind applies population-based training (PBT) to adjust the parameters of the dynamic task generation. It also chains multiple training runs together so that each subsequent generation of agents can bootstrap off the previous one.
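Population-based training is a generic technique: a population of agents trains in parallel, and periodically the weakest members copy a stronger member and perturb its hyperparameters. The sketch below shows that exploit/explore loop under stated assumptions; the population size, quartile rules, and the toy `train_step` are illustrative, not DeepMind's exact configuration.

```python
import random

class Agent:
    def __init__(self, lr: float):
        self.lr = lr          # hyperparameter tuned by PBT
        self.score = 0.0      # running evaluation performance

    def train_step(self):
        # Stand-in for real RL updates: the closer lr is to an
        # (arbitrary) optimum of 0.01, the faster the score grows.
        self.score += max(0.0, 1.0 - abs(self.lr - 0.01) * 50)

def pbt(population_size: int = 8, steps: int = 20, seed: int = 0) -> Agent:
    rng = random.Random(seed)
    pop = [Agent(lr=rng.uniform(0.001, 0.1)) for _ in range(population_size)]
    for _ in range(steps):
        for agent in pop:
            agent.train_step()
        pop.sort(key=lambda a: a.score, reverse=True)
        # Exploit: each bottom-quartile agent copies a top-quartile agent...
        for weak in pop[-population_size // 4:]:
            strong = rng.choice(pop[: population_size // 4])
            weak.lr = strong.lr
            weak.score = strong.score
            # ...then explore: perturb the inherited hyperparameter.
            weak.lr *= rng.choice([0.8, 1.2])
    return pop[0]

best = pbt()
print(f"best lr ≈ {best.lr:.4f}")
```

In XLand the tuned parameters govern task generation rather than a single learning rate, but the exploit-then-explore loop is the same shape.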
After training five generations of virtual agents on approximately 700,000 unique games, spanning 200 billion training steps, DeepMind observed a significant rise in both learning and overall performance. This held for all procedurally generated evaluation tasks except a handful that even a human could not complete. The team added, “Our agents appear to exhibit more cooperative behaviour when playing with a copy of themselves. Given the nature of the environment, it is difficult to pinpoint intentionality — the behaviours we see often appear to be accidental, but still we see them occur consistently.”
According to DeepMind, in contrast to traditional, top-down approaches, the agents make frequent attempts at self-improvement through trial and error, exploring different states in search of rewarding outcomes. The systems displayed a wide range of behaviours rather than highly optimised patterns tailored to specific tasks.