In the last two articles, the focus was on measuring the generalization performance of reinforcement learning agents using the Gym Retro and Procgen environments.
Both of these are 2-D environments and are limited to a single-player arcade gaming experience. Although Procgen is procedurally generated, it still has the limitations of 2-D and hardly requires high-level planning, vision, and control.
Recognizing these limitations, various game-based AI environments have been proposed for training RL agents for robust generalization.
VizDoom is one such prominent framework. While it features a first-person perspective and complex gameplay, the age of the game means that the graphics are relatively primitive. Furthermore, the only kind of randomization is in enemy movement and item spawning, as the level topologies are fixed.
Environment Overview – Obstacle Tower
The Obstacle Tower is a procedurally generated environment consisting of multiple floors to be solved by a learning agent. It is designed to test the learning agent’s abilities in computer vision, locomotion skills, high-level planning, and generalization. It combines platforming-style gameplay with puzzles, planning problems, and critically, increases in difficulty as the agent progresses.
Within each floor, the goal of the agent is to arrive at the set of stairs leading to the next level of the tower. These floors are composed of multiple rooms, each of which can contain its own unique challenges. Furthermore, each floor contains a number of procedurally generated elements, such as visual appearance, puzzle configuration, and floor layout. This ensures that, in order to be successful at the Obstacle Tower task, an agent must be able to generalize to new and unseen combinations of conditions.
Figure: Different floors in Obstacle Tower.
Some of the features offered by Obstacle Tower are as follows:
- High Visual Fidelity – The environment is rendered in 3D using real-time lighting and shadows, along with much more detailed textures.
- Procedurally Generated Visuals – There are multiple levels of variation in the environment, including the textures, lighting conditions, and object geometry.
- Physics Driven Interactions – The movement of objects within the environment is controlled using a real-time 3-D physics system.
Obstacle Tower has been specifically designed to measure the generalization of RL agents trained using the pixel-to-control approach.
Let’s go through the environment specifications, as described in the paper presented at the AAAI conference.
- Dynamic Episodes – There are close to 100 floors, each consisting of at least two rooms. A room can contain a puzzle to solve, obstacles to evade, and a key to unlock the door. An episode terminates when the agent collides with an enemy or reaches the top of the tower.
- Observation Space – The observation space consists of two types of information: a 168×168 RGB array and a vector of non-visual information.
- Action Space – The environment provides a multi-discrete action space, which means it consists of a set of smaller discrete action spaces whose combination forms a single action. The action space can also be flattened into a single discrete set of actions.
- Reward Function – The environment supports dense and sparse rewards. In the sparse setting, a reward of 1.0 is provided only for completing a floor. In the dense setting, an additional reward of 0.1 is provided for solving puzzles and opening doors.
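To make the action space and reward description above more concrete, here is a minimal sketch in plain Python. It does not use the Obstacle Tower package. The branch sizes (3, 3, 2, 3) and the event names are assumptions chosen for illustration, not values taken from the text above. The sketch shows how a multi-discrete action can be flattened to a single index and how the episode return differs between the sparse and dense settings.

```python
from itertools import product

# Assumed branch sizes, for illustration only:
# forward/backward (3), camera rotation (3), jump (2), sideways movement (3).
ASSUMED_BRANCH_SIZES = (3, 3, 2, 3)


def build_action_table(branch_sizes):
    """Enumerate every combination of branch choices.

    The position of a combination in the returned list is its flat action index.
    """
    return [list(combo) for combo in product(*(range(n) for n in branch_sizes))]


def episode_return(events, dense=False):
    """Sum the rewards of an episode given a list of (hypothetical) event names.

    Completing a floor is always worth 1.0. With dense=True, opening a door
    or solving a puzzle is additionally worth 0.1.
    """
    total = 0.0
    for event in events:
        if event == "floor_completed":
            total += 1.0
        elif dense and event in ("door_opened", "puzzle_solved"):
            total += 0.1
    return total


ACTION_TABLE = build_action_table(ASSUMED_BRANCH_SIZES)
print(len(ACTION_TABLE))  # 54 flat actions for the assumed branch sizes
print(ACTION_TABLE[0])    # [0, 0, 0, 0]

events = ["door_opened", "puzzle_solved", "floor_completed"]
print(episode_return(events, dense=False))
print(episode_return(events, dense=True))
```

A flat action index is convenient for algorithms that expect a single discrete output, at the cost of a larger output layer than one small head per branch.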
Training RL Agent using Obstacle Tower
Let’s train a reinforcement learning agent to play and generalize in the Obstacle Tower environment, using a CNN policy and PPO2 as the optimization algorithm.
The agent was trained for 100,000 timesteps on a MacBook Pro in under 35 minutes using PPO2, which supports GPUs as well. The algorithm can be referred to in more detail here.
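A training run along these lines might look like the sketch below. It assumes the Stable Baselines library (which provides `PPO2` and `CnnPolicy`) and the `obstacle_tower_env` Python wrapper, neither of which is named above, so treat both as assumptions. The `env_path` argument is a hypothetical path to a downloaded Obstacle Tower binary. The imports sit inside the function so the sketch can be read or imported without these dependencies installed.

```python
def train_agent(env_path, total_timesteps=100000):
    """Train PPO2 with a CNN policy on Obstacle Tower (sketch, see assumptions)."""
    # Assumed dependencies: obstacle_tower_env and stable_baselines.
    from obstacle_tower_env import ObstacleTowerEnv
    from stable_baselines import PPO2
    from stable_baselines.common.policies import CnnPolicy
    from stable_baselines.common.vec_env import DummyVecEnv

    # retro=True requests downscaled image observations and a flattened
    # discrete action space; realtime_mode=False runs faster than real time.
    env = DummyVecEnv(
        [lambda: ObstacleTowerEnv(env_path, retro=True, realtime_mode=False)]
    )

    model = PPO2(CnnPolicy, env, verbose=1)
    model.learn(total_timesteps=total_timesteps)

    env.close()
    return model


# Example call (env_path is a placeholder, not a real location):
# model = train_agent("./ObstacleTower/obstacletower")
# model.save("ppo2_obstacle_tower")
```

Training time depends heavily on hardware and on how fast the environment can be stepped, so the 35-minute figure should not be expected to transfer to other machines.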
A reference guide to train the RL agent using Google’s Dopamine framework on GCP can be found here.
Generalization in Vision
We expect agents to have human-like visual capabilities, for instance recognizing that two doors seen under different lighting conditions are both doors. However, this is not the case: the agent performs badly and is unable to generalize well.
Generalization in Control
Agents cannot rely on exploiting the determinism of the training environment, because Obstacle Tower has different room layouts on different floors. As a result, agents perform poorly in test environments and fail to generalize well.
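One simple way to quantify this kind of train/test gap is to compare the average number of floors reached on the seeds used for training with the average on held-out seeds. The sketch below is a generic illustration. The `run_episode` callable, the seed ranges, and the stub scores are all hypothetical stand-ins and are not results from any real agent.

```python
from statistics import mean


def mean_floors(run_episode, seeds):
    """Average floors reached over a collection of environment seeds."""
    return mean(run_episode(seed) for seed in seeds)


def generalization_gap(run_episode, train_seeds, test_seeds):
    """Return (train score, test score, train minus test)."""
    train_score = mean_floors(run_episode, train_seeds)
    test_score = mean_floors(run_episode, test_seeds)
    return train_score, test_score, train_score - test_score


def fake_run_episode(seed):
    """Stub with made-up numbers; replace with a real rollout of the agent."""
    return 5 if seed < 100 else 2


train_score, test_score, gap = generalization_gap(
    fake_run_episode, range(100), range(100, 105)
)
print(train_score, test_score, gap)
```

A gap close to zero suggests the agent has learned something that transfers to unseen layouts, while a large positive gap suggests it has overfit to the training seeds.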
Generalization in Planning
For planning, the agent is expected to generalize to unseen environments, which requires computationally intensive state discovery. Because the episodes are procedurally generated, the layout also differs across the levels of Obstacle Tower, so a plan found for one layout cannot simply be reused on another.
Environments like Obstacle Tower can serve the research community in two ways: as a benchmark for designing more robust RL agents that generalize better in vision, control, and planning, and as a more general, customizable environment for learning agents.
Some good resources to learn more about the Obstacle Tower competition, organized by Unity Technologies: