Last updated March 8, 2020

Vision, Control, Planning, and Generalization in RL


Published on March 8, 2020

by Anurag Upadhyaya

In the last two articles, the focus was on measuring the generalization performance of reinforcement learning agents using the Gym Retro and Procgen environments.

Both of these benchmarks are 2-D and limited to a single-player arcade gaming experience. Procgen is procedurally generated, but it still shares the limitations of 2-D games and hardly requires high-level planning, vision, or control.

Recognizing these limitations, various game-based AI environments have been proposed for training RL agents for robust generalization.

Environment Overview – Obstacle Tower

VizDoom is one such prominent framework. While it features a first-person perspective and complex gameplay, the age of the game means that its graphics are relatively primitive. Furthermore, the only randomization is in enemy movement and item spawning, because the level topologies are fixed.

The Obstacle Tower environment was developed by Unity Technologies to benchmark RL agents in a high-fidelity, 3-D, third-person, procedurally generated setting.

The Obstacle Tower is a procedurally generated environment consisting of multiple floors to be solved by a learning agent. It is designed to test the learning agent’s abilities in computer vision, locomotion skills, high-level planning, and generalization. It combines platforming-style gameplay with puzzles and planning problems and, critically, increases in difficulty as the agent progresses.

Within each floor, the goal of the agent is to reach the stairs leading to the next level of the tower. Floors are composed of multiple rooms, each of which can contain its own unique challenges. Furthermore, each floor contains a number of procedurally generated elements, such as visual appearance, puzzle configuration, and floor layout. This ensures that, to succeed at the Obstacle Tower task, an agent must be able to generalize to new and unseen combinations of conditions.
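To illustrate what "new and unseen combinations of conditions" means in practice, here is a toy, seeded floor generator in Python. It is not Obstacle Tower’s actual generator: the theme names, puzzle types, room kinds, and difficulty curve are all hypothetical. It only sketches the idea that a seed deterministically selects a combination of visual appearance, puzzle configuration, and layout, so different seeds yield different floors while any single seed stays reproducible.

```python
import random
from dataclasses import dataclass
from typing import Tuple

# Hypothetical option lists for illustration; these are NOT Obstacle Tower's real assets.
VISUAL_THEMES = ("stone", "industrial", "neon")
PUZZLES = ("none", "block_push", "key_and_door")
ROOM_KINDS = ("plain", "puzzle", "key", "locked_door")


@dataclass(frozen=True)
class FloorConfig:
    theme: str               # visual appearance
    puzzle: str              # puzzle configuration
    layout: Tuple[str, ...]  # ordered room kinds, from start room to exit room


def generate_floor(seed: int, floor: int) -> FloorConfig:
    """Toy procedural generator: the same (seed, floor) always yields the same floor."""
    # Derive one RNG per (seed, floor) pair; assumes floor < 10_000.
    rng = random.Random(seed * 10_000 + floor)
    # Toy difficulty curve: one extra room every 5 floors (up to 4), plus 0 or 1 at random.
    num_middle = min(floor // 5, 4) + rng.randint(0, 1)
    middle = tuple(rng.choice(ROOM_KINDS) for _ in range(num_middle))
    return FloorConfig(
        theme=rng.choice(VISUAL_THEMES),
        puzzle=rng.choice(PUZZLES),
        layout=("start",) + middle + ("exit",),
    )


if __name__ == "__main__":
    # Different seeds produce different combinations for the same floor number.
    for seed in range(3):
        print(seed, generate_floor(seed, floor=0))
```

Because the configuration is a pure function of the seed, an agent can be trained on some seeds and evaluated on others, which is what makes generalization measurable in procedurally generated environments.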

Different floors in Obstacle Tower.

Features

Some of the features offered by Obstacle Tower are as follows:

High Visual Fidelity – The environment is rendered in 3D using real-time lighting and shadows, along with far more detailed textures than older environments such as VizDoom.
Procedurally Generated Visuals – There are multiple levels of variation in the environment, including the textures, lighting conditions, and object geometry.
Physics Driven Interactions – The movement of objects within the environment is controlled using a real-time 3-D physics system.

Environment Specifications

Obstacle Tower has been specifically designed to measure the generalization of RL agents trained using the pixel-to-control approach.
Let’s look at the environment’s specifications, as described in the paper presented at AAAI (see References).

Dynamic Episodes – There are up to 100 floors, each made up of multiple rooms. Rooms can contain puzzles to solve, obstacles to evade, and keys needed to unlock doors. An episode terminates when the agent hits a hazard such as a pit or an enemy, when the time limit runs out, or when it reaches the top floor of the tower.
Observation Space – The observation space consists of two types of information: a 168×168 RGB image, and a vector of auxiliary non-visual information (such as the number of keys held and the time remaining).
Action Space – The environment provides a multi-discrete action space, meaning that each action is made up of several smaller discrete sub-actions chosen at the same time. The action space can also be flattened into a single discrete action space.
Reward Function – The environment supports dense and sparse rewards. A reward of 1.0 is given for completing a floor; in the dense setting, an additional reward of 0.1 is given for solving puzzles and opening doors.
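The action-space and reward specifications above can be made concrete with a small pure-Python sketch. The sub-action sizes (3, 3, 2, 3), standing for forward/backward, camera rotation, jump, and strafe, are an assumption here, so check them against the `action_space` of the environment version you install; with those sizes, flattening yields 3 × 3 × 2 × 3 = 54 single actions. The reward helper likewise assumes that the 1.0 floor reward is paid in both modes and the 0.1 bonus only in the dense mode.

```python
import math
from typing import Tuple

# Assumed sub-action sizes (check against the installed environment's action_space):
# forward/backward/no-op, camera left/right/no-op, jump/no-op, strafe left/right/no-op.
SUB_ACTION_SIZES: Tuple[int, ...] = (3, 3, 2, 3)
NUM_FLAT_ACTIONS = math.prod(SUB_ACTION_SIZES)  # 3 * 3 * 2 * 3 = 54


def to_flat(multi_action: Tuple[int, ...]) -> int:
    """Encode one multi-discrete action as a single index (mixed-radix, row-major)."""
    if len(multi_action) != len(SUB_ACTION_SIZES):
        raise ValueError("expected one value per sub-action")
    index = 0
    for value, size in zip(multi_action, SUB_ACTION_SIZES):
        if not 0 <= value < size:
            raise ValueError(f"sub-action value {value} out of range for size {size}")
        index = index * size + value
    return index


def to_multi(flat_action: int) -> Tuple[int, ...]:
    """Decode a flat index back into its multi-discrete components."""
    if not 0 <= flat_action < NUM_FLAT_ACTIONS:
        raise ValueError(f"flat action {flat_action} out of range")
    values = []
    for size in reversed(SUB_ACTION_SIZES):
        flat_action, value = divmod(flat_action, size)
        values.append(value)
    return tuple(reversed(values))


def episode_return(floors_completed: int, doors_opened: int,
                   puzzles_solved: int, dense: bool) -> float:
    """Total reward under the assumed scheme: 1.0 per floor, plus 0.1 per door/puzzle if dense."""
    total = 1.0 * floors_completed
    if dense:
        total += 0.1 * (doors_opened + puzzles_solved)
    return total


if __name__ == "__main__":
    print(NUM_FLAT_ACTIONS)        # 54
    print(to_flat((2, 2, 1, 2)))   # 53
    print(to_multi(53))            # (2, 2, 1, 2)
    print(episode_return(3, 4, 1, dense=False), episode_return(3, 4, 1, dense=True))  # 3.0 3.5
```

The mixed-radix encoding keeps the mapping bijective without storing a lookup table, which is convenient when a policy network outputs a single softmax over all flat actions.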

Training RL Agent using Obstacle Tower

Let’s train a reinforcement learning agent to play, and generalize across, the Obstacle Tower environment, using a CNN policy with PPO2 as the optimization algorithm.

The agent was trained for 100,000 timesteps on a MacBook Pro in under 35 minutes using PPO2, which also supports GPUs. More details on the algorithm can be found here.
A reference guide for training the RL agent with Google’s Dopamine framework on GCP can be found here.
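For readers who want to see what the PPO2 optimizer actually maximizes, here is a minimal sketch of PPO’s clipped surrogate objective in plain Python. It is a simplified illustration of the technique, not the PPO2 implementation used for the training run above: it omits the value-function loss, the entropy bonus, and advantage estimation, and `clip_eps=0.2` is only a commonly used default.

```python
import math
from typing import Sequence


def ppo_clip_objective(new_log_probs: Sequence[float],
                       old_log_probs: Sequence[float],
                       advantages: Sequence[float],
                       clip_eps: float = 0.2) -> float:
    """Mean of PPO's clipped surrogate objective over a batch (to be maximized).

    For each sample: r = pi_new(a|s) / pi_old(a|s), and the term is
    min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    """
    if not (len(new_log_probs) == len(old_log_probs) == len(advantages)):
        raise ValueError("all inputs must have the same length")
    if len(advantages) == 0:
        raise ValueError("batch must not be empty")
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)                          # probability ratio r
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)  # clip(r, 1-eps, 1+eps)
        # The min makes this a pessimistic bound: pushing r outside
        # [1 - eps, 1 + eps] never increases the objective.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    old = [math.log(0.5)] * 2
    new = [math.log(1.0)] * 2  # ratio = 2 for both samples
    # The positive advantage is clipped at 1.2; the negative one keeps the full -2.0.
    print(ppo_clip_objective(new, old, [1.0, -1.0]))  # about -0.4
```

The clipping is what lets PPO take several gradient steps on the same batch of experience without the new policy drifting too far from the one that collected it.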

Challenges

Generalization in Vision

We would like agents to have human-like visual capabilities, for instance, to recognize a door as a door even when it is rendered under different lighting conditions. In practice, this is not the case: the agent performs badly on unseen visual variations and fails to generalize well.

Generalization in Control

Agents trained in a deterministic environment tend to exploit that determinism, memorizing action sequences instead of learning robust control. Because Obstacle Tower has different room layouts on different floors, such memorized behavior does not transfer, and the agents perform poorly in test environments, failing to generalize well.
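One way to make "performs poorly in test environments" measurable is to evaluate the same agent on its training seeds and on held-out seeds it never saw, and report the difference as a generalization gap. The sketch below assumes a hypothetical `evaluate(seed)` callback that returns a score such as the number of floors reached; the seed counts and the memorizing agent in the usage example are illustrative only.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


def split_seeds(num_seeds: int, num_test: int, rng_seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split seeds 0..num_seeds-1 into disjoint training and held-out test sets."""
    if not 0 < num_test < num_seeds:
        raise ValueError("need at least one training seed and one test seed")
    seeds = list(range(num_seeds))
    random.Random(rng_seed).shuffle(seeds)
    return sorted(seeds[num_test:]), sorted(seeds[:num_test])


def generalization_gap(evaluate: Callable[[int], float],
                       train_seeds: Sequence[int],
                       test_seeds: Sequence[int]) -> Dict[str, float]:
    """Mean score on training seeds, on held-out seeds, and their difference."""
    if set(train_seeds) & set(test_seeds):
        raise ValueError("train and test seeds must be disjoint")
    train_score = mean(evaluate(s) for s in train_seeds)
    test_score = mean(evaluate(s) for s in test_seeds)
    return {"train": train_score, "test": test_score, "gap": train_score - test_score}


if __name__ == "__main__":
    train, test = split_seeds(num_seeds=100, num_test=20)
    memorized = set(train)

    # Hypothetical agent that only does well on layouts it has already seen.
    def memorizing_agent(seed: int) -> float:
        return 5.0 if seed in memorized else 1.0

    print(generalization_gap(memorizing_agent, train, test))
    # prints {'train': 5.0, 'test': 1.0, 'gap': 4.0}
```

A large positive gap is the signature of the memorization described above; an agent that truly generalizes should score similarly on both sets.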

Generalization in Planning

For planning, the agent is expected to generalize to unseen environments, which requires computationally intensive exploration to discover new states. Because episodes are procedurally generated, the layout is not the same across the levels of Obstacle Tower, so the agent cannot simply reuse a memorized plan.

Environments like Obstacle Tower can help the research community not only design more robust RL agents that generalize better in vision, control, and planning, but also serve as general, customizable environments for training learning agents.

References

Some good resources for learning more about the Obstacle Tower Challenge organized by Unity Technologies:

Competing in Obstacle Tower Challenge – Winner’s Blog

Obstacle Tower – AAAI Workshop Paper


Anurag Upadhyaya

Experienced Data Scientist with a demonstrated history of working in Industrial IoT (IIoT), Industry 4.0, Power Systems and Manufacturing domain. I have experience in designing robust solutions for various clients using Machine Learning, Artificial Intelligence, and Deep Learning. I have been instrumental in developing end-to-end solutions from scratch and deploying them independently at scale.