Benchmarks such as MNIST (LeCun et al., 1998), Caltech101 (Fei-Fei et al., 2006), CIFAR (Krizhevsky & Hinton, 2009), and ImageNet (Deng et al., 2009) have played an important role across many domains of machine learning.
However, there is a lack of a standardized testbed for reinforcement learning algorithms. Benchmarks such as Procgen, released by OpenAI, and Obstacle Tower, released by Unity Technologies, currently serve as the state of the art.
In this article, we take a hands-on approach and train various RL agents on the classic control tasks provided by OpenAI Gym.
These environments are deliberately low-dimensional, allowing quick evaluation and comparison of RL algorithms.
Environments
An overview and full details of each environment can be found in the OpenAI Gym environments documentation.
- Acrobot-v1
- The system has two joints and two links, where the joint between the two links is actuated. Initially, the links are hanging downwards, and the goal is to swing the end of the lower link up to a given height.
- First mentioned by – RS Sutton, “Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding”, NIPS 1996.
- CartPole-v1
- A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
- First mentioned by – AG Barto, RS Sutton, and CW Anderson, “Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems”, IEEE Transactions on Systems, Man, and Cybernetics, 1983.
- MountainCar-v0
- A car is on a one-dimensional track, positioned between two “mountains”. The goal is to drive up the mountain on the right; however, the car’s engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.
- First mentioned by – A Moore, Efficient Memory-Based Learning for Robot Control, Ph.D. thesis, University of Cambridge, 1990.
- MountainCarContinuous-v0
- A car is on a one-dimensional track, positioned between two “mountains”. The goal is to drive up the mountain on the right; however, the car’s engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum. Here, the reward is greater if you spend less energy to reach the goal.
- First mentioned by – A Moore, Efficient Memory-Based Learning for Robot Control, Ph.D. thesis, University of Cambridge, 1990.
- Pendulum-v0
- The inverted pendulum swing-up problem is a classic problem in the control literature. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so it stays upright.
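Since the appeal of these tasks is their low dimensionality, it helps to see their key properties side by side. Below is a minimal stdlib-only summary; the observation sizes, action spaces, and episode limits are transcribed from the Gym registry as a reference, so double-check them against your installed Gym version.

```python
# Key properties of the classic control environments, transcribed from
# the Gym registry (indicative values; verify against your Gym version).
ENV_SPECS = {
    "Acrobot-v1":               {"obs_dim": 6, "actions": "Discrete(3)", "max_steps": 500},
    "CartPole-v1":              {"obs_dim": 4, "actions": "Discrete(2)", "max_steps": 500},
    "MountainCar-v0":           {"obs_dim": 2, "actions": "Discrete(3)", "max_steps": 200},
    "MountainCarContinuous-v0": {"obs_dim": 2, "actions": "Box(1)",      "max_steps": 999},
    "Pendulum-v0":              {"obs_dim": 3, "actions": "Box(1)",      "max_steps": 200},
}

for name, spec in ENV_SPECS.items():
    print(f"{name:26s} obs_dim={spec['obs_dim']}  "
          f"actions={spec['actions']:12s} max_steps={spec['max_steps']}")
```

Note that two of the five environments (MountainCarContinuous-v0 and Pendulum-v0) have continuous (Box) action spaces, which matters when choosing algorithms.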
Algorithms Covered
- A2C – Synchronous, deterministic variant of Asynchronous Advantage Actor-Critic (A3C)
- PPO2 – Proximal Policy Optimization, the successor to TRPO, is a family of first-order methods that use a few tricks to keep new policies close to the old ones. Here is a great video to understand PPO in more detail.
- ACER – Sample Efficient Actor-Critic with Experience Replay combines several ideas from previous algorithms: it uses multiple workers (as in A2C), a replay buffer (as in DQN), Retrace for Q-value estimation, importance sampling, and a trust region.
- ACKTR – Actor-Critic using Kronecker-Factored Trust Region (ACKTR) uses Kronecker-factored approximate curvature (K-FAC) for trust-region optimization.
The algorithms are covered in detail at the links provided. However, the objective of this guide is to quickly implement them and compare them across various environments from the classical control literature.
Implementation
For ease of implementation, we have leveraged the Stable-Baselines package, a community-maintained fork of OpenAI Baselines.
The following notebook implements the above-mentioned algorithms, training an RL agent for 1 million timesteps in each of the classic control environments.
The agents were then tested for 1000 timesteps in each environment.
The entire implementation was done on a MacBook Pro.
Comparative Study
This section provides a comparative summary of various algorithms across different environments as provided by OpenAI Gym.
The above figure shows the average runtime of the various RL algorithms across the environments used to train the agents.
We can infer that PPO2 takes more than 350 seconds on average to train across the various environments.
The above figure shows the average reward collected while testing for 1000 steps in each environment, using the different algorithms.
The average reward earned during the 1000 test steps is highest for the Pendulum-v0 environment, followed by CartPole-v1.
The above figure shows the average runtime per environment, broken down by algorithm.
We can infer that the MountainCar-v0 environment takes the longest, on average, to train the various agents.
Conclusion
Based on the insights above, we can conclude that Acrobot-v1 is difficult for most of the algorithms, as it yields the lowest mean reward across the board.
PPO2 takes the longest to train across environments; one possible reason is the lack of a GPU in the hardware used.