Benchmarks such as MNIST (LeCun et al., 1998), Caltech101 (Fei-Fei et al., 2006), CIFAR (Krizhevsky & Hinton, 2009), and ImageNet (Deng et al., 2009) have played an important role across many domains of machine learning.
However, there is a lack of a standardized testbed for reinforcement learning algorithms. Benchmarks such as Procgen, released by OpenAI, and Obstacle Tower, released by Unity Technologies, currently serve as the state of the art.
In this article, we take a hands-on approach and train various RL agents on the classic control tasks provided by OpenAI Gym.
These environments are deliberately low-dimensional, allowing quick evaluation and comparison of RL algorithms.
Environments
An overview and full details of each environment can be found in the OpenAI Gym environments documentation.
- Acrobot-v1
- The system has two joints and two links, where the joint between the two links is actuated. Initially, the links are hanging downwards, and the goal is to swing the end of the lower link up to a given height.
- First mentioned by – RS Sutton, “Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding”, NIPS 1996.
- CartPole-v1
- A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
- First mentioned by – AG Barto, RS Sutton, and CW Anderson, “Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems”, IEEE Transactions on Systems, Man, and Cybernetics, 1983.
- MountainCar-v0
- A car is on a one-dimensional track, positioned between two “mountains”. The goal is to drive up the mountain on the right; however, the car’s engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.
- First mentioned by – A Moore, Efficient Memory-Based Learning for Robot Control, Ph.D. thesis, University of Cambridge, 1990.
- MountainCarContinuous-v0
- A car is on a one-dimensional track, positioned between two “mountains”. The goal is to drive up the mountain on the right; however, the car’s engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum. Here, the reward is greater if you spend less energy to reach the goal.
- First mentioned by – A Moore, Efficient Memory-Based Learning for Robot Control, Ph.D. thesis, University of Cambridge, 1990.
- Pendulum-v0
- The inverted pendulum swing-up problem is a classic problem in the control literature. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so it stays upright.
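Since the appeal of these tasks is their low dimensionality, it helps to see their key properties side by side. Below is a minimal stdlib-only summary; the observation sizes, action spaces, and episode limits are transcribed from the Gym registry as a reference, so double-check them against your installed Gym version.

```python
# Key properties of the classic control environments, transcribed from
# the Gym registry (indicative values; verify against your Gym version).
ENV_SPECS = {
    "Acrobot-v1":               {"obs_dim": 6, "actions": "Discrete(3)", "max_steps": 500},
    "CartPole-v1":              {"obs_dim": 4, "actions": "Discrete(2)", "max_steps": 500},
    "MountainCar-v0":           {"obs_dim": 2, "actions": "Discrete(3)", "max_steps": 200},
    "MountainCarContinuous-v0": {"obs_dim": 2, "actions": "Box(1)",      "max_steps": 999},
    "Pendulum-v0":              {"obs_dim": 3, "actions": "Box(1)",      "max_steps": 200},
}

for name, spec in ENV_SPECS.items():
    print(f"{name:26s} obs_dim={spec['obs_dim']}  "
          f"actions={spec['actions']:12s} max_steps={spec['max_steps']}")
```

Note that two of the five environments (MountainCarContinuous-v0 and Pendulum-v0) have continuous (Box) action spaces, which matters when choosing algorithms.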
Algorithms Covered
- A2C – Synchronous, deterministic variant of Asynchronous Advantage Actor-Critic (A3C)
- PPO2 – Proximal Policy Optimization, the successor to TRPO, is a family of first-order methods that use a few tricks to keep new policies close to the old ones. Here is a great video to understand PPO in more detail.
- ACER – Sample Efficient Actor-Critic with Experience Replay combines several ideas from previous algorithms: it uses multiple workers (as in A2C), a replay buffer (as in DQN), Retrace for Q-value estimation, importance sampling, and a trust region.
- ACKTR – Actor-Critic using Kronecker-Factored Trust Region (ACKTR) uses Kronecker-factored approximate curvature (K-FAC) for trust-region optimization.
The algorithms are covered in detail at the links provided. However, the objective of this guide is to quickly implement them and compare them across various environments from the classical control literature.
Implementation
For ease of implementation, we have leveraged the Stable-Baselines package, a community-maintained fork of OpenAI Baselines.
The following notebook implements the above-mentioned algorithms, training an RL agent for 1 million timesteps in each of the classic control environments.
The agents were then tested for 1000 timesteps in each environment.
The entire implementation was done on a MacBook Pro.
Comparative Study
This section provides a comparative summary of various algorithms across different environments as provided by OpenAI Gym.
The above figure shows the average runtime of the various RL algorithms across the environments used to train the agents.
We can infer that PPO2 takes more than 350 seconds on average to train across the various environments.
The above figure shows the average reward collected while testing for 1000 steps in each environment, using the different algorithms.
The average reward earned during the 1000 test steps is highest for the Pendulum-v0 environment, followed by CartPole-v1.
The above figure shows the average runtime per environment, broken down by algorithm.
We can infer that the MountainCar-v0 environment takes the longest, on average, to train the various agents.
Conclusion
Based on the insights above, we can conclude that Acrobot-v1 is difficult for most of the algorithms, as it yields the lowest mean reward across the board.
PPO2 takes the longest to train across environments; one possible reason is the lack of a GPU in the hardware used.