
A Hands-On Guide on Training RL Agents on Classic Control Theory Problems

Benchmarks have played an important role in many domains of machine learning, such as MNIST (LeCun et al., 1998), Caltech101 (Fei-Fei et al., 2006), CIFAR (Krizhevsky & Hinton, 2009), and ImageNet (Deng et al., 2009).

Reinforcement learning, however, still lacks a comparably standardized testbed, although benchmarks such as OpenAI's Procgen and Unity Technologies' Obstacle Tower are emerging as the state of the art.

In this article, we will get hands-on with training various RL agents on the classic control tasks provided by OpenAI Gym.

These environments are deliberately low-dimensional, which makes them well suited for quick evaluation and comparison of RL algorithms.

Environments

An overview and further details of each environment can be found in the OpenAI Gym documentation; a short snippet for instantiating them follows the list below.

  • Acrobot-v1
    • The system has two joints and two links, where the joint between the two links is actuated. Initially, the links are hanging downwards, and the goal is to swing the end of the lower link up to a given height.
    • First mentioned by – R Sutton, “Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding”, NIPS 1996.
  • CartPole-v1
    • A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
    • First mentioned by – AG Barto, RS Sutton, and CW Anderson, “Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems”, IEEE Transactions on Systems, Man, and Cybernetics, 1983.
  • MountainCar-v0
    • A car is on a one-dimensional track, positioned between two “mountains”. The goal is to drive up the mountain on the right; however, the car’s engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.
    • First mentioned by – A Moore, Efficient Memory-Based Learning for Robot Control, Ph.D. thesis, University of Cambridge, 1990.
  • MountainCarContinuous-v0
    • A car is on a one-dimensional track, positioned between two “mountains”. The goal is to drive up the mountain on the right; however, the car’s engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum. Here, the reward is greater if you spend less energy to reach the goal.
    • First mentioned by – A Moore, Efficient Memory-Based Learning for Robot Control, Ph.D. thesis, University of Cambridge, 1990.
  • Pendulum-v0
    • The inverted pendulum swing-up problem is a classic problem in the control literature. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so it stays upright.
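
As a quick illustration (not part of the original notebook), the snippet below instantiates each of the environments listed above and prints its observation and action spaces. It assumes the classic Gym API used alongside Stable-Baselines, i.e. an older Gym release in which Pendulum-v0 is still registered.

```python
# Minimal sketch: create each classic control environment and inspect its spaces.
# Assumes an older Gym release (pre-0.26 API) where Pendulum-v0 is still registered.
import gym

ENV_IDS = [
    "Acrobot-v1",
    "CartPole-v1",
    "MountainCar-v0",
    "MountainCarContinuous-v0",
    "Pendulum-v0",
]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    print(env_id, env.observation_space, env.action_space)
    env.close()
```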

Algorithms Covered

Each of these algorithms is covered in detail in the linked documentation; the objective of this guide is to implement them quickly and compare their performance across the classic control environments described above.

Implementation

For ease of implementation, we have leveraged the Stable-Baselines package, a fork of OpenAI Baselines that offers a cleaner, unified API.

The following notebook implements the above-mentioned algorithms, training an RL agent for 1 million timesteps in each of the classic control environments.

The agents were then tested for 1,000 timesteps in each environment.
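
As a rough sketch of this train-and-test procedure (not the original notebook), the snippet below uses Stable-Baselines (the TensorFlow 1.x package) with PPO2 as the example algorithm and CartPole-v1 as the example environment; the training budget is passed to learn() as total_timesteps, and the trained agent is then rolled out for 1,000 test steps.

```python
# Minimal sketch of the training/testing loop, assuming Stable-Baselines (TF 1.x)
# and the classic Gym step API. PPO2 and CartPole-v1 are used as examples; the
# same pattern applies to the other algorithms and environments.
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

env = gym.make("CartPole-v1")

# Train the agent for 1 million timesteps.
model = PPO2(MlpPolicy, env, verbose=0)
model.learn(total_timesteps=1_000_000)

# Test the trained agent for 1,000 timesteps and accumulate the reward.
obs = env.reset()
total_reward = 0.0
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    total_reward += reward
    if done:
        obs = env.reset()

print("Total reward over 1,000 test steps:", total_reward)
env.close()
```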

The entire implementation was run on a MacBook Pro, without a GPU.

Comparative Study

This section provides a comparative summary of various algorithms across different environments as provided by OpenAI Gym.

The above figure compares the average training runtime of the various RL algorithms across the different environments.

We can infer that PPO2 takes more than 350 seconds on average to train across the various environments.

The above figure shows the average rewards collected while testing for 1000 steps on various environments using different algorithms.

The average reward earned while testing for 1000 steps is highest in the case of Pendulum-v0 followed by the CartPole-v1 environment.

The above figure shows the average training runtime for each environment across the different algorithms.

We can infer that the MountainCar-v0 environment takes the longest, on average, to train the various agents.
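
As an aside (not from the original notebook), the aggregation behind the three figures above can be sketched with pandas, assuming the per-run results were logged as records of algorithm, environment, training runtime, and test reward. The record values below are zero-valued placeholders and "SomeOtherAlgo" is a hypothetical name; they are not the article's measurements.

```python
# Minimal sketch of how the three comparison figures could be aggregated,
# assuming results were logged as one record per (algorithm, environment) run.
# The values are placeholders, not the article's measurements.
import pandas as pd

records = [
    {"algorithm": "PPO2", "environment": "CartPole-v1", "train_time_s": 0.0, "test_reward": 0.0},
    {"algorithm": "SomeOtherAlgo", "environment": "Acrobot-v1", "train_time_s": 0.0, "test_reward": 0.0},
]
df = pd.DataFrame(records)

# Average training runtime per algorithm (first figure).
print(df.groupby("algorithm")["train_time_s"].mean())

# Average test reward per environment (second figure).
print(df.groupby("environment")["test_reward"].mean())

# Average training runtime per environment (third figure).
print(df.groupby("environment")["train_time_s"].mean())
```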

Conclusion

Based on the insights above, we can conclude that Acrobot-v1 is the most difficult environment for most of the algorithms, as it yields the lowest mean reward across them.

PPO2 takes the longest to train across environments; one possible reason is the absence of a GPU in the hardware used.

Anurag Upadhyaya

Experienced Data Scientist with a demonstrated history of working in Industrial IOT (IIOT), Industry 4.0, Power Systems and Manufacturing domain. I have experience in designing robust solutions for various clients using Machine Learning, Artificial Intelligence, and Deep Learning. I have been instrumental in developing end to end solutions from scratch and deploying them independently at scale.
