Complete Guide To MBRL: Python Tool For Model-Based Reinforcement Learning

The recent revolution in big data and its technologies has changed how the world looks at the future. As these technologies have matured, many areas of Machine Learning and Artificial Intelligence have boomed. The sheer amount of data now available has opened many doors for creativity and innovation. Artificial Intelligence, in particular, has driven advances in industrial automation and has made it easier to build automated, self-learning systems. One of the most striking approaches to building such self-learning systems is Reinforcement Learning, which has been one of the hottest topics in recent years. But what exactly is Reinforcement Learning? Let’s have a look at it!

Reinforcement learning is a method of training machine learning models to make a sequence of decisions. The learning agent is built to achieve a goal in an uncertain, potentially complex and unknown environment. In Reinforcement Learning, the computer uses trial and error to solve the assigned problem. To steer the machine toward what the programmer wants, the agent receives rewards or penalties for the actions it performs, and its goal is to maximize the total reward earned. Although the programmer sets the reward rules, the model is given no hints or suggestions for solving the problem. It is up to the model to figure out how to perform the task so as to maximize the reward, starting from totally random trials and finishing with sophisticated tactics and newly learned skills. By leveraging the power of search and many trials, reinforcement learning is currently one of the most effective ways to hint at a machine’s creativity.

Reinforcement learning is very different from supervised learning. In supervised learning, the training data contains the answers, so the model is trained on the correct answer itself. In reinforcement learning there is no answer key: the agent decides what to do to perform the given task, and because there is no training dataset, it has to learn from its own experience. Reinforcement Learning suits problems where decision making is sequential and the goal is long-term, such as automated game-playing and robotics. We do not need to pre-program the agent; it learns from its own experience and failures without human intervention. To learn and explore the environment, the agent repeatedly does three things: it takes an action, moves to a new state (or remains in the same one), and receives feedback.
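
The act, change state, get feedback cycle can be sketched in a few lines of plain Python. This is only an illustration: `CorridorEnv`, its reward values and both policies are hypothetical and are not part of any library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right). Walls keep the agent inside the
        # corridor, so the state can also stay the same after an action.
        self.position = min(max(self.position + action, 0), self.length)
        done = self.position == self.length
        reward = 1.0 if done else -0.1  # small penalty per step, reward at the goal
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # take an action
        state, reward, done = env.step(action)  # change state, get feedback
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state):
    return random.choice([-1, 1])


def always_right(state):
    return 1


random.seed(0)
random_return = run_episode(CorridorEnv(), random_policy)
best_return = run_episode(CorridorEnv(), always_right)
```

A learning agent would start out behaving like `random_policy` and, guided only by the rewards, gradually move toward something like `always_right`.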

What is the MBRL Library?

Model-based Reinforcement Learning (MBRL) for continuous control is an area of research that studies machine learning agents which explicitly model their environment by interacting with it. MBRL can learn rapidly from a limited number of trials and lets programmers build specialized domain knowledge about how the environment works into the learning agent. MBRL-Lib is an open-source Python library created to provide shared foundations, tools, abstractions, and evaluations for continuous-action MBRL.

The MBRL library provides:

  • Lightweight and modular software abstractions for MBRL, aimed at programmers and researchers
  • Debugging and visualization tools for MBRL
  • Reimplementations of state-of-the-art MBRL methods that can be easily modified and extended

Software Architecture for MBRL Library

MBRL algorithms involve the interplay of multiple components whose internal operations are often hidden from each other. The library is written so that individual component choices can be altered, for example replacing the trajectory sampling method, without affecting the others. MBRL-Lib follows a “mix-and-match” approach, where new algorithms or variants of existing ones can be written and tested with little code. It keeps the amount of user code small by relying heavily on configuration files and a large set of utilities for common operations, which also makes it easier to accommodate new functionality. The library aims to be as frictionless as possible to use and to perform well in both sample complexity and running time, and it ships well-tuned hyperparameters for the algorithms it provides. Where performance gaps exist, future improvements to the components are meant to reach end users without changes to their code.

(Image Source: Original Research Paper)

Getting Started with Code

In this article, we will explore the functionality of the MBRL-Lib library and create a demo Reinforcement Learning model in which we will:

  • Gather data using an exploration policy.
  • Train the dynamics model using all available data.
  • Run a trial in the environment, choosing actions with the planner, which uses the dynamics model to simulate environment transitions.
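
Before turning to the library, the three steps above can be sketched on a toy problem with NumPy alone. Everything here is an assumption made for illustration: the one-dimensional system in `true_step`, the linear least-squares model, and the random-shooting planner are stand-ins, not what MBRL-Lib does internally.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_step(x, u):
    """Ground-truth dynamics, hidden from the agent (a hypothetical toy system)."""
    return x + 0.1 * u


# 1) Gather data with a random exploration policy.
states = rng.uniform(-2.0, 2.0, size=200)
actions = rng.uniform(-1.0, 1.0, size=200)
next_states = true_step(states, actions)

# 2) Train the dynamics model: least-squares fit of the state change (delta).
features = np.stack([states, actions], axis=1)
weights, *_ = np.linalg.lstsq(features, next_states - states, rcond=None)


def model_step(x, u):
    """Learned dynamics: next state = current state + predicted delta."""
    return x + weights[0] * x + weights[1] * u


# 3) Plan with the model: random shooting, replanning at every step.
def plan(x, horizon=5, num_candidates=256):
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon))
    sim_x = np.full(num_candidates, x)
    returns = np.zeros(num_candidates)
    for t in range(horizon):
        sim_x = model_step(sim_x, candidates[:, t])
        returns += -sim_x ** 2  # reward: stay close to the origin
    return candidates[np.argmax(returns), 0]  # first action of the best sequence


x = 1.5
for _ in range(30):
    x = true_step(x, plan(x))
final_distance = abs(x)
```

The planner never calls `true_step` while evaluating candidates; it only simulates with the learned `model_step`, which is the defining trait of the model-based approach.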

The following code implementation is partially inspired by the official MBRL paper and GitHub repository. You can find MBRL-Lib’s official Git repository through the link here.

Installing the Libraries 

The first step is to install the libraries required to create the model. You can use the following commands to do so:

!pip install omegaconf 
!pip install mbrl
!pip install matplotlib==3.1.3
!pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio===0.9.0 -f

Here we are using omegaconf, which supports merging configurations from multiple sources. MBRL-Lib is built on PyTorch, so we install that as well. With MBRL-Lib, we will use an ensemble of neural networks (NNs) modelling Gaussian distributions and a trajectory optimizer agent that uses the cross-entropy method (CEM). We will also rely on several of the utilities available in the mbrl.util module. Finally, we will wrap the dynamics model in a gym-like environment to plan action sequences.
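
For readers new to CEM, here is a minimal NumPy sketch of the idea: sample candidates from a Gaussian, keep the best-scoring fraction (the elites), and refit the Gaussian to them. The function `cem_maximize` and its defaults are hypothetical and written for illustration; this is not the library's optimizer.

```python
import numpy as np


def cem_maximize(objective, dim, num_iterations=5, population_size=500,
                 elite_ratio=0.1, alpha=0.1, lower=-1.0, upper=1.0, seed=0):
    """Cross-entropy method sketch: maximize `objective` over a box in R^dim."""
    rng = np.random.default_rng(seed)
    mean = np.full(dim, (lower + upper) / 2.0)
    std = np.full(dim, (upper - lower) / 2.0)
    num_elites = max(1, int(population_size * elite_ratio))
    for _ in range(num_iterations):
        # Sample a population from the current Gaussian and clip it to the bounds.
        population = rng.normal(mean, std, size=(population_size, dim))
        population = np.clip(population, lower, upper)
        scores = np.array([objective(x) for x in population])
        # Keep the highest-scoring samples (the elites).
        elites = population[np.argsort(scores)[-num_elites:]]
        # Move the distribution toward the elites; alpha keeps part of the old estimate.
        mean = alpha * mean + (1.0 - alpha) * elites.mean(axis=0)
        std = alpha * std + (1.0 - alpha) * elites.std(axis=0)
    return mean


target = np.array([0.5, -0.25, 0.75])
best = cem_maximize(lambda x: -np.sum((x - target) ** 2), dim=3)
```

In a planning setting, each candidate would be a whole action sequence and the objective would be the return predicted by the dynamics model for that sequence.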

Importing Dependencies

Now that we have installed the necessary components, we can import them.

from IPython import display
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import torch
import omegaconf
import mbrl.env.cartpole_continuous as cartpole_env
import mbrl.env.reward_fns as reward_fns
import mbrl.env.termination_fns as termination_fns
import mbrl.models as models
import mbrl.planning as planning
import mbrl.util.common as common_util
import mbrl.util as util
%load_ext autoreload
%autoreload 2
mpl.rcParams.update({"font.size": 16})
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

Creating the Learning Environment

The ensemble model is trained to predict the environment’s dynamics, and the planner tries to find high-reward trajectories over the model dynamics.

First, we will instantiate the environment and specify the reward function and termination function with the gym environment wrapper and some utility objects. The termination function will tell the wrapper if an observation should cause a learning episode to end or not. A reward function is used to compute the value of the reward given an observation. 
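
To make the two roles concrete, here is a scalar sketch of what such functions might look like for a cart-pole style task. The names `toy_termination_fn` and `toy_reward_fn`, the thresholds, and the observation layout are assumptions for illustration; the library's own functions are understood to work on batched tensors rather than single observations.

```python
import math

# Hypothetical thresholds, chosen only for illustration.
X_LIMIT = 2.4                      # cart position limit
THETA_LIMIT = 12 * math.pi / 180   # pole angle limit (12 degrees, in radians)


def toy_termination_fn(action, next_obs):
    """Return True if the observation should end the episode."""
    x, _x_dot, theta, _theta_dot = next_obs
    return abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT


def toy_reward_fn(action, next_obs):
    """Return 1.0 for every step that keeps the pole up, else 0.0."""
    return 0.0 if toy_termination_fn(action, next_obs) else 1.0


upright = (0.0, 0.0, 0.05, 0.0)   # small pole angle: episode continues
fallen = (0.0, 0.0, 0.5, 0.0)     # large pole angle: episode ends
```

Because both functions depend only on the action and the resulting observation, the planner can evaluate them on states imagined by the dynamics model, without touching the real environment.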

# Seed the environment and the random number generators for reproducibility
seed = 0
env = cartpole_env.CartPoleEnv()
env.seed(seed)
rng = np.random.default_rng(seed=seed)
generator = torch.Generator(device=device)
generator.manual_seed(seed)
obs_shape = env.observation_space.shape
act_shape = env.action_space.shape
# Evaluate the true rewards given an observation 
reward_fn = reward_fns.cartpole
# Know if an observation should make the episode end
term_fn = termination_fns.cartpole

MBRL-Lib uses Hydra to manage its configurations. You can think of the configuration object as a dictionary with key/value pairs and equivalent attributes that specify the model and its algorithmic options. 

trial_length = 200
num_trials = 10
ensemble_size = 5
cfg_dict = {
    # dynamics model configuration
    "dynamics_model": {
        "model": {
            "_target_": "mbrl.models.GaussianMLP",
            "device": device,
            "num_layers": 3,
            "ensemble_size": ensemble_size,
            "hid_size": 200,
            "use_silu": True,
            "in_size": "???",
            "out_size": "???",
            "deterministic": False,
            "propagation_method": "fixed_model"
        }
    },
    # options for training the dynamics model
    "algorithm": {
        "learned_rewards": False,
        "target_is_delta": True,
        "normalize": True,
    },
    # these are experiment specific options
    "overrides": {
        "trial_length": trial_length,
        "num_steps": num_trials * trial_length,
        "model_batch_size": 32,
        "validation_ratio": 0.05
    }
}
cfg = omegaconf.OmegaConf.create(cfg_dict)  # convert the dict into an omegaconf config object

With the following two lines of code, we create the dynamics model, wrapped for one-dimensional transition and reward modelling, and a gym-like environment around it.

# Create a 1-D dynamics model for this environment
dynamics_model = common_util.create_one_dim_tr_model(cfg, obs_shape, act_shape)
# Create a gym-like environment to encapsulate the model
model_env = models.ModelEnv(env, dynamics_model, term_fn, reward_fn, generator=generator)

Next, we create a replay buffer:

replay_buffer = common_util.create_replay_buffer(cfg, obs_shape, act_shape, rng=rng)
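
Conceptually, a replay buffer is a bounded store of past transitions that training can sample from. The class below is a hypothetical, simplified stand-in written for illustration (`SimpleReplayBuffer` and `Transition` are not library names); the library's buffer is more capable.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "obs action next_obs reward done")


class SimpleReplayBuffer:
    """Hypothetical fixed-capacity buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity, seed=0):
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, obs, action, next_obs, reward, done):
        self._storage.append(Transition(obs, action, next_obs, reward, done))

    def sample(self, batch_size):
        """Draw a batch uniformly at random, without replacement."""
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = SimpleReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(obs=step, action=0, next_obs=step + 1, reward=1.0, done=False)
batch = buffer.sample(2)
```

With a capacity of 3 and 5 insertions, only the three most recent transitions remain available for sampling.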

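Let’s now populate the buffer with initial exploration data, using an agent of type planning.RandomAgent to generate the actions:

common_util.rollout_agent_trajectories(
    env, trial_length,  # environment and number of initial exploration steps
    planning.RandomAgent(env), {},  # agent, and keyword arguments to pass to agent.act()
    replay_buffer=replay_buffer, trial_length=trial_length)
print("# samples stored", replay_buffer.num_stored)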

# samples stored 200

Next, we configure the planning agent:

agent_cfg = omegaconf.OmegaConf.create({

    # Creating a class that evaluates trajectories and picks the best one
    "_target_": "mbrl.planning.TrajectoryOptimizerAgent",
    "planning_horizon": 15,
    "replan_freq": 1,
    "verbose": False,
    "action_lb": "???",
    "action_ub": "???",

    # Defining optimizer to generate and choose a trajectory
    "optimizer_cfg": {
        "_target_": "mbrl.planning.CEMOptimizer",
        "num_iterations": 5,
        "elite_ratio": 0.1,
        "population_size": 500,
        "alpha": 0.1,
        "device": device,
        "lower_bound": "???",
        "upper_bound": "???",
        "return_mean_elites": True
    }
})

# Build the agent so that it plans over the learned model environment
agent = planning.create_trajectory_optim_agent_for_model(
    model_env,
    agent_cfg,
    num_particles=20
)

Training The Model 

Now that we have created a model and an agent, all that remains is a simple loop and a few function calls. We will follow PETS (Probabilistic Ensembles with Trajectory Sampling), an algorithm that pairs a probabilistic ensemble dynamics model with a trajectory-sampling planner. The first code block below creates a callback, passed to the model trainer, that accumulates the training losses and validation scores observed. The second block is a utility function that updates the visualization of the agent.

train_losses = []
val_scores = []

def train_callback(_model, _total_calls, _epoch, tr_loss, val_score, _best_val):
    train_losses.append(tr_loss)
    val_scores.append(val_score.mean().item())  # val_score holds one score per ensemble member

# Redraw the environment frame and the reward curve (every 10 steps unless forced)
def update_axes(_axs, _frame, _text, _trial, _steps_trial, _all_rewards, force_update=False):
    if not force_update and (_steps_trial % 10 != 0):
        return
    _axs[0].imshow(_frame)
    _axs[1].clear()
    _axs[1].set_xlim([0, num_trials + .1])
    _axs[1].set_ylim([0, 200])
    _axs[1].set_ylabel("Trial reward")
    _axs[1].plot(_all_rewards, 'bs-')
    _text.set_text(f"Trial {_trial + 1}: {_steps_trial} steps")
    display.display(plt.gcf())
    display.clear_output(wait=True)

Next, we create the model trainer and run the main training loop:

# Create a trainer for the model
model_trainer = models.ModelTrainer(dynamics_model, optim_lr=1e-3, weight_decay=5e-5)
# Create visualization objects
fig, axs = plt.subplots(1, 2, figsize=(14, 3.75), gridspec_kw={"width_ratios": [1, 1]})
ax_text = axs[0].text(300, 50, "")
# Main PETS loop
all_rewards = [0]
for trial in range(num_trials):
    obs = env.reset()    
    agent.reset()
    done = False
    total_reward = 0.0
    steps_trial = 0
    update_axes(axs, env.render(mode="rgb_array"), ax_text, trial, steps_trial, all_rewards)
    while not done:
        # --- Training the model at the start of each trial ---
        if steps_trial == 0:
            dynamics_model.update_normalizer(replay_buffer.get_all())  # update normalizer stats
            dataset_train, dataset_val = common_util.get_basic_buffer_iterators(
                replay_buffer,
                batch_size=cfg.overrides.model_batch_size,
                val_ratio=cfg.overrides.validation_ratio,
                ensemble_size=ensemble_size,
                shuffle_each_epoch=True,
                bootstrap_permutes=False,  # build bootstrap dataset using sampling with replacement
            )
            model_trainer.train(
                dataset_train,
                dataset_val=dataset_val,
                num_epochs=50,
                patience=50,
                callback=train_callback)
        # --- Doing env step using the agent and adding to model dataset ---
        next_obs, reward, done, _ = common_util.step_env_and_add_to_buffer(
            env, obs, agent, {}, replay_buffer)
        update_axes(
            axs, env.render(mode="rgb_array"), ax_text, trial, steps_trial, all_rewards)
        obs = next_obs
        total_reward += reward
        steps_trial += 1
        if steps_trial == trial_length:
            break
    all_rewards.append(total_reward)
update_axes(axs, env.render(mode="rgb_array"), ax_text, trial, steps_trial, all_rewards, force_update=True)

Output :

As we can see, the plot tracks the total reward the agent received in each trial.

Let’s check the training loss and validation score:

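fig, ax = plt.subplots(2, 1, figsize=(12, 10))
ax[0].plot(train_losses)  # losses accumulated by train_callback
ax[0].set_xlabel("Total training epochs")
ax[0].set_ylabel("Training loss (avg. NLL)")
ax[1].plot(val_scores)  # validation scores accumulated by train_callback
ax[1].set_xlabel("Total training epochs")
ax[1].set_ylabel("Validation score (avg. MSE)")
plt.show()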

Output :


In this article, we learned about Reinforcement Learning and how it works. We also explored the Model-Based Reinforcement Learning library, what it offers, and how it can be used to create a Reinforcement Learning model. The implementation above can be found as a Colab notebook, which can be accessed from the link here.

Happy Learning!

