Google’s New Algorithm Increases Deployment Efficiency With Low Costs In RL Algorithms

Recently, researchers from Google Research teamed up with the University of Tokyo to introduce deployment efficiency, a new measure of reinforcement learning performance, along with a model-based algorithm known as Behavior-Regularized Model-ENsemble (BREMEN). The algorithm is said to optimise an effective policy offline using far less data.

Reinforcement learning (RL) is one of the most widely used techniques across a number of domains, including robotics, operations research, medicine, autonomous driving and more. The technique has recently seen impressive success in learning behaviours for a range of sequential decision-making tasks.

Behind the Model

According to the researchers, most reinforcement learning algorithms assume online access to the environment, so that updates to the policy can be interleaved with experience collection using that policy. In many real-world settings, however, the cost or potential risk of deploying a new data-collection policy is high, and it can become prohibitive to update the data-collection policy more than a few times during learning.

The researchers stated that if a task can be learned with a small number of data-collection policies, then these costs and risks can be substantially reduced. This motivated a novel measure of RL algorithm performance, known as deployment efficiency, which counts how many times the data-collection policy is changed during learning, from the initial random policy to one that solves the task.
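To make the idea concrete, here is a minimal sketch (not taken from the paper) of a deployment-constrained training loop; the helper functions `collect_trajectories` and `offline_policy_update`, as well as the budget values, are hypothetical placeholders.

```python
# Hypothetical sketch of a deployment-constrained RL loop.
# Deployment efficiency = number of distinct data-collection policies deployed.

def train_with_limited_deployments(env, policy, num_deployments=10,
                                   samples_per_deployment=200_000,
                                   offline_updates=1_000):
    dataset = []
    for deployment in range(num_deployments):   # each iteration = one deployment
        # 1) Deploy the current policy once and gather a batch of real experience.
        dataset += collect_trajectories(env, policy, samples_per_deployment)

        # 2) Improve the policy purely offline, without touching the environment.
        for _ in range(offline_updates):
            policy = offline_policy_update(policy, dataset)

    # The data-collection policy was changed only `num_deployments` times.
    return policy
```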

To develop an algorithm that is both sample-efficient and deployment-efficient, each iteration of the algorithm between successive deployments has to work effectively on much smaller dataset sizes. With smaller datasets, however, the learned model cannot generalise beyond the data it has seen, which results in extrapolation errors and poorer performance.

To address these problems arising in limited-deployment settings, the researchers proposed Behavior-Regularized Model-ENsemble (BREMEN). BREMEN learns an ensemble of dynamics models in conjunction with a policy trained on imaginary rollouts, while implicitly regularising the learned policy via appropriate parameter initialisation and conservative trust-region learning updates.

BREMEN incorporates Dyna-style model-based RL: it learns an ensemble of dynamics models in combination with a policy trained on imaginary rollouts from the ensemble, together with behaviour regularisation via conservative trust-region updates.
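As a rough illustration of this recipe, the sketch below strings these ingredients together for a single between-deployment iteration; the helper functions (`fit_dynamics_model`, `behavioural_cloning`, `rollout`, `trpo_update`) and hyperparameters are assumed placeholders rather than the authors' implementation.

```python
import random

# Hypothetical sketch of one BREMEN iteration between deployments.
def bremen_offline_iteration(dataset, num_models=5, num_policy_updates=25):
    # 1) Fit an ensemble of dynamics models on the static dataset (Dyna-style).
    ensemble = [fit_dynamics_model(dataset, seed=k) for k in range(num_models)]

    # 2) Implicit regularisation: initialise the policy with behavioural cloning,
    #    so it starts close to the behaviour that generated the data.
    policy = behavioural_cloning(dataset)

    # 3) Improve the policy with conservative trust-region (TRPO-style) updates
    #    on imaginary rollouts generated by randomly chosen ensemble members.
    for _ in range(num_policy_updates):
        model = random.choice(ensemble)
        imaginary_rollouts = rollout(model, policy, start_states=dataset)
        policy = trpo_update(policy, imaginary_rollouts, max_kl=0.01)

    return policy  # deployed as the next data-collection policy
```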

For the baseline methods, the researchers used open-source implementations of Soft Actor-Critic (SAC), behavioural cloning (BC), Batch-Constrained Q-learning (BCQ), and Behaviour Regularised Actor-Critic (BRAC). They used Adam as the optimiser, an algorithm for first-order, gradient-based optimisation of stochastic objective functions.
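For readers unfamiliar with Adam, the snippet below is a minimal, generic PyTorch example of a single Adam update on a small policy network; it is illustrative only and not the baselines' actual training code.

```python
import torch

# Generic example of one Adam update on a small policy network (not the paper's code).
policy_net = torch.nn.Sequential(
    torch.nn.Linear(17, 64), torch.nn.Tanh(), torch.nn.Linear(64, 6)
)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=3e-4)

observations = torch.randn(256, 17)    # dummy batch of states
target_actions = torch.randn(256, 6)   # dummy supervision signal

loss = torch.nn.functional.mse_loss(policy_net(observations), target_actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()                       # first-order, gradient-based update
```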

Evaluating BREMEN

The researchers evaluated BREMEN on high-dimensional continuous control benchmarks and found that it achieves impressive deployment efficiency. The model-based algorithm is able to learn successful policies with only 5-10 deployments, while significantly surpassing existing off-policy and offline reinforcement learning algorithms in the deployment-constrained setting.

The researchers further evaluated BREMEN on standard offline reinforcement learning benchmarks, where only a single static dataset is used. Here they found that BREMEN not only achieves performance competitive with the state of the art on standard dataset sizes, but can also learn from datasets 10-20 times smaller, which previous methods are unable to do.

Wrapping Up

In this work, the researchers introduced deployment efficiency, which is a novel measure for reinforcement learning performance that counts the number of changes in the data-collection policy during learning. To enhance the deployment efficiency, they proposed Behavior-Regularised Model-ENsemble (BREMEN), which is a model-based offline algorithm with implicit KL regularisation via appropriate policy initialisation and trust-region updates.

Read the paper here.

Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.