DeepMind & Its Parent Company Google Are Betting Big On Reinforcement Learning

Recently, researchers from DeepMind and Google introduced a method, known as offline hyperparameter selection (OHS), for choosing the best policy in offline reinforcement learning. OHS uses only logged data to select among a set of policies trained with different hyperparameters.

Reinforcement learning has become one of the most important techniques in AI and is widely viewed as a key ingredient in the pursuit of artificial general intelligence. Offline reinforcement learning, in turn, has become a fundamental approach for deploying RL techniques in real-world scenarios.

According to this blog post, offline reinforcement learning can help pre-train a reinforcement learning agent using existing data. It also allows RL algorithms to be evaluated empirically on their ability to utilise a fixed dataset of interactions, as well as to deliver real-world impact.

Behind the Offline Hyperparameter Selection

One of the fundamental goals of offline reinforcement learning is to enable the application of RL methods in real-world scenarios where only logged data can be used.

According to the researchers, offline hyperparameter selection (OHS) is closely related to offline policy evaluation (OPE), which focuses on estimating a value function from offline data.

Hyperparameter selection refers to picking the best out of a set of given policies that were trained with different hyperparameters, whereas tuning includes both selection and a strategy for searching the hyperparameter space.

One of the challenges in offline hyperparameter selection is ranking several policies using statistics computed solely from offline data. To apply offline hyperparameter selection in practice, the researchers followed this workflow (a code sketch follows the list):

  • Train offline reinforcement learning policies using several different hyperparameter settings.
  • For each policy, compute scalar statistics summarising the policy’s performance, without any interaction with the environment.
  • Lastly, pick the top policies according to the summary statistics and execute them in the real environment.
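
A minimal sketch of this workflow in Python is shown below. The helper callables (an offline RL trainer and an offline Q-value estimator) and the dataset interface are hypothetical stand-ins for illustration, not the researchers' actual code.

```python
# Illustrative sketch of the offline hyperparameter selection workflow.
# `train_offline_rl`, `estimate_q_offline` and `dataset.initial_states`
# are hypothetical stand-ins, not part of any released codebase.
import numpy as np

def offline_hyperparameter_selection(dataset, hyperparameter_settings,
                                     train_offline_rl, estimate_q_offline,
                                     top_k=3):
    """Rank offline-RL policies using statistics computed from logged data only."""
    candidates = []
    for hparams in hyperparameter_settings:
        # 1. Train a policy offline with this hyperparameter setting.
        policy = train_offline_rl(dataset, hparams)

        # 2. Summarise the policy's quality with a scalar statistic computed
        #    purely from offline data, e.g. the average estimated Q value of
        #    the policy's actions at the initial states in the dataset.
        q_fn = estimate_q_offline(dataset, policy)
        score = float(np.mean([q_fn(s, policy(s)) for s in dataset.initial_states]))
        candidates.append((score, hparams, policy))

    # 3. Keep only the top-k policies to execute in the real environment.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:top_k]
```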

The researchers used evaluation metrics that aim to capture how useful different statistics are for ranking multiple policies and selecting the best. They further identified three crucial choices that can affect how reliable offline hyperparameter selection can be:

  • The Choice of Offline RL Algorithm: The researchers found that algorithms that encourage policies to stay close to the behaviour policy are easier to evaluate and rank.
  • The Choice of Q Estimator: The researchers estimated Q values with an OPE algorithm known as Fitted Q-Evaluation (FQE); a sketch of FQE follows this list.
  • The Choice of Statistics for Summarising the Quality of a Policy: They found that the average critic value at the initial states works better than the alternatives.
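
For context, the sketch below shows one plausible form of Fitted Q-Evaluation in PyTorch: a Q-network is repeatedly regressed towards bootstrapped targets that use the fixed target policy's actions at the next state. The network interface, data format, and hyperparameters are assumptions for illustration, not the exact setup used in the paper.

```python
# Minimal sketch of Fitted Q-Evaluation (FQE) for a fixed target policy.
# Assumes `q_net(state, action)` is a PyTorch critic, `policy(state)` returns
# the target policy's action, and `transitions` yields batches of tensors
# (s, a, r, s_next, done); all hyperparameters below are illustrative.
import copy
import torch
import torch.nn as nn

def fitted_q_evaluation(q_net, policy, transitions, gamma=0.99,
                        iterations=100, target_update_every=10):
    """Estimate Q^pi from offline data by repeated bootstrapped regression."""
    optimiser = torch.optim.Adam(q_net.parameters(), lr=1e-4)
    target_net = copy.deepcopy(q_net)

    for it in range(iterations):
        for s, a, r, s_next, done in transitions:
            with torch.no_grad():
                # Bootstrap with the *target policy's* action at the next
                # state, so the estimate reflects the policy being evaluated.
                a_next = policy(s_next)
                target = r + gamma * (1.0 - done) * target_net(s_next, a_next)

            loss = nn.functional.mse_loss(q_net(s, a), target)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()

        # Periodically refresh the target network to stabilise the regression.
        if it % target_update_every == 0:
            target_net.load_state_dict(q_net.state_dict())

    return q_net
```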

Contributions by the Researchers

The contributions made by the researchers are:

  • Through this research, they presented a thorough empirical study of offline hyperparameter selection for offline reinforcement learning.
  • The researchers used simple and scalable evaluation metrics to assess the merit of different approaches for offline RL hyperparameter selection. 
  • They used challenging domains with high-dimensional action spaces, high-dimensional observation spaces, and long time horizons.
  • The research is focussed on important common hyperparameters, including model architecture, optimiser parameters, and loss function.

Wrapping Up

According to the researchers, the outcome of this research presents an optimistic view that offline hyperparameter selection (OHS) is within reach, even in challenging tasks with pixel observations, high-dimensional action spaces, and long horizons.

Through large-scale empirical evaluation, the researchers showed that:

  • Offline reinforcement learning algorithms are not robust to hyperparameter choices.
  • Factors such as the offline RL algorithm and the method for estimating the Q values can have a large impact on hyperparameter selection.
  • By controlling those factors carefully, one can reliably rank policies across hyperparameter choices, and therefore choose policies which are close to the best policy in the set.
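
To make the idea of ranking concrete, the snippet below scores an offline ranking of policies against ground-truth returns using Spearman rank correlation and a simple top-k regret. These particular metric definitions are illustrative assumptions rather than a claim about the paper's exact evaluation.

```python
# Illustrative scoring of an offline policy ranking against true returns.
# The metric definitions (Spearman correlation, top-k regret) are common
# choices and are used here for illustration only.
from scipy.stats import spearmanr

def ranking_quality(offline_scores, true_returns, k=5):
    """Compare an offline ranking of policies with their true performance."""
    # Rank correlation: do the offline statistics order policies correctly?
    correlation, _ = spearmanr(offline_scores, true_returns)

    # Top-k regret: gap between the truly best policy and the best of the
    # k policies ranked highest by the offline statistics.
    top_k = sorted(range(len(offline_scores)),
                   key=lambda i: offline_scores[i], reverse=True)[:k]
    regret = max(true_returns) - max(true_returns[i] for i in top_k)
    return correlation, regret
```

In this setup, a high rank correlation and a low regret would indicate that the offline statistics are good enough to pick near-optimal policies without online interaction.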

Read the paper here.

Our Upcoming Events

24th Mar, 2023 | Webinar
Women-in-Tech: Are you ready for the Techade

27-28th Apr, 2023 I Bangalore
Data Engineering Summit (DES) 2023

23 Jun, 2023 | Bangalore
MachineCon India 2023 [AI100 Awards]

21 Jul, 2023 | New York
MachineCon USA 2023 [AI100 Awards]

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox
MOST POPULAR

Council Post: Evolution of Data Science: Skillset, Toolset, and Mindset

In my opinion, there will be considerable disorder and disarray in the near future concerning the emerging fields of data and analytics. The proliferation of platforms such as ChatGPT or Bard has generated a lot of buzz. While some users are enthusiastic about the potential benefits of generative AI and its extensive use in business and daily life, others have raised concerns regarding the accuracy, ethics, and related issues.