Recently, researchers from DeepMind and Google introduced methods for offline hyperparameter selection (OHS) in offline reinforcement learning (ORL). OHS means choosing the best policy from a set of many policies trained with different hyperparameters, using only logged data.
Reinforcement learning has become one of the most prominent techniques in AI and is often discussed as a possible route towards Artificial General Intelligence. Offline reinforcement learning is increasingly seen as a fundamental approach for deploying RL techniques in real-world scenarios.
According to this blog post, offline reinforcement learning can help pre-train a reinforcement learning agent using existing data. It also makes it possible to evaluate RL algorithms empirically on their ability to exploit a fixed dataset of interactions, and to deliver real-world impact.
Behind the Offline Hyperparameter Selection
Enabling the application of reinforcement learning methods in real-world scenarios where only logged data can be used is one of the fundamental goals of offline reinforcement learning.
According to the researchers, offline hyperparameter selection (OHS) is closely related to offline policy evaluation (OPE), which focuses on estimating a policy's value function from offline data.
Hyperparameter selection refers to picking the best out of a set of given policies that were trained with different hyperparameters, whereas tuning includes both selection and a strategy for searching the hyperparameter space.
One of the main challenges in offline hyperparameter selection is ranking several policies using statistics computed solely from offline data. To apply offline hyperparameter selection in practice, the researchers proposed the following workflow:
- Train offline reinforcement learning policies using several different hyperparameter settings.
- For each policy, compute scalar statistics summarising the policy’s performance, without any interaction with the environment.
- Lastly, pick the top k policies according to the summary statistics to execute in the real environment.
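The last two steps of this workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `select_top_k`, the `summary_statistic` callable and the toy policy names and scores are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List, Tuple


def select_top_k(
    policies: Dict[str, object],
    summary_statistic: Callable[[object], float],
    k: int = 1,
) -> List[Tuple[str, float]]:
    """Rank policies by an offline summary statistic and return the best k."""
    # Score every policy without touching the environment.
    scored = [(name, summary_statistic(policy)) for name, policy in policies.items()]
    # Higher statistic is assumed to mean a better policy.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical stand-ins: each "policy" is just its precomputed offline score.
policies = {"lr=1e-3": 0.62, "lr=1e-4": 0.81, "lr=1e-5": 0.47}
top = select_top_k(policies, summary_statistic=lambda p: p, k=2)
print(top)
```

In a real setting, `summary_statistic` would be computed from the logged dataset, and only the returned policies would be run in the environment.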
The researchers used evaluation metrics that aim to capture how useful different statistics are for ranking multiple policies and selecting the best. They further identified three crucial choices that can affect how effective offline hyperparameter selection can be:
- The Choice of Offline RL Algorithm: The researchers found that algorithms that encourage policies to stay close to the behaviour policy are easier to evaluate and rank.
- The Choice of Q Estimator: The researchers found that Q values estimated by the OPE algorithm they used, Fitted Q-Evaluation, are more accurate than those produced by the offline RL algorithms themselves.
- The Choice of Statistics for Summarising the Quality of a Policy: They found that the average critic value of the initial states works better than the alternatives.
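To make the last two choices concrete, here is a minimal tabular sketch of Fitted Q-Evaluation followed by the initial-state statistic. It is not the researchers' implementation, which uses neural-network critics on high-dimensional tasks. The function names, the `(state, action, reward, next_state, done)` transition format and the toy two-state dataset are all assumptions made for illustration.

```python
import numpy as np


def fitted_q_evaluation(transitions, policy, n_states, n_actions,
                        gamma=0.9, n_iters=50):
    """Tabular FQE: repeatedly regress Q(s, a) onto r + gamma * Q(s', policy(s'))."""
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        target_sum = np.zeros_like(q)
        counts = np.zeros_like(q)
        for s, a, r, s_next, done in transitions:
            # Bootstrap with the action the evaluated policy would take next.
            bootstrap = 0.0 if done else gamma * q[s_next, policy[s_next]]
            target_sum[s, a] += r + bootstrap
            counts[s, a] += 1
        # In the tabular case the least-squares fit is the mean target
        # per (state, action); pairs absent from the data keep their value.
        seen = counts > 0
        new_q = q.copy()
        new_q[seen] = target_sum[seen] / counts[seen]
        q = new_q
    return q


def initial_state_value(q, policy, initial_states):
    """Average critic value of the policy's action at the initial states."""
    return float(np.mean([q[s, policy[s]] for s in initial_states]))


# Hypothetical logged data: (state, action, reward, next_state, done).
transitions = [
    (0, 0, 0.0, 1, False),  # state 0, action 0 leads to state 1
    (0, 1, 0.0, 0, True),   # state 0, action 1 ends the episode with no reward
    (1, 0, 1.0, 1, True),   # state 1, action 0 ends the episode with reward 1
]
good_policy = [0, 0]  # action index chosen in each state
bad_policy = [1, 0]

q_good = fitted_q_evaluation(transitions, good_policy, n_states=2, n_actions=2)
q_bad = fitted_q_evaluation(transitions, bad_policy, n_states=2, n_actions=2)
score_good = initial_state_value(q_good, good_policy, initial_states=[0])
score_bad = initial_state_value(q_bad, bad_policy, initial_states=[0])
print(score_good, score_bad)
```

The resulting scalar per policy is the kind of summary statistic that can be used to rank policies trained with different hyperparameters.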
Contributions by the Researchers
The contributions made by the researchers are:
- Through this research, they presented a thorough empirical study of offline hyperparameter selection for offline reinforcement learning.
- The researchers used simple and scalable evaluation metrics to assess the merit of different approaches for offline RL hyperparameter selection.
- They used challenging domains with high-dimensional action spaces, high-dimensional observation spaces, and long time horizons.
- The research is focussed on important common hyperparameters, which include model architecture, optimiser parameters, and loss function.
According to the researchers, the results present an optimistic view that offline hyperparameter selection (OHS) is within reach, even in challenging tasks with pixel observations, high-dimensional action spaces, and long horizons.
Through large-scale empirical evaluation, the researchers showed that:
- Offline reinforcement learning algorithms are not robust to hyperparameter choices.
- Factors such as the offline RL algorithm and the method for estimating the Q values can have a high impact on hyperparameter selection.
- When those factors are controlled carefully, one can reliably rank policies across hyperparameter choices, and therefore choose policies which are close to the best policy in the set.
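How well a ranking works, and how close the chosen policy is to the best one, can be quantified when true returns are available, for example in a simulated benchmark. The sketch below uses Spearman rank correlation and regret among the top k picks. These are two common choices, assumed here for illustration, since the article does not name the exact metrics. The numbers are made up, and the rank computation assumes no ties.

```python
import numpy as np


def spearman_rank_correlation(estimates, true_returns):
    """Correlation between the ranks of offline estimates and true returns (no ties)."""
    est_ranks = np.argsort(np.argsort(estimates))
    true_ranks = np.argsort(np.argsort(true_returns))
    return float(np.corrcoef(est_ranks, true_ranks)[0, 1])


def regret_at_k(estimates, true_returns, k=1):
    """Gap between the best policy overall and the best of the top-k picks."""
    estimates = np.asarray(estimates)
    true_returns = np.asarray(true_returns)
    # Indices of the k policies with the highest offline estimates.
    top_k = np.argsort(estimates)[::-1][:k]
    return float(true_returns.max() - true_returns[top_k].max())


# Hypothetical numbers for four policies.
offline_estimates = [0.9, 0.4, 0.7, 0.1]
true_returns = [80.0, 50.0, 95.0, 20.0]

rho = spearman_rank_correlation(offline_estimates, true_returns)
regret_1 = regret_at_k(offline_estimates, true_returns, k=1)
regret_2 = regret_at_k(offline_estimates, true_returns, k=2)
print(rho, regret_1, regret_2)
```

A rank correlation near 1 and a regret near 0 would indicate that the offline statistic picks policies close to the best in the set.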
Read the paper here.