Deep reinforcement learning algorithms are highly sensitive to implementation details, hyper-parameters, choice of environments, and even random seeds. This variability from one execution to the next puts reproducibility at risk.
To expose the underlying weaknesses of RL models, researchers at Google Research presented a set of reliability metrics and used them to demonstrate that the strengths and weaknesses of an algorithm are obscured when we inspect only the mean or median performance. In the next section, we discuss these metrics in brief.
Before getting into the definitions of those new metrics, here are some important terms to know:
- Dispersion: In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. A measure of statistical dispersion is zero if all the data are the same and increases as the data become more diverse.
- Interquartile Range (IQR): The quartiles Q1, Q2, and Q3 are the three cut points that divide an ordered data distribution into 4 equal parts. IQR is the difference between the upper quartile (Q3) and the lower quartile (Q1).
IQR = Q3 – Q1
- CVaR: Conditional Value at Risk or CVaR is derived by taking a weighted average of the “extreme” losses in the tail of the distribution of possible outcomes.
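As a rough illustration of the last two terms, here is a minimal NumPy sketch. The helper names `iqr` and `cvar`, the `alpha` parameter, and the sample scores are assumptions made for this example and are not taken from the researchers' library. CVaR is computed here as a plain (unweighted) mean of the values at or below the α-quantile.

```python
import numpy as np

def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

def cvar(values, alpha=0.05):
    """Mean of the values at or below the alpha-quantile (the lower tail)."""
    values = np.asarray(values, dtype=float)
    threshold = np.quantile(values, alpha)
    return values[values <= threshold].mean()

# Hypothetical performance scores, for illustration only.
scores = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
spread = iqr(scores)                  # width of the middle half of the data
tail_mean = cvar(scores, alpha=0.25)  # average of the worst quarter

print(spread, tail_mean)
```

IQR ignores the tails entirely, while CVaR looks only at the tail, which is why the metrics below use one for dispersion and the other for risk.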
Dispersion across Time (DT): IQR across Time
Dispersion across Time (DT) isolates higher-frequency variability within a training run rather than capturing longer-term trends. To prevent the metric from being influenced by a positive trend in performance, the authors applied detrending. Detrending is a statistical technique that removes the trend from a series so that only the changes in values remain, which also makes potential repeating patterns easier to identify. The final measure of DT is the interquartile range (IQR) within a sliding window along the detrended training curve.
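The DT computation for a single training run can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: detrending is done by first-order differencing, and the function name `dispersion_across_time`, the `window` size, and the synthetic curves are hypothetical.

```python
import numpy as np

def dispersion_across_time(curve, window=5):
    """IQR within a sliding window along the detrended curve of one run."""
    curve = np.asarray(curve, dtype=float)
    # Assumption: detrend by first-order differencing, y'_t = y_t - y_{t-1}.
    detrended = np.diff(curve)
    # Each row is one window position over the detrended curve.
    windows = np.lib.stride_tricks.sliding_window_view(detrended, window)
    q1, q3 = np.percentile(windows, [25, 75], axis=1)
    return q3 - q1  # one IQR value per window position

# Two synthetic runs with the same upward trend; only the second one oscillates.
steady = np.arange(21) * 5.0
noisy = steady + 3.0 * (-1.0) ** np.arange(21)

dt_steady = dispersion_across_time(steady)
dt_noisy = dispersion_across_time(noisy)
print(dt_steady.mean(), dt_noisy.mean())
```

Because the trend is removed first, the steadily improving run scores zero dispersion even though its raw performance changes a lot over training.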
Short-term Risk across Time (SRT): CVaR on Differences
This metric gives the worst-case expected drop in performance during training from one evaluation point to the next. It is obtained by applying CVaR to those point-to-point changes in performance. SRT is calculated as follows:
- Compute the differences between consecutive time points on each training run
- Normalise the differences by the distance between time points to ensure invariance to evaluation frequency
- Obtain the distribution of these differences and find the α-quantile
- Compute the expected value of the distribution below the α-quantile
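The four steps above can be sketched for a single training run as shown below. The function name `short_term_risk`, the `alpha` value, and the sample data are assumptions for illustration; the library's own interface may differ.

```python
import numpy as np

def short_term_risk(timesteps, performance, alpha=0.05):
    """CVaR of the normalised step-to-step changes in performance of one run."""
    timesteps = np.asarray(timesteps, dtype=float)
    performance = np.asarray(performance, dtype=float)
    # Steps 1 and 2: differences between consecutive evaluation points,
    # normalised by the distance between those points.
    diffs = np.diff(performance) / np.diff(timesteps)
    # Step 3: alpha-quantile of the distribution of differences.
    threshold = np.quantile(diffs, alpha)
    # Step 4: expected value of the distribution at or below that quantile.
    return diffs[diffs <= threshold].mean()

# Hypothetical evaluation schedule and scores.
timesteps = [0, 10, 20, 30, 40, 50]
performance = [0.0, 5.0, 3.0, 9.0, 4.0, 12.0]

srt = short_term_risk(timesteps, performance, alpha=0.4)
print(srt)
```

A more negative value means sharper short-term drops between evaluations.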
Long-term Risk across Time (LRT)
This metric monitors performance relative to the highest peak so far and captures unusually large drops that occur over longer timescales. The drop from the highest peak so far is called the drawdown, and for this measure CVaR is applied to the drawdown.
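A minimal sketch of this idea for one training run follows. The sign convention is an assumption: drawdown is defined as the peak so far minus the current performance (so larger is worse), and the risk is the mean of the largest `alpha` fraction of drawdowns. The name `long_term_risk` and the sample curve are hypothetical.

```python
import numpy as np

def long_term_risk(curve, alpha=0.05):
    """Expected drawdown over the worst alpha fraction of evaluation points."""
    curve = np.asarray(curve, dtype=float)
    # Highest performance seen up to and including each evaluation point.
    peak_so_far = np.maximum.accumulate(curve)
    # Assumption: drawdown is non-negative, larger values mean bigger drops.
    drawdown = peak_so_far - curve
    threshold = np.quantile(drawdown, 1.0 - alpha)
    return drawdown[drawdown >= threshold].mean()

# Hypothetical training curve with two large drops.
curve = [1.0, 3.0, 2.0, 5.0, 1.0, 4.0, 6.0, 2.0]

lrt = long_term_risk(curve, alpha=0.25)
print(lrt)
```

Unlike SRT, a slow decline spread over many evaluation points still registers here, because each point is compared with the best performance seen so far rather than with its immediate predecessor.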
Dispersion across Runs (DR)
Dispersion across runs is conventionally measured by taking the variance or standard deviation across training runs at a set of evaluation points. For this metric, low-pass filtering is first performed on the training data to filter out high-frequency variability within runs, and the variance or standard deviation is then replaced with the IQR.
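The procedure can be sketched as below, assuming a simple moving average as a stand-in for the low-pass filter (the filter actually used by the authors may differ). The function name `dispersion_across_runs`, the `smooth` parameter, and the synthetic curves are hypothetical.

```python
import numpy as np

def dispersion_across_runs(curves, smooth=3):
    """IQR across runs at each evaluation point, after smoothing each run."""
    curves = np.asarray(curves, dtype=float)  # shape: (n_runs, n_evals)
    # Assumption: a moving average stands in for the low-pass filter.
    kernel = np.ones(smooth) / smooth
    smoothed = np.array(
        [np.convolve(run, kernel, mode="valid") for run in curves]
    )
    # Quartiles are taken across runs (axis 0) at each evaluation point.
    q1, q3 = np.percentile(smoothed, [25, 75], axis=0)
    return q3 - q1

# Five synthetic runs that plateau at different levels (0, 1, 2, 3, 4).
curves = np.tile(np.arange(5.0)[:, None], (1, 6))

dr = dispersion_across_runs(curves)
print(dr)
```

Smoothing first keeps within-run noise, which DT already measures, from inflating the run-to-run spread.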
Risk across Runs (RR)
CVaR is applied to the final performance of all the training runs. Using this metric gives an idea of the performance of the worst runs.
Dispersion across Fixed-Policy Rollouts (DF)
This metric evaluates a fixed policy by checking how much its performance varies when the same policy is rolled out multiple times. It is computed as the IQR of the rollout performances.
Risk across Fixed-Policy Rollouts (RF)
This metric is similar to Dispersion across Fixed-Policy Rollouts except that CVaR, rather than IQR, is applied to the rollout performances.
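To see how the two fixed-policy metrics differ in practice, here is a small sketch that computes both on the same set of rollout returns. The function name `rollout_reliability`, the `alpha` value, and the returns are made up for illustration.

```python
import numpy as np

def rollout_reliability(returns, alpha=0.05):
    """DF (IQR) and RF (lower-tail CVaR) over rollouts of one fixed policy."""
    returns = np.asarray(returns, dtype=float)
    q1, q3 = np.percentile(returns, [25, 75])
    threshold = np.quantile(returns, alpha)
    return {
        "DF": q3 - q1,                               # spread of typical rollouts
        "RF": returns[returns <= threshold].mean(),  # mean of the worst rollouts
    }

# Hypothetical returns from nine rollouts of the same policy; one rollout fails badly.
returns = [10.0, 12.0, 11.0, 13.0, 2.0, 12.0, 11.0, 14.0, 12.0]

result = rollout_reliability(returns, alpha=0.125)
print(result)
```

In this example the IQR stays small because most rollouts are close together, while the CVaR is pulled down by the single failed rollout, which is the kind of weakness a mean or median alone would hide.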
All these metrics are available as a library named RL Reliability Metrics.