
Top Evaluation Metrics For Reinforcement Learning

Deep reinforcement learning algorithms are highly sensitive to implementation details, hyperparameters, the choice of environments, and even random seeds. This variability puts reproducibility at stake.

To expose the underlying weaknesses of RL models, researchers at Google Research proposed a set of reliability metrics and used them to demonstrate that the strengths and weaknesses of an algorithm are obscured when only mean or median performance is inspected. The next sections discuss these metrics in brief.

Before getting into the definitions of these metrics, here are some important terms to know:

• Dispersion: In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched. A measure of dispersion is zero if all the data are identical and increases as the data become more diverse.
• Interquartile Range (IQR): A distribution is divided into four equal parts by three quartile points, denoted Q1, Q2, and Q3. The IQR is the difference between the upper and lower quartiles:

IQR = Q3 – Q1

• CVaR: Conditional Value at Risk or CVaR is derived by taking a weighted average of the “extreme” losses in the tail of the distribution of possible outcomes.
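As a concrete reference, both statistics can be computed in a few lines of NumPy. The function names here are illustrative helpers, not part of any library:

```python
import numpy as np

def iqr(values):
    """Interquartile range: Q3 - Q1, the spread of the middle 50%."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

def cvar(values, alpha=0.05):
    """Conditional Value at Risk: the mean of the values at or below
    the alpha-quantile, i.e. the expected value of the worst-case tail."""
    values = np.asarray(values, dtype=float)
    threshold = np.quantile(values, alpha)
    return values[values <= threshold].mean()

scores = [1.0, 2.0, 3.0, 4.0, 100.0]
print(iqr(scores))         # → 2.0 (the outlier barely affects the IQR)
print(cvar(scores, 0.25))  # → 1.5 (mean of the worst 25% of scores)
```

Note how the IQR is robust to the outlier 100.0, which is exactly why the authors favour it over the variance or standard deviation.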

Dispersion across Time (DT): IQR across Time

Dispersion across Time (DT) isolates higher-frequency variability rather than longer-term trends. To prevent the metric from being influenced by an overall positive trend in the training curve, the authors first apply detrending: a statistical technique that removes the long-term trend from a series so that only the short-term changes in value remain. The final DT measure is the interquartile range (IQR) within a sliding window along the detrended training curve.

Short-term Risk across Time (SRT): CVaR on Differences

This metric captures the worst-case expected drop in performance during training, from one evaluation point to the next. To do this, CVaR is applied to the changes in performance between consecutive evaluation points. SRT is calculated as follows:

• Compute the differences in performance between consecutive time points on each training run
• Normalise the differences by the distance between time points, to ensure invariance to the evaluation frequency
• Obtain the distribution of these differences and find the α-quantile
• Compute the expected value of the distribution below the α-quantile
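The steps above can be sketched as follows (an illustrative function, not the library's API; the evaluation interval is assumed constant here):

```python
import numpy as np

def short_term_risk(curve, eval_interval=1.0, alpha=0.05):
    """Sketch of SRT: CVaR of the per-interval changes in performance.
    A more negative result means larger worst-case drops between
    consecutive evaluation points."""
    # Differences between consecutive evaluations, normalised by the
    # evaluation interval so the metric is invariant to eval frequency.
    diffs = np.diff(np.asarray(curve, dtype=float)) / eval_interval
    # Expected value of the distribution below the alpha-quantile.
    threshold = np.quantile(diffs, alpha)
    return diffs[diffs <= threshold].mean()
```

For example, a curve that mostly improves but suffers one sharp drop will have an SRT dominated by that drop.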

Long-term Risk across Time (LRT)

This metric monitors performance relative to the highest peak reached so far, and captures unusually large drops that occur over longer timescales. The drop from the running peak is called the drawdown; for this measure, CVaR is applied to the drawdown.
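A hedged sketch of LRT, assuming the drawdown at each point is the drop from the running peak and the worst cases are the largest drawdowns (so the upper tail of the drawdown distribution is averaged):

```python
import numpy as np

def long_term_risk(curve, alpha=0.05):
    """Sketch of LRT: CVaR of the drawdown, i.e. the expected value of
    the largest drops from the highest peak reached so far."""
    curve = np.asarray(curve, dtype=float)
    running_peak = np.maximum.accumulate(curve)
    drawdown = running_peak - curve  # >= 0; larger means a worse drop
    # Worst cases are the LARGEST drawdowns, so average the upper tail.
    threshold = np.quantile(drawdown, 1 - alpha)
    return drawdown[drawdown >= threshold].mean()
```

A monotonically improving run never falls below its peak, so its drawdown, and hence its LRT, is zero.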

Dispersion across Runs (DR)

This metric is measured across training runs at a set of evaluation points. First, low-pass filtering is applied within each run to remove the high-frequency variability that DT already captures. Then, instead of the variance or standard deviation across runs, the more robust IQR is taken at each evaluation point.
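A rough sketch of DR, using a simple moving average as a stand-in for the low-pass filter (the authors' exact filter may differ):

```python
import numpy as np

def dispersion_across_runs(curves, eval_points, window=3):
    """Sketch of DR: smooth each run with a moving average (a crude
    low-pass filter), then take the IQR across runs at each of the
    given evaluation points."""
    kernel = np.ones(window) / window
    smoothed = np.array(
        [np.convolve(c, kernel, mode="same") for c in curves]
    )
    # IQR across runs (axis 0) at each chosen evaluation point.
    q1, q3 = np.percentile(smoothed[:, eval_points], [25, 75], axis=0)
    return q3 - q1  # one IQR per evaluation point
```

If all runs produce identical curves, the DR is zero at every evaluation point; disagreement between runs drives it up.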

Risk across Runs (RR)

CVaR is applied to the final performance of all the training runs. This metric gives an idea of how the worst runs perform.
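In code, RR reduces to applying CVaR over the final scores of a set of runs (an illustrative sketch, not the library's API):

```python
import numpy as np

def risk_across_runs(final_scores, alpha=0.05):
    """Sketch of RR: CVaR of the final performance across runs,
    i.e. the mean performance of the worst fraction alpha of runs."""
    final_scores = np.asarray(final_scores, dtype=float)
    threshold = np.quantile(final_scores, alpha)
    return final_scores[final_scores <= threshold].mean()
```

With final scores of 10, 20, 30, and 40 and alpha = 0.25, RR averages only the single worst run, so a high mean across runs cannot hide an occasional catastrophic failure.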

Dispersion across Fixed-Policy Rollouts (DF)

To compute this metric, a fixed, trained policy is rolled out multiple times, and the IQR is calculated on the performance of these rollouts. This evaluates the variability in performance when the same policy is executed repeatedly.

Risk across Fixed-Policy Rollouts (RF)

This metric is similar to Dispersion across Fixed-Policy Rollouts, except that CVaR, rather than the IQR, is applied to the rollout performances.
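Since DF and RF summarise the same per-rollout scores with different statistics, a single sketch covers both (the function name and return convention are illustrative assumptions):

```python
import numpy as np

def rollout_dispersion_and_risk(rollout_scores, alpha=0.05):
    """Sketch of DF and RF: roll out one fixed policy many times,
    then summarise the per-rollout scores with the IQR (DF,
    variability) and with CVaR (RF, worst-case performance)."""
    scores = np.asarray(rollout_scores, dtype=float)
    # DF: spread of the middle 50% of rollout scores.
    q1, q3 = np.percentile(scores, [25, 75])
    df = q3 - q1
    # RF: mean of the worst fraction alpha of rollout scores.
    threshold = np.quantile(scores, alpha)
    rf = scores[scores <= threshold].mean()
    return df, rf
```

A policy with a tight DF but a very low RF is consistent most of the time yet occasionally fails badly, which neither the mean nor the median would reveal.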

All these metrics are available as an open-source library named RL Reliability Metrics.
