Google Introduces Self-Supervised Reversibility-Aware RL Approach

This approach adds a separate reversibility estimation component to the RL method

Google AI has come out with an approach called Reversibility-Aware RL (reinforcement learning). It adds a self-supervised reversibility estimation component to the RL procedure, which can be trained either online, jointly with the RL agent, or offline from a dataset of interactions. Google claims that this method improves the performance of RL agents on several tasks, including the Sokoban puzzle game.

Objective of the approach

In the paper titled “There Is No Turning Back: A Self-Supervised Approach for Reversibility-Aware Reinforcement Learning”, Google says the approach aims to formalise the link between reversibility and precedence estimation, and to show that reversibility can be approximated via temporal order. The paper also proposes a practical algorithm that learns temporal order in a self-supervised way, through simple binary classification on pairs of observations sampled from trajectories, and introduces two novel exploration and control strategies that incorporate reversibility, studying their practical use for directed exploration and safe RL.

When the outcome of an action is irreversible, it is human nature to take more time to evaluate it. The research notes that irreversibility can be positive or negative, and that decision-makers are more likely to anticipate regret for hard-to-reverse decisions. Through this approach, Google explores using irreversibility to guide decision-making. By estimating reversibility and factoring it into the action-selection process, it aims to show that safer behaviours emerge in environments with intrinsic risk factors, and that exploiting reversibility leads to more efficient exploration in environments with undesirable irreversible behaviours.

What is Reversibility-Aware RL?

  • It learns the reversibility of actions from interactions, via a separate estimation component added to the RL procedure. 
  • The model training is self-supervised. It does not require that the data be labelled with the reversibility of the actions.
  • From the context provided by the training data, the model learns which types of actions are reversible.
  • The theoretical grounding is called empirical reversibility, built on precedence estimation: the probability that an event A precedes another event B, given that both A and B happen. 
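The precedence estimation described above can be sketched as a simple self-supervised binary classification problem. The following is a minimal illustration, not the paper's implementation: it assumes a toy one-dimensional trajectory whose observations increase over time (so temporal order is learnable), a hand-rolled logistic regression as the classifier, and function names (`sample_pairs`, `train_precedence`, `precedence`) chosen here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy trajectory: a counter that only ever increases, so temporal
# order is recoverable from the observations themselves.
trajectory = np.arange(20.0)

def sample_pairs(traj, n_pairs, rng):
    """Sample observation pairs from a trajectory; label 1 if the pair
    is presented in its true temporal order, 0 if shuffled."""
    xs, ys = [], []
    for _ in range(n_pairs):
        i, j = sorted(rng.choice(len(traj), size=2, replace=False))
        if rng.random() < 0.5:
            xs.append(traj[j] - traj[i])  # true order: earlier obs first
            ys.append(1.0)
        else:
            xs.append(traj[i] - traj[j])  # shuffled order
            ys.append(0.0)
    return np.array(xs), np.array(ys)

def train_precedence(xs, ys, lr=0.5, epochs=200):
    """Fit a logistic-regression precedence classifier on the pair
    differences via plain gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(w * xs + b)))
        w -= lr * np.mean((p - ys) * xs)
        b -= lr * np.mean(p - ys)
    return w, b

def precedence(obs_a, obs_b, w, b):
    """Estimated probability that obs_a was observed before obs_b."""
    return 1.0 / (1.0 + np.exp(-(w * (obs_b - obs_a) + b)))

xs, ys = sample_pairs(trajectory, 500, rng)
w, b = train_precedence(xs, ys)
```

The reversibility of a transition from `s` to `s'` can then be read off as `precedence(s', s, w, b)`: the estimated probability that `s'` can come before `s`, i.e. that the transition can be undone.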

Image: Google (Precedence estimation consists of predicting the temporal order of randomly shuffled events)

The research provides two methods of integrating reversibility in RL: Reversibility-Aware Exploration (RAE) and Reversibility-Aware Control (RAC). RAE penalises irreversible transitions through a modified reward function, while RAC filters out all irreversible actions, serving as an intermediate layer between the policy and the environment.

A difference between the two is that RAE does not prohibit irreversible actions but merely discourages them. Google says that RAE is better suited for tasks where irreversible actions are suspected to be undesirable the majority of the time.
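The contrast between the two strategies can be sketched in a few lines. This is an illustrative reading, not Google's implementation: the penalty form, the threshold `beta`, and the function names are all choices made here. RAE reshapes the reward but leaves every action available, while RAC vetoes actions before they reach the environment.

```python
import numpy as np

def rae_reward(reward, rev_prob, beta=0.2, penalty=1.0):
    """Reversibility-Aware Exploration (sketch): keep the action available,
    but subtract a penalty when the estimated probability that the
    transition is reversible falls below the threshold."""
    return reward - penalty if rev_prob < beta else reward

def rac_action(q_values, rev_probs, beta=0.2):
    """Reversibility-Aware Control (sketch): mask out actions whose
    estimated reversibility is below the threshold, then pick greedily
    among the remaining ones."""
    q = np.asarray(q_values, dtype=float)
    masked = np.where(np.asarray(rev_probs) >= beta, q, -np.inf)
    if np.all(np.isneginf(masked)):   # every action rejected: fall back
        return int(np.argmax(q))
    return int(np.argmax(masked))
```

For example, `rac_action([1.0, 5.0, 2.0], [0.9, 0.1, 0.8])` picks the action at index 2, because the highest-value action (index 1) is judged irreversible and filtered out.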

Image: Google

Let’s evaluate

Scenario 1

The research introduces a synthetic environment in which an agent in an open field is tasked with reaching a goal. If the agent follows the established pathway, the environment remains unchanged; otherwise, the grass along the path it takes turns brown. Though this changes the environment, there is no penalty for it.


A model-free agent such as a Proximal Policy Optimization (PPO) agent follows the shortest path on average and spoils some of the grass, whereas a PPO+RAE agent avoids all irreversible side-effects.

Scenario 2

In the Cartpole task, where the agent controls a cart to balance a pole standing upright on top of it, the research sets the maximum number of interactions to 50k steps. Here, irreversible actions make the pole fall. The research shows that combining RAC with any RL agent prevents failure, provided that an appropriate threshold for the probability that an action is irreversible is selected.
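A random policy wrapped with RAC, as in this Cartpole experiment, reduces to sampling uniformly among the actions the reversibility estimator still allows. The sketch below is hypothetical: `rev_prob` stands in for a learned estimator, and the default β=0.4 echoes the threshold the experiment reports as effective.

```python
import random

def random_rac_action(n_actions, rev_prob, state, beta=0.4, rng=random):
    """Random policy wrapped with RAC: sample uniformly among the actions
    whose estimated reversibility at this state stays above beta."""
    allowed = [a for a in range(n_actions) if rev_prob(state, a) >= beta]
    if not allowed:                    # every action rejected: act randomly
        allowed = list(range(n_actions))
    return rng.choice(allowed)

# Toy estimator: only action 0 is judged reversible in every state.
toy_rev_prob = lambda state, action: 0.9 if action == 0 else 0.1
```

With `toy_rev_prob`, the wrapped policy always returns action 0, since action 1 falls below the threshold and is filtered out before sampling.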

Image: Google (Cartpole performance of a random policy equipped with RAC evolves with different threshold values (β). Standard model-free agents (DQN, M-DQN) typically score less than 3,000, compared to 50,000 (the maximum score) for an agent governed by a random+RAC policy at a threshold value of β=0.4.)

How it works on the Sokoban puzzle

In Sokoban, the player pushes boxes onto target spaces while avoiding unrecoverable situations; since boxes can only be pushed, never pulled, a box pushed into a corner is stuck for good. In a standard RL setup, early iterations of the agent typically act in a near-random fashion to explore the environment, frequently getting stuck and failing to solve the puzzle.

The research compares the performance of an IMPALA agent in the Sokoban environment with that of an IMPALA+RAE agent, and finds that the agent with the combined IMPALA+RAE policy is deadlocked less frequently.

Image: Google (The scores of IMPALA and IMPALA+RAE on a set of 1,000 Sokoban levels. A new level is sampled at the beginning of each episode. The best score is level-dependent and close to 10.)

Larger impact

The research says that this approach has applications in safety-first scenarios, where irreversible behaviour or side-effects must be avoided. The implication would be safer interactions with RL-powered components such as robots, virtual assistants, and recommender systems, which could become the norm in the future.

Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good.
