Reinforcement Learning That Dreams

Rohit Yadav
Ever wondered how we act instinctively when we face danger, without needing a conscious plan of action? We can do this because we predict how a situation will unfold, even when it is new to us. We do so by building a mental model of the world from what we perceive through our limited senses.

Artificial intelligence needs a similar abstract representation if it is to deliver the desired results in uncharted situations. To achieve this, researchers moved away from the traditional approach of training reinforcement learning (RL) agents on a plethora of data to obtain superior results in a closed environment. Instead, they set out to have RL replicate the human habit of carrying an imagined version of the world around in one’s head.

How Human Imagination Works, And What RL Should Replicate

We don’t necessarily need training in a particular situation to make a quick decision in fairly new circumstances. Because we can anticipate a state before it occurs, we don’t always wait for a situation to happen before acting on it. Traditional RL techniques, in contrast, rely on having experienced the situation first.

For instance, baseball batters have milliseconds to decide how they should swing the bat, which is less than the time it takes for visual signals from the eyes to reach the brain. Humans are able to hit a 100 mph fastball because they can instinctively predict when and where the ball will go. For professional players, this happens subconsciously. Their muscles reflexively swing the bat at the right time and location, in line with the predictions of their brains’ internal models. They can act quickly on their predictions of the future without having to consciously roll out possible future scenarios to form a plan.

The researchers suggest that the predictive model inside our brains may not be about predicting the future in general, but about predicting future sensory data. Consequently, humans can act instinctively on this predictive model and perform fast reflexive behaviours when they face danger, without a pre-planned course of action.

In other words, we dream of a situation. However, the researchers point out that nobody imagines the entire world in their head. One holds only selected concepts and the relationships between them, and uses these to represent the real system.

Researchers’ Approach To Make RL Dream

One can argue that RL already works in a similar way, since it tries to maximise rewards: it predicts future outcomes and adjusts how it makes decisions accordingly. However, it can do this only because it has already been trained in that specific situation. Placed in a different environment, it cannot do the same, because it cannot imagine how events will unfold in new circumstances.

RL algorithms are often bottlenecked by the credit assignment problem, which makes it hard for classic RL algorithms to learn the millions of weights of a large model. In practice, therefore, smaller and lighter networks are used, as they iterate faster to a good policy during training.

However, large RNN-based agents are effective in delivering the desired outcomes. Therefore, the researchers divided the agent into a large world model and a small controller model. They first trained the large neural network in an unsupervised manner to learn about the world, and then trained the smaller model to perform a task using the world model.

The smaller model lets the training algorithm focus on the credit assignment problem in a small search space, without sacrificing the capacity and expressiveness provided by the larger world model. By training the agent through the lens of its world model, the researchers found that the agent can learn a highly compact policy to perform its task.

The idea is to replace the actual RL environment with a generated one: train the agent’s controller only inside the environment generated by its own internal world model, and then transfer this policy back into the actual environment.

Agents often exploit the imperfections of generated environments. Consequently, the researchers trained the agent inside a noisier, more uncertain version of the generated environment to prevent it from taking advantage of the imperfections of its internal world model.
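One way to picture a noisier, more uncertain generated environment is to widen the spread of the world model's predictions before sampling from them. The sketch below assumes a hypothetical temperature parameter `tau` that scales the variance of a Gaussian prediction. The name `tau`, the function `sample_next_latent` and the scaling rule are illustrative assumptions, not details given in this article.

```python
import math
import random

def sample_next_latent(mean, std, tau=1.0, rng=random):
    """Sample a predicted latent value; tau > 1 makes the dream noisier."""
    # Scaling the standard deviation by sqrt(tau) scales the variance by tau.
    return rng.gauss(mean, std * math.sqrt(tau))

def spread(xs):
    """Empirical standard deviation of a list of samples."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

rng = random.Random(0)
calm = [sample_next_latent(0.0, 1.0, tau=0.1, rng=rng) for _ in range(2000)]
noisy = [sample_next_latent(0.0, 1.0, tau=1.5, rng=rng) for _ in range(2000)]

# The higher-temperature samples are more spread out, so a controller
# trained on them cannot rely on the model's predictions being exact.
print(spread(calm) < spread(noisy))
```

A controller that only does well when the predictions are exact would score poorly under the noisier setting, which is the effect the paragraph above describes.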

What Researchers Did

Inspired by the human cognitive system, the agent is equipped with a visual sensory component that compresses what it sees into a small representative code. It also has a memory component that makes predictions about future codes based on historical information. Finally, the agent has a decision-making component that decides what actions to take based only on the representations created by its vision and memory components.

All three models — Vision model (V), Memory RNN (M), and Controller (C) — work in tandem to empower the agent in making human-like decisions.

Variational Autoencoder (VAE) model

  • At each time step, the agent receives a 2D image frame that is part of a video sequence, and this frame is the input to the VAE (V) model
  • The V model compresses each frame it receives into a low-dimensional latent vector (z_t)
  • The latent vector can be used to reconstruct the original image
  • Working with this compact code instead of raw pixels keeps the inputs to M and C small. The noisier training environment discussed earlier is a separate measure, meant to stop the agent from exploiting the flaws of the virtual environment
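The compression step that V performs can be sketched as follows. This is a minimal illustration only: a real V model would be a trained neural network, whereas the random linear maps `W_mu` and `W_logvar`, the frame size and the latent size used here are hypothetical stand-ins. The sampling rule z = mu + sigma * eps is the standard way a VAE draws a latent vector.

```python
import math
import random

rng = random.Random(0)
FRAME_SIZE = 16   # hypothetical: a flattened 4x4 grayscale frame
LATENT_SIZE = 3   # hypothetical size of the latent vector z_t

def random_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Stand-ins for trained encoder weights (assumption, not from the article).
W_mu = random_matrix(LATENT_SIZE, FRAME_SIZE)
W_logvar = random_matrix(LATENT_SIZE, FRAME_SIZE)

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def encode(frame):
    """Compress a frame into a latent sample z = mu + sigma * eps."""
    mu = matvec(W_mu, frame)
    logvar = matvec(W_logvar, frame)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

frame = [rng.random() for _ in range(FRAME_SIZE)]
z_t = encode(frame)
print(len(frame), "->", len(z_t))
```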

Mixture Density Network – MDN-RNN (M) Model

  • While the V model compresses what the agent sees at each time step, the M model compresses what happens over time
  • The M model predicts the future latent vector (z) that the V model is expected to produce
  • Because complex environments are stochastic, the RNN was trained to output a probability density function p(z) rather than a deterministic prediction of z
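To make the idea of predicting a density p(z) rather than a single value more concrete, the sketch below samples from a one-dimensional mixture of Gaussians, which is the kind of distribution a mixture density network outputs. The mixture weights, means and standard deviations are made-up numbers standing in for what the RNN would produce, and `sample_mixture` is a hypothetical helper.

```python
import random

def sample_mixture(weights, means, stds, rng):
    """Draw one value from a 1-D Gaussian mixture p(z)."""
    # First pick a mixture component, then sample from that Gaussian.
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], stds[k])

rng = random.Random(0)

# Hypothetical RNN output for one latent dimension: two possible futures.
weights = [0.7, 0.3]
means = [-1.0, 2.0]
stds = [0.1, 0.1]

samples = [sample_mixture(weights, means, stds, rng) for _ in range(1000)]

# Roughly 70% of the samples should fall near the first mode.
share_first = sum(s < 0.5 for s in samples) / len(samples)
print(round(share_first, 2))
```

A deterministic prediction would have to pick one value, typically somewhere between the two modes, which corresponds to neither of the futures the model considers likely.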

Controller (C) Model

  • The C model determines which actions to take in order to maximise the agent’s expected cumulative reward during a rollout of the environment
  • The C model was deliberately kept as simple and small as possible, and was trained separately from V and M, so that most of the agent’s complexity resides in the world model (V and M)

C is a single-layer linear model that maps z_t and h_t directly to the action a_t at each time step:

a_t = W_c [z_t h_t] + b_c

where a_t is the action vector, and W_c and b_c are the weight matrix and bias vector that map the concatenated input vector [z_t h_t] to it. Here h_t is the hidden state of the M model.
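The linear controller above can be written in a few lines. The following is a minimal sketch with made-up weights and inputs. In the code, `z_t`, `h_t`, `W_c`, `b_c` and `a_t` correspond to the symbols in the formula, and the tiny dimensions are chosen only so the arithmetic is easy to check by hand.

```python
def controller(z_t, h_t, W_c, b_c):
    """Compute a_t = W_c [z_t h_t] + b_c for one time step."""
    x = list(z_t) + list(h_t)  # concatenated input vector [z_t h_t]
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(W_c, b_c)]

# Hypothetical values: a 2-D latent vector, a 1-D hidden state, 2 actions.
z_t = [1.0, 2.0]
h_t = [3.0]
W_c = [[1.0, 0.0, 0.0],
       [0.5, 0.5, 1.0]]
b_c = [0.0, -1.0]

a_t = controller(z_t, h_t, W_c, b_c)
print(a_t)  # [1.0, 3.5]
```

With inputs of this size the controller has only eight parameters (six weights and two biases), which illustrates why the search space for C stays small.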

Bringing Every Model Together

The researchers believe that the minimal design for C offers important practical benefits. Large, sophisticated models can be trained efficiently as long as a well-behaved, differentiable loss function can be defined, and this is the case for V and M.

The V and M models were trained efficiently with the backpropagation algorithm on modern GPU accelerators, so most of the model’s complexity and parameters were kept in the world model (V and M). With C kept small, the researchers could explore unconventional ways of training it to tackle challenging RL tasks where the credit assignment problem is difficult.
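As an illustration of what a gradient-free way of training a small controller could look like, the sketch below uses a basic (1+λ) evolution strategy: perturb the current parameters, evaluate each candidate, and keep the best one if it improves. This article does not say here which optimizer the researchers used, so the strategy is swapped in purely as an example, and the `fitness` function is a hypothetical stand-in for the cumulative reward of a rollout.

```python
import random

def fitness(params):
    """Stand-in for the cumulative reward of one rollout (hypothetical)."""
    target = [0.5, -0.3, 0.8]
    return -sum((p - t) ** 2 for p, t in zip(params, target))

def evolve(n_params=3, population=20, generations=50, sigma=0.1, seed=0):
    """Basic (1+lambda) evolution strategy over a small parameter vector."""
    rng = random.Random(seed)
    parent = [0.0] * n_params
    parent_fit = fitness(parent)
    for _ in range(generations):
        # Create candidates by adding Gaussian noise to the parent.
        offspring = [[p + rng.gauss(0.0, sigma) for p in parent]
                     for _ in range(population)]
        champion = max(offspring, key=fitness)
        champion_fit = fitness(champion)
        # Keep the best candidate only if it beats the current parent.
        if champion_fit > parent_fit:
            parent, parent_fit = champion, champion_fit
    return parent, parent_fit

params, score = evolve()
print(score > fitness([0.0, 0.0, 0.0]))
```

Because only the final score of each candidate is needed, a method like this never has to assign credit to individual time steps, which is practical only when the parameter vector is small.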

The Model Was Put To A Car Racing Test

Before the agent was deployed on the racing track to see how the world model approach helps in solving the car racing task, the V model was trained on a dataset of 10,000 random rollouts of the environment. The trained V model was then used to pre-process the image frames of the environment in order to train the M model.

In this setup, the world model (V and M) has no knowledge of the actual reward signals from the environment. Its task is only to compress and predict the sequence of image frames observed. The C model, however, has access to reward information from the environment.
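The division of labour described above can be sketched as a single rollout loop. Everything here is a toy stand-in: `ToyEnv` is a hypothetical one-dimensional environment rather than the car racing task, and `encode`, `update_memory` and `controller` are simple placeholders for V, M and C. The point of the sketch is only the data flow, in which the reward never reaches V or M.

```python
import random

class ToyEnv:
    """Hypothetical 1-D environment; reward is higher when the action
    is close to the next observation."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [self.rng.random()]

    def step(self, action):
        self.steps += 1
        obs = [self.rng.random()]
        reward = -abs(action - obs[0])
        done = self.steps >= 10
        return obs, reward, done

def encode(obs):
    """Stand-in for V: observation -> latent code z_t."""
    return [obs[0] * 2.0 - 1.0]

def update_memory(h, z, a):
    """Stand-in for M: folds z_t and a_t into the next hidden state."""
    return [0.8 * h[0] + 0.1 * z[0] + 0.1 * a]

def controller(z, h, w, b):
    """Stand-in for C: linear map of [z_t h_t] to an action."""
    return w[0] * z[0] + w[1] * h[0] + b

def rollout(env, w, b):
    obs, h, total, done = env.reset(), [0.0], 0.0, False
    while not done:
        z = encode(obs)                  # V sees the frame, not the reward
        a = controller(z, h, w, b)       # C acts on z_t and h_t only
        obs, reward, done = env.step(a)
        total += reward                  # reward is used only to score C
        h = update_memory(h, z, a)       # M sees z_t and a_t, not the reward
    return total

env = ToyEnv()
total_reward = rollout(env, w=[0.5, 0.0], b=0.5)
print(total_reward)
```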

Experiment Results

The researchers first evaluated the agent with the V model alone. While the agent was able to navigate the racing track, it wobbled and missed the track on sharper corners. Then the full world model (V and M) was evaluated. The agent drove better and more stably, and it was able to take sharp corners effectively.

Since h_t contains information about the probability distribution of the future, the agent can query the RNN instinctively to guide its action decisions. And because the agent can model the future, it can come up with hypothetical car racing scenarios of its own. In other words, it can dream. Placing C in the dream environment generated by M enables the agent to make human-like decisions.
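A dream rollout can be pictured as a loop in which the memory model, not the real environment, supplies the next latent state. The sketch below is a toy illustration under strong assumptions: `dream_step` is a hypothetical stand-in for M with made-up dynamics, the latent state is a single number, and the controller is a one-parameter linear map.

```python
import random

def dream_step(z, a, rng):
    """Stand-in for M: predicts the next latent value from z_t and a_t,
    with some noise to reflect an uncertain prediction."""
    return 0.9 * z + 0.1 * a + rng.gauss(0.0, 0.05)

def controller(z, w, b):
    """Stand-in for C: linear map from the latent value to an action."""
    return w * z + b

def dream_rollout(z0, w, b, steps=20, seed=0):
    rng = random.Random(seed)
    z, trajectory = z0, [z0]
    for _ in range(steps):
        a = controller(z, w, b)
        z = dream_step(z, a, rng)  # no real environment is consulted
        trajectory.append(z)
    return trajectory

trajectory = dream_rollout(z0=1.0, w=-0.5, b=0.0)
print(len(trajectory))  # 21
```

Because every step is generated by the model, such rollouts need no real frames to be rendered.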

Future Direction

This approach of creating a world model offers several practical benefits. Firstly, it brings agents a step closer to the human-like, model-based decision-making described above. Secondly, the world model has the potential to reduce the need for heavy compute resources in ML initiatives: because the world model is differentiable, developers could use backpropagation directly through it to fine-tune an agent’s policy to maximise an objective function.

However, the limited capacity of a world model means it cannot cater to all the needs of developers. A world model cannot store every piece of information, which limits any particular world model to a few use cases. Besides, unlike human brains, which can hold decades of information, neural networks trained with backpropagation have limited capacity and suffer from issues such as catastrophic forgetting. Nevertheless, world models can be created for various scenarios to generate environments in which developers train agents for a complicated world by dreaming.

Check the Paper by David Ha and Juergen Schmidhuber here.


Copyright Analytics India Magazine Pvt Ltd
