Using Asynchronous Methods For Deep Reinforcement Learning


Abhishek Sharma

Machine Learning applications have propelled artificial intelligence towards increasingly impressive results. This can be largely attributed to research and development in areas like neural networks, particularly deep neural networks. Advancements in these networks have enabled other areas of ML, such as reinforcement learning (RL), to grow in parallel.

RL gets its inspiration from behavioural psychology: software entities known as 'agents' learn by interacting with an environment and maximising a numerical feedback signal called a 'reward'. Although RL algorithms employ neural networks as function approximators, they are sometimes unstable during learning. This issue has challenged ML researchers for a while, and numerous solutions have been proposed to stabilise the performance of RL algorithms.

In this article, we will focus on one specific study by researchers at Google's DeepMind, titled Asynchronous Methods for Deep Reinforcement Learning, which combines RL with asynchronous gradient descent optimisation.

Foundation For Asynchronous Applications

Online RL algorithms update their policy incrementally from the data the agent encounters at the moment, so each update depends on the action just taken. Researchers have improved this process with a technique called experience replay, which stores past transitions and samples from them during training. However, experience replay takes a toll on computing resources such as memory and processing power, and it learns from data generated by an older RL policy.
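To make the memory cost concrete, here is a minimal sketch of an experience-replay buffer of the kind DQN-style agents use. The class and method names (`ReplayBuffer`, `push`, `sample`) are illustrative, not from the paper:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # Old transitions are evicted once capacity is reached, but the
        # buffer still holds data generated by older policies.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks temporal correlation between updates,
        # at the cost of storing many transitions in memory.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.push(i, 0, 1.0, i + 1, False)
print(len(buf.buffer))  # 3: the two oldest transitions were evicted
```

A real DQN buffer holds on the order of a million transitions, which is exactly the memory footprint the asynchronous approach avoids.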

To resolve this, asynchronous methods (computing processes that run independently and in parallel) were developed. Instead of relying on replayed experience, the method runs multiple agents in parallel, each in its own instance of the environment, which decorrelates the data the learner sees. In the study mentioned earlier, the asynchronous method is applied to standard RL algorithms such as Sarsa (state-action-reward-state-action), one-step and n-step Q-learning, and actor-critic methods. The authors also emphasise the computing benefits these methods provide, demonstrating that they work on a standard multi-core CPU instead of the powerful GPUs generally used in a deep learning environment.

Asynchronous Method

The study takes the standard RL setting as the backdrop for developing asynchronous algorithms. The researchers follow a two-step approach. First, they run asynchronous actor-learners as multiple CPU threads on a single machine, rather than as separate distributed machines, which keeps communication costs between learners low and makes efficient updates possible. Second, they give the different actor-learners different exploration policies, so their online updates, applied in parallel, are less correlated. These parallel actor-learners bring manifold benefits, such as reducing training time and making stable online RL practical.
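As a rough illustration of the single-machine setup, the sketch below runs several actor-learner threads against one shared parameter vector. The environment is faked with random gradients, and all names (`shared_theta`, `actor_learner`) are hypothetical:

```python
import threading
import random

shared_theta = [0.0]    # shared parameters (one weight, for brevity)
global_counter = [0]    # shared step counter T

def actor_learner(steps, lr=0.01):
    for _ in range(steps):
        # Each thread uses its own randomness, standing in for the
        # different exploration policies that decorrelate the learners.
        fake_gradient = random.uniform(-1.0, 1.0)
        # Lock-free update in the spirit of the paper's single-machine
        # setting, where occasional overwrites are tolerated.
        shared_theta[0] += lr * fake_gradient
        global_counter[0] += 1

threads = [threading.Thread(target=actor_learner, args=(100,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(global_counter[0])  # close to 400; lock-free increments may collide
```

In the real system each thread would run a full environment loop and accumulate actual loss gradients, but the shared-parameter, no-lock structure is the same.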

Here is the pseudocode the authors give for each actor-learner thread (asynchronous one-step Q-learning):

// Assume global shared θ, θ⁻, and counter T = 0
Initialize thread step counter t ← 0
Initialize target network weights θ⁻ ← θ
Initialize network gradients dθ ← 0
Get initial state s
repeat
    Take action a with ϵ-greedy policy based on Q(s, a; θ)
    Receive new state s′ and reward r
    y = r for terminal s′; r + γ max_a′ Q(s′, a′; θ⁻) for non-terminal s′
    Accumulate gradients wrt θ: dθ ← dθ + ∂(y − Q(s, a; θ))²/∂θ
    s = s′
    T ← T + 1 and t ← t + 1
    if T mod I_target == 0 then
        Update the target network θ⁻ ← θ
    end if
    if t mod I_AsyncUpdate == 0 or s is terminal then
        Perform asynchronous update of θ using dθ
        Clear gradients dθ ← 0
    end if
until T > T_max
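The loop above can be sketched as runnable code on a toy problem. The following assumes a three-state chain environment, a tabular Q in place of a neural network, and illustrative hyperparameters; a real implementation would run many such loops in parallel threads sharing θ:

```python
import random
import numpy as np

random.seed(0)
N_STATES, N_ACTIONS = 3, 2
theta = np.zeros((N_STATES, N_ACTIONS))   # shared parameters θ
theta_target = theta.copy()               # target parameters θ⁻
T, T_MAX = 0, 2000                        # global shared counter and limit
I_TARGET, I_ASYNC = 100, 5                # update intervals
GAMMA, LR, EPSILON = 0.9, 0.1, 0.5

def env_step(s, a):
    """Toy dynamics: action 1 moves right, action 0 resets to the start."""
    s_next = s + 1 if a == 1 else 0
    terminal = s_next == N_STATES - 1
    return s_next, (1.0 if terminal else 0.0), terminal

t, s = 0, 0
d_theta = np.zeros_like(theta)            # accumulated gradients dθ
while T <= T_MAX:
    # Take action a with ϵ-greedy policy based on Q(s, a; θ)
    if random.random() < EPSILON:
        a = random.randrange(N_ACTIONS)
    else:
        a = int(np.argmax(theta[s]))
    s_next, r, terminal = env_step(s, a)
    # y = r for terminal s'; r + γ max_a' Q(s', a'; θ⁻) otherwise
    y = r if terminal else r + GAMMA * np.max(theta_target[s_next])
    # Accumulate the (tabular) gradient of (y − Q(s, a; θ))²
    d_theta[s, a] += y - theta[s, a]
    s = 0 if terminal else s_next
    T += 1
    t += 1
    if T % I_TARGET == 0:
        theta_target = theta.copy()       # θ⁻ ← θ
    if t % I_ASYNC == 0 or terminal:
        theta += LR * d_theta             # asynchronous update of θ
        d_theta = np.zeros_like(theta)    # clear gradients dθ ← 0

print(np.round(theta, 2))                 # learned Q-values
```

Note how gradients are accumulated for `I_ASYNC` steps before being applied, exactly the batching the pseudocode uses to cut down on conflicting writes between threads.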

With this pseudocode as a template, asynchronous variants of the following existing algorithms are developed:


  • Asynchronous One-step Q-learning: each thread interacts with its own copy of the environment and computes a gradient of the Q-learning loss at every step. Gradients are accumulated over several steps before being applied, which lowers the chances of threads overwriting each other's updates.
  • Asynchronous One-step Sarsa: similar to the above algorithm, except for the target value for Q(s, a), which is given by r + γ Q(s′, a′; θ⁻), where a′ is the action actually taken in the next state s′.
  • Asynchronous n-step Q-learning: computes n-step returns, operating in the 'forward view' (looking ahead over future rewards) rather than the conventional 'backward view' of eligibility traces. For a single update, it follows its exploration policy for up to n steps and then computes gradients for n-step Q-learning updates of each state-action pair visited.
  • Asynchronous Advantage Actor-Critic (A3C): follows the same 'forward view' approach, except that it maintains both a policy and a value function, updating the policy using the value function as a baseline.
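The 'forward view' n-step return used by the last two variants can be sketched as follows. The function name and the example numbers are illustrative; the recursion works backwards from a bootstrapped value at the end of the rollout:

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Compute n-step returns for a rollout of rewards.

    Working backwards from the bootstrap value, each earlier state gets
    the longest available return:
        R_t = r_t + γ r_{t+1} + ... + γ^k V(s_{t+k})
    """
    returns = []
    R = bootstrap_value
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))

# Three-step rollout with a single reward at the end and a bootstrapped
# value of 0.5 for the state after the rollout.
rs = n_step_returns([0.0, 0.0, 1.0], bootstrap_value=0.5, gamma=0.9)
print(rs)  # earliest state gets the most heavily discounted return
```

This single backward pass gives every state in the rollout its own target, which is what lets one n-step update touch several state-action pairs at once.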


These algorithms were tested on a range of tasks, including Atari 2600 games. They show a significant reduction in training time (A3C trains in as little as one day, compared to eight days for a deep Q-network). They are also computed on CPUs instead of GPUs, and prove more stable across a range of learning rates.


Asynchronous methods in RL are resource-friendly and can run in a small-scale computing environment. They show improved data efficiency and faster training. Integrating them with existing RL algorithms therefore reduces the computing resources required while maintaining accuracy, even when building large neural networks.

Copyright Analytics India Magazine Pvt Ltd
