On-Policy VS Off-Policy Reinforcement Learning

A reinforcement learning system consists of four main elements:

  1. An agent
  2. A policy 
  3. A reward signal, and 
  4. A value function

An agent’s behaviour at any point in time is defined by its policy. A policy is a mapping from the states the agent perceives in an environment to the actions it takes in those states.

In the next section, we shall talk about the key differences between the two main kinds of policy learning:

  • On-policy reinforcement learning
  • Off-policy reinforcement learning

On-Policy VS Off-Policy

Comparing reinforcement learning algorithms, for example during hyperparameter optimization, is expensive and often practically infeasible, so their performance is usually evaluated via on-policy interactions with the target environment. These interactions give insight into the policy that the agent is actually following.

An off-policy learner, by contrast, learns independently of the actions the agent actually takes. It estimates the value of the optimal policy regardless of how the agent behaves while collecting experience. For example, Q-learning is an off-policy learner.

On-policy methods attempt to evaluate or improve the policy that is used to make decisions. In contrast, off-policy methods evaluate or improve a policy different from that used to generate the data.

Richard Sutton’s book on reinforcement learning discusses off-policy and on-policy learning with regard to Q-learning and SARSA respectively. The two are summarised below.


In Q-learning, the agent learns the value of the greedy (optimal) policy while behaving according to a different, typically more exploratory, policy. Q-learning is called off-policy because the policy being updated is different from the behaviour policy. In other words, its update uses the best estimated value of the next state's actions, whether or not the agent actually takes the greedy action there.
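
The Q-learning update described above can be sketched as follows. This is a minimal tabular sketch, not code from Sutton's book: the function name `q_learning_update`, the step size `alpha`, the discount `gamma`, and the toy states and actions are all illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Off-policy target: bootstrap from the best-valued action in next_state,
    # whichever action the behaviour policy actually takes there.
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q[(state, action)]


# Toy usage with hypothetical states "s0", "s1" and actions "left", "right".
Q = defaultdict(float)
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 2.0
new_value = q_learning_update(Q, "s0", "right", reward=1.0,
                              next_state="s1", actions=["left", "right"])
```

Here the target is built from the value of "right" in `s1` (the larger of the two), even if an exploring agent goes "left" on its next step.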


SARSA (state-action-reward-state-action) is an on-policy reinforcement learning algorithm that estimates the value of the policy being followed. In this algorithm, the agent acts with the same policy that it is learning about and improving. The policy used for updating and the policy used for acting are the same, unlike in Q-learning. This is an example of on-policy learning.

An experience in SARSA is of the form ⟨S, A, R, S’, A’⟩, which consists of

  • the current state S,
  • the current action A,
  • the reward R,
  • the new state S’, and
  • the next action A’, chosen in S’ by the same policy.

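This experience is used to move the current estimate Q(S,A) towards the target R+γQ(S’,A’):

Q(S,A) ← Q(S,A) + α[R + γQ(S’,A’) − Q(S,A)]

where α is the step size and γ is the discount factor.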
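
The SARSA update can be sketched in a few lines. This is a minimal tabular sketch under stated assumptions: the names `epsilon_greedy` and `sarsa_update`, the step size `alpha`, the discount `gamma`, and the toy states and actions are illustrative and not taken from the source. Epsilon-greedy is used here only as one common choice of exploring policy.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-valued one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.5, gamma=0.9):
    """Apply one tabular SARSA update in place and return the new value."""
    # On-policy target: bootstrap from the action A' that the agent's own
    # policy selected in S', not from the best-valued action.
    target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q[(state, action)]


# Toy usage with hypothetical states "s0", "s1" and actions "left", "right".
actions = ["left", "right"]
Q = defaultdict(float)
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 2.0

# Suppose the exploring policy happened to pick "left" in s1.
sarsa_value = sarsa_update(Q, "s0", "right", reward=1.0,
                           next_state="s1", next_action="left")

# With epsilon = 0 the same policy is purely greedy.
greedy_choice = epsilon_greedy(Q, "s1", actions, epsilon=0.0,
                               rng=random.Random(0))
```

Because the target follows the action actually chosen, the exploratory "left" step lowers the estimate for `("s0", "right")` compared with what a greedy target would give. In a full training loop, `next_action` would come from `epsilon_greedy` and then be executed on the following step.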


On-policy reinforcement learning is useful when you want to optimize the value of an agent that is exploring. For offline learning, where the agent does not explore much, off-policy RL may be more appropriate.

For instance, off-policy classification is good at predicting movement in robotics. Off-policy learning can be very cost-effective when deploying reinforcement learning in real-world scenarios. An agent's ability to explore, find new ways of doing things and account for future rewards makes it a suitable candidate for flexible operations. Imagine a robotic arm tasked with painting something other than what it was trained on. Physical systems need such flexibility to be smart and reliable. Rather than hardcoding every use case today, the goal is to learn on the go.

However, off-policy frameworks are not without disadvantages. Evaluation becomes challenging because there is so much exploration. These algorithms may assume that an off-policy evaluation method accurately assesses performance, but agents trained on past experiences may act very differently from newly learned agents, which makes it hard to get good estimates of performance.

Promising directions for future work include developing off-policy methods that are not restricted to tasks with success-or-failure rewards, and extending the analysis to stochastic tasks as well.

For more information, check Richard Sutton’s book.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
