# All you need to know about SARSA in Reinforcement Learning

SARSA is a reinforcement learning algorithm that learns from the states and actions the agent actually experiences, and it improves the same policy that it uses to act.

Reinforcement learning trains an agent on the principle of actions and rewards. Several algorithms fall under this umbrella, and SARSA, which stands for State Action Reward State Action, is one of them. In this article, let us try to understand the SARSA algorithm of reinforcement learning.

1. The SARSA algorithm
2. How is SARSA different from the Q-learning algorithm?
3. How to Use SARSA Practically?
4. Analyzing States and Rewards of SARSA through plots
5. Summary

## The SARSA algorithm

State Action Reward State Action (SARSA) is a reinforcement learning algorithm closely related to Q-learning. The major point that differentiates the two is the Q-value update: Q-learning updates toward the maximum Q-value over all actions in the next state, whereas SARSA updates toward the Q-value of the action the agent actually takes in the next state.

Of the two learning approaches available to an agent, on-policy and off-policy, SARSA is on-policy: the agent learns from the actions chosen by the same policy it is following. Because the update contains no maximum operation, SARSA does not assume that the agent will act greedily in the next state, as Q-learning does.

In short, SARSA differs from Q-learning only in the update target: SARSA bootstraps from the next action actually chosen, while Q-learning bootstraps from the best next action.
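A minimal sketch of the two update targets may help here. The function names and the Q-values below are made up for illustration and do not come from any library.

```python
def sarsa_target(reward, gamma, q_next, next_action):
    """On-policy target: bootstrap from the action actually chosen next."""
    return reward + gamma * q_next[next_action]


def q_learning_target(reward, gamma, q_next):
    """Off-policy target: bootstrap from the best next action."""
    return reward + gamma * max(q_next)


# Hypothetical Q-values of the next state for two actions (illustrative only).
q_next = [2.0, 5.0]
reward, gamma = -1.0, 1.0

# Suppose the epsilon-greedy policy explored and picked action 0 next.
sarsa_value = sarsa_target(reward, gamma, q_next, next_action=0)
q_value = q_learning_target(reward, gamma, q_next)
print(sarsa_value, q_value)  # 1.0 4.0
```

When the next action happens to be the greedy one, both targets coincide. They differ only on exploratory steps, which is why SARSA's estimates reflect the cost of exploration.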

## How to Use SARSA Practically?

To explore SARSA practically, let us design a learning policy that lets the agent choose an action in each state and receive rewards. It follows SARSA's basic principle: the update uses the action actually taken next rather than the maximum over the next actions. The same approach can be used to simulate simple games or to search for good solutions to control problems.

Let us implement SARSA as a user-defined function with the following arguments.

• Environment (env): The OpenAI Gym environment the agent interacts with
• Number of episodes: Number of episodes the agent is trained for
• Learning rate (alpha): Step size of each Q-value update
• Discount factor: How much future rewards count relative to immediate rewards
• Epsilon: Probability, between 0 and 1, of taking a random action
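To make the discount factor concrete, here is a small sketch that computes a discounted return. The helper name `discounted_return` and the reward sequence are hypothetical, chosen only for illustration.

```python
def discounted_return(rewards, discount_factor):
    """Sum of rewards, each weighted by discount_factor ** (steps into the future)."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (discount_factor ** step) * reward
    return total


# Hypothetical episode: three steps, each costing a reward of -1.
rewards = [-1.0, -1.0, -1.0]

undiscounted = discounted_return(rewards, 1.0)  # every step counts fully
discounted = discounted_return(rewards, 0.5)    # later steps count less
print(undiscounted, discounted)  # -3.0 -1.75
```

With a discount factor of 1.0, as in the default used later in this article, all future rewards count equally.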

Before writing the SARSA function, let us write a helper function that builds the policy the agent follows as the algorithm iterates over the different states.

Let us first import the required libraries and clone Denny Britz's reinforcement-learning GitHub repository, which provides the environment and the plotting helpers used below.

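```python
!git clone https://github.com/dennybritz/reinforcement-learning/
%matplotlib inline

import gym
import itertools
import matplotlib
import numpy as np
import pandas as pd
import sys

## Make the repository's `lib` package importable before importing from it.
## Assumption: the notebook runs from the directory where the repository was cloned.
for path in ("reinforcement-learning", "../"):
    if path not in sys.path:
        sys.path.append(path)

from collections import defaultdict
from lib.envs.windy_gridworld import WindyGridworldEnv
from lib import plotting

matplotlib.style.use('ggplot')
```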

Now let us create an instance of the Windy Gridworld environment.

`env=WindyGridworldEnv()`

Now let us create the epsilon-greedy learning policy that the SARSA agent will follow in this environment.

```python
def make_epsilon_greedy_policy(Q, epsilon, nA):  ## Creating a learning policy
    def policy_fn(observation):
        A = np.ones(nA, dtype=float) * epsilon / nA  ## Every action starts with probability epsilon / nA
        best_action = np.argmax(Q[observation])  ## Action with the highest Q-value, found using argmax
        A[best_action] += (1.0 - epsilon)  ## The greedy action gets the remaining probability mass
        return A  ## Action probabilities, which sum to 1
    return policy_fn
```
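To see what such a policy returns, here is a dependency-free restatement of the same idea using plain Python lists. The function name `epsilon_greedy_probs` and the Q-values are hypothetical and serve only as an illustration.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return one probability per action for an epsilon-greedy policy."""
    n_actions = len(q_values)
    # Every action gets an equal share of the exploration probability.
    probs = [epsilon / n_actions] * n_actions
    # The first action with the highest Q-value is treated as the greedy one.
    best_action = max(range(n_actions), key=lambda a: q_values[a])
    probs[best_action] += 1.0 - epsilon
    return probs


# Hypothetical Q-values for one state with four actions.
probs = epsilon_greedy_probs([0.0, 1.5, 0.2, -0.3], epsilon=0.1)
print(probs)  # roughly 0.925 for action 1 and 0.025 for each other action
```

With epsilon set to 0.1, the greedy action is chosen most of the time, while every other action keeps a small chance of being explored.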

Using this policy, let us train the agent. In each state it takes an action, collects the reward, and uses that reward together with the Q-value of the next state and next action to update its estimate, as in the user-defined function below.

```python
def sarsa(env, num_episodes, discount_factor=1.0, alpha=0.5, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))  ## Q-table: one array of action values per state
    stats = plotting.EpisodeStats(episode_lengths=np.zeros(num_episodes),
                                  episode_rewards=np.zeros(num_episodes))  ## Per-episode lengths and rewards for plotting
    policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)  ## Providing the agent the learning policy

    ## Running the episodes
    for i_episode in range(num_episodes):
        # Print out which episode we're on, useful for debugging.
        if (i_episode + 1) % 100 == 0:
            print("\rEpisode {}/{}.".format(i_episode + 1, num_episodes), end="")
            sys.stdout.flush()

        # Reset the environment and pick the first action
        state = env.reset()
        action_probs = policy(state)
        action = np.random.choice(np.arange(len(action_probs)), p=action_probs)

        # One step in the environment
        for t in itertools.count():
            next_state, reward, done, _ = env.step(action)  ## Taking a step
            next_action_probs = policy(next_state)  ## Picking the next action with the same policy
            next_action = np.random.choice(np.arange(len(next_action_probs)), p=next_action_probs)
            stats.episode_rewards[i_episode] += reward  ## Accumulating the reward received in this episode
            stats.episode_lengths[i_episode] = t
            td_target = reward + discount_factor * Q[next_state][next_action]  ## Target uses the action actually chosen next
            td_delta = td_target - Q[state][action]
            Q[state][action] += alpha * td_delta

            if done:
                break

            action = next_action
            state = next_state

    return Q, stats
```

Now that the function tracks the agent's steps, actions, and rewards, let us train the agent for the required number of episodes.

`Q,stats = sarsa(env, 500)`

Now that the SARSA agent has been trained for the required number of episodes, let us use the plotting module from the cloned GitHub repository to visualize statistics such as the length of each episode and the reward the agent received in each episode.
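If the repository or its dependencies are not available, the same update loop can be tried on a toy problem. The sketch below uses only the Python standard library and a hypothetical one-dimensional corridor. The names `N_STATES`, `ACTIONS`, `step`, `choose_action`, and `toy_sarsa` are invented for this illustration and are not part of the repository.

```python
import random
from collections import defaultdict

N_STATES = 6        # hypothetical corridor: states 0..5, goal at state 5
ACTIONS = (-1, +1)  # action 0 moves left, action 1 moves right


def step(state, action_index):
    """Hypothetical environment: -1 reward per step, episode ends at the goal."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, -1.0, done


def choose_action(Q, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    values = Q[state]
    return max(range(len(ACTIONS)), key=lambda a: values[a])


def toy_sarsa(num_episodes, discount_factor=1.0, alpha=0.5, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0] * len(ACTIONS))
    episode_lengths = []
    for _ in range(num_episodes):
        state = 0
        action = choose_action(Q, state, epsilon, rng)
        steps, done = 0, False
        while not done:
            next_state, reward, done = step(state, action)
            # The next action comes from the same policy (on-policy).
            next_action = choose_action(Q, next_state, epsilon, rng)
            td_target = reward + discount_factor * Q[next_state][next_action]
            Q[state][action] += alpha * (td_target - Q[state][action])
            state, action = next_state, next_action
            steps += 1
        episode_lengths.append(steps)
    return Q, episode_lengths


toy_Q, toy_lengths = toy_sarsa(num_episodes=200)
print("mean length of last 20 episodes:", sum(toy_lengths[-20:]) / 20)
```

The shortest possible episode in this corridor takes five steps, so episode lengths close to five suggest the agent has learned to walk straight to the goal.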

## Analyzing States and Rewards of SARSA through plots

The episodes the agent went through while learning under its policy can be visualized with the plotting module of the lib package, which shows how long each episode took and how much reward the agent earned.

`plotting.plot_episode_stats(stats)`

Let us try to interpret the above plots one by one.

The first plot shows the length of each episode, that is, the number of steps the agent needed before the episode ended. The second plot shows the reward per episode: as the agent gradually learns, episodes take significantly less time and the reward improves. The third plot shows the number of completed episodes against the cumulative number of time steps. Its slope increases because later episodes finish in fewer steps.
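Reward curves are noisy from one episode to the next, so they are usually smoothed before plotting. A minimal sketch of a moving average follows. The helper name `moving_average` and the reward values are made up for illustration, and the repository's own smoothing may differ in its details.

```python
def moving_average(values, window):
    """Mean of each full window of consecutive values (no padding)."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Made-up per-episode rewards that improve as training progresses.
episode_rewards = [-100.0, -60.0, -40.0, -20.0, -20.0]
smoothed = moving_average(episode_rewards, window=2)
print(smoothed)  # [-80.0, -50.0, -30.0, -20.0]
```

A larger window gives a smoother curve but hides short-lived changes in the agent's performance.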

## Summary

So this is how the agent operates under the SARSA algorithm to maximize reward across states and actions. SARSA is on-policy: it learns entirely from the current learning policy, updating each Q-value with the action the agent actually takes next rather than assuming that the action with the maximum value will be taken.

## Reference

Darshan is a Master's degree holder in Data Science and Machine Learning and an everyday learner of the latest trends in Data Science and Machine Learning. He is always keen to learn new things, implement them, and curate rich content on Data Science, Machine Learning, NLP and AI.
