 # All you need to know about SARSA in Reinforcement Learning

SARSA is one of the reinforcement learning algorithm which learns from the current set os states and actions and learns from the same target policy. Reinforcement learning is one of the methods of training and validating your data under the principle of actions and rewards under the umbrella of reinforcement learning there are various algorithms and SARSA is one such algorithm of Reinforcement Learning which abbreviates for State Action Reward State Action. So in this article let us try to understand the SARSA algorithm of reinforcement learning.

1. The SARSA algorithm
2. How is SARSA different from the Q-learning algorithm?
3. How to Use SARSA Practically?
4. Analyzing States and Rewards of SARSA through plots
5. Summary

## The SARSA algorithm

State Action Reward State Action (SARSA) is one of the algorithms of reinforcement learning which is a modified version of the Q-learning algorithm. The major point that differentiates the SARSA algorithm from the Q-learning algorithm is that it does not maximize the reward for the next stage of action to be performed and updates the Q-value for the corresponding states.

#### THE BELAMY

Among the two learning policies for the agent, SARSA uses the ON-policy learning technique where the agent learns from the current set of actions performed by the agents. There is no maximum operation that is being performed in the SARSA algorithm which makes it independent from the previous learning or greedy learning policy like the Q-learning algorithm.

So now let us understand how SARSA is different from the Q-learning algorithm.

## How to Use SARSA Practically?

To explore SARSA practically let us design a learning policy for the agent to carry out the actions in each state and receive rewards by following the basic principle of operation of SARSA that it does not consider the maximized rewards obtained from before states and actions. So let us explore how to use SARSA practically that can be used to simulate any gaming applications or optimal solutions.

Let us create a simple SARSA environment with the help of a user-defined function with arguments as follows.

• Environment (env): Argument passed for creating an OpenAI environment
• Number of episodes: Number of iterations of agent to maximize reward
• Learning rate (alpha): Learning rate
• Discount factor: Agents choice to maximize reward
• Epsilon: random actions between 0 to 1

So before creating a user-defined function for SARSA let us create an agent using a user-defined function and declare a certain policy for learning from the different states the algorithm iterates.

Let us first install the required libraries and the official Github repository for reinforcement learning.

```!git clone https://github.com/dennybritz/reinforcement-learning/
%matplotlib inline

import gym
import itertools
import matplotlib
import numpy as np
import pandas as pd
import sys
import lib

if "../" not in sys.path:
sys.path.append("../")

from collections import defaultdict
from lib.envs.windy_gridworld import WindyGridworldEnv
from lib import plotting

matplotlib.style.use('ggplot')```

Now let us create an instance of the SARSA environment.

`env=WindyGridworldEnv()`

Now using this SARSA instance let us create a learning policy for the SARSA algorithm.

```def make_epsilon_greedy_policy(Q, epsilon, nA):  ## Creating a learning policy
def policy_fn(observation):
A = np.ones(nA, dtype=float) * epsilon / nA  ## Number of actions performed
best_action = np.argmax(Q[observation])  ## Maximum reward received is retrieved using argamax
A[best_action] += (1.0 - epsilon)  ## The best reward is subtracted from random actions
return A
return policy_fn```

Using the learning policy let us train the SARSA algorithm on different states and actions and to collect rewards and let us use the collected reward to train the agent for the next state and actions using the user-defined function below.

```def sarsa(env, num_episodes, discount_factor=1.0, alpha=0.5, epsilon=0.1):
Q = defaultdict(lambda: np.zeros(env.action_space.n)) ## Actions to be taken up by the agent
stats = plotting.EpisodeStats(episode_lengths=np.zeros(num_episodes),
episode_rewards=np.zeros(num_episodes)) ## Providing the agent states and rewards
policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n) ## providing the agent the learning policy

## Creating various paths for the agent
for i_episode in range(num_episodes):
# Print out which episode we're on, useful for debugging.
if (i_episode + 1) % 100 == 0:
print("\rEpisode {}/{}.".format(i_episode + 1, num_episodes), end="")
sys.stdout.flush()

# Reset the environment and pick the first action
state = env.reset()
action_probs = policy(state)
action = np.random.choice(np.arange(len(action_probs)), p=action_probs)

# One step in the environment
for t in itertools.count():
next_state, reward, done, _ = env.step(action)  ## Taking a step
next_action_probs = policy(next_state) ## Picking the action
next_action = np.random.choice(np.arange(len(next_action_probs)), p=next_action_probs)
stats.episode_rewards[i_episode] += reward ## Collecting the reward received by the agent for the particular state
stats.episode_lengths[i_episode] = t
td_target = reward + discount_factor * Q[next_state][next_action] ## Using discount factor to maximize reward
td_delta = td_target - Q[state][action]
Q[state][action] += alpha * td_delta

if done:
break

action = next_action
state = next_state

return Q, stats```

Now as the agent is monitored for the steps and actions and also the rewards received for its action let us train the agent for the required number of iterations.

`Q,stats = sarsa(env, 500)`

As now the SARSA agent is iterated for the required number of iterations let us use the plotting module that is present in the official GitHub repository of reinforcement learning to validate and visualize various statistical measures of the agent like the time taken in a state to perform certain actions and the award received by the agent over different steps taken to receive the awards and other statistical measures.

## Analyzing States and Rewards of SARSA through plots

So the states or the episodes taken by the agent to learn according to the learning policy used can be visualized using the plotting library of the lib module where in the agent’s time for learning through each state and the time is taken to earn the reward can be visualized.

`plotting.plot_episode_stats(stats)`

Let us try to interpret the above plots one by one.

The first plot is the plot that explains the time consumed by the agent to learn over different states over the period of time and the second plot shows that as gradually the agent learns for different states the time taken by the agent reduces significantly and the third plot shows that the time taken for each step increases with increase in the number of episodes for the agent.

## Summary

So this is how the agent operates in the SARSA algorithm to maximize reward in each set of states and actions where the SARSA algorithm is having the ability to operate without the knowledge of previous states and actions. The SARSA algorithm entirely operates on the current learning policy and does not consider any bias in selecting only the State and Action that yielded the maximum reward to move to the next state.

## More Great AIM Stories

### Visualizing and Comparing ML Models Using LazyPredict Darshan is a Master's degree holder in Data Science and Machine Learning and an everyday learner of the latest trends in Data Science and Machine Learning. He is always interested to learn new things with keen interest and implementing the same and curating rich content for Data Science, Machine Learning,NLP and AI

## Our Upcoming Events

Conference, Virtual
Genpact Analytics Career Day
3rd Sep

Conference, in-person (Bangalore)
Cypher 2022
21-23rd Sep

Conference, in-person (Bangalore)
Machine Learning Developers Summit (MLDS) 2023
19-20th Jan

Conference, in-person (Bangalore)
Data Engineering Summit (DES) 2023
21st Apr, 2023

### Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

### Telegram Channel

Discover special offers, top stories, upcoming events, and more. 