
# All you need to know about SARSA in Reinforcement Learning

SARSA is a reinforcement learning algorithm that learns from the current set of states and actions and updates its value estimates under the same policy it is following.

Reinforcement learning is a method of training models on the principle of actions and rewards. Under the umbrella of reinforcement learning there are various algorithms, and SARSA, which stands for State Action Reward State Action, is one of them. In this article, let us try to understand the SARSA algorithm of reinforcement learning.

1. The SARSA algorithm
2. How to Use SARSA Practically?
3. Analyzing States and Rewards of SARSA through plots
4. Summary

## The SARSA algorithm

State Action Reward State Action (SARSA) is a reinforcement learning algorithm that can be seen as a modified version of the Q-learning algorithm. The major point that differentiates SARSA from Q-learning is that when updating the Q-value for the current state, SARSA does not take the maximum over the possible next actions; it uses the value of the action the agent actually takes next.


Of the two learning approaches for an agent, SARSA uses the on-policy technique, where the agent learns from the actions it actually performs under its current policy. There is no max operation in the SARSA update, which distinguishes it from the greedy, off-policy update used by the Q-learning algorithm.
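The difference between the two update rules can be sketched as follows. This is a minimal illustration with a hand-built, hypothetical Q-table; the `alpha` and `gamma` values are arbitrary, not taken from the article:

```python
import numpy as np

alpha, gamma = 0.5, 0.9  # learning rate and discount factor (arbitrary values)

# Hypothetical Q-table: 2 states x 2 actions
Q = np.array([[1.0, 3.0],
              [2.0, 5.0]])

state, action, reward, next_state = 0, 0, 1.0, 1
next_action = 0  # the action the agent actually takes next (e.g. an exploratory one)

# SARSA (on-policy): bootstrap from the action actually selected by the policy
sarsa_target = reward + gamma * Q[next_state, next_action]        # 1 + 0.9 * 2.0 = 2.8

# Q-learning (off-policy): bootstrap from the greedy (maximizing) action
q_learning_target = reward + gamma * np.max(Q[next_state])        # 1 + 0.9 * 5.0 = 5.5

sarsa_update = Q[state, action] + alpha * (sarsa_target - Q[state, action])
q_learning_update = Q[state, action] + alpha * (q_learning_target - Q[state, action])
```

Because the agent's next action here is exploratory rather than greedy, the two rules produce different targets (2.8 vs 5.5) and hence different updated Q-values.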

Having seen how SARSA differs from the Q-learning algorithm, let us now look at how to use it practically.

## How to Use SARSA Practically?

To explore SARSA practically, let us design a learning policy for the agent to carry out actions in each state and receive rewards, following SARSA's basic principle of operation: the update does not take the maximum over rewards obtainable from the next state's actions. This setup can be used to simulate games or other sequential decision problems.

Let us create a simple SARSA environment with the help of a user-defined function with the following arguments.

• Environment (env): the OpenAI Gym environment the agent interacts with
• Number of episodes (num_episodes): the number of episodes the agent runs to maximize its reward
• Learning rate (alpha): the step size of each Q-value update
• Discount factor (discount_factor): how strongly the agent weighs future rewards
• Epsilon (epsilon): the probability, between 0 and 1, of taking a random (exploratory) action
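Put together, these arguments suggest a function signature along the following lines (the default values shown are the ones used in the full implementation later in this article):

```python
def sarsa(env, num_episodes, discount_factor=1.0, alpha=0.5, epsilon=0.1):
    """Run SARSA on `env` for `num_episodes` episodes, returning the
    learned Q-values and per-episode statistics (implemented in full below)."""
    ...
```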

Before creating the user-defined function for SARSA itself, let us build the agent's learning policy through another user-defined function, which the algorithm will follow as it iterates over states.

Let us first import the required libraries and clone the official GitHub repository for reinforcement learning.

```
!git clone https://github.com/dennybritz/reinforcement-learning/
%matplotlib inline

import gym
import itertools
import sys
from collections import defaultdict

import matplotlib
import numpy as np
import pandas as pd

# Make the cloned repository's lib package importable
# (adjust the path if the notebook is not run from the directory containing the clone)
if "reinforcement-learning" not in sys.path:
    sys.path.append("reinforcement-learning")

from lib.envs.windy_gridworld import WindyGridworldEnv
from lib import plotting

matplotlib.style.use('ggplot')
```

Now let us create an instance of the Windy Gridworld environment on which SARSA will run.

`env = WindyGridworldEnv()`

Now, using this environment instance, let us create a learning policy for the SARSA algorithm.

```
def make_epsilon_greedy_policy(Q, epsilon, nA):  ## Creating a learning policy
    def policy_fn(observation):
        # Give every action a base exploration probability of epsilon / nA
        A = np.ones(nA, dtype=float) * epsilon / nA
        # Retrieve the action with the highest Q-value using argmax
        best_action = np.argmax(Q[observation])
        # Assign the remaining (1 - epsilon) probability to the best action
        A[best_action] += (1.0 - epsilon)
        return A
    return policy_fn
```
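As a quick sanity check, the returned policy assigns probability `epsilon / nA` to each non-greedy action and `1 - epsilon + epsilon / nA` to the greedy one. Here is a minimal sketch using a hand-built Q-table (the function is repeated so the snippet is self-contained):

```python
import numpy as np
from collections import defaultdict

def make_epsilon_greedy_policy(Q, epsilon, nA):
    def policy_fn(observation):
        A = np.ones(nA, dtype=float) * epsilon / nA
        best_action = np.argmax(Q[observation])
        A[best_action] += (1.0 - epsilon)
        return A
    return policy_fn

Q = defaultdict(lambda: np.zeros(4))
Q["s0"] = np.array([0.0, 1.0, 0.0, 0.0])  # action 1 is the greedy choice

policy = make_epsilon_greedy_policy(Q, epsilon=0.1, nA=4)
probs = policy("s0")
# probs is [0.025, 0.925, 0.025, 0.025]; the probabilities sum to 1
```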

Using this learning policy, let us train the SARSA algorithm over states and actions, collect the rewards, and use the collected rewards to update the agent for the next state and action via the user-defined function below.

```
def sarsa(env, num_episodes, discount_factor=1.0, alpha=0.5, epsilon=0.1):
    # Q maps each state to an array of action values
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # Per-episode lengths and rewards, tracked for later plotting
    stats = plotting.EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes))
    # The epsilon-greedy learning policy the agent follows
    policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)

    for i_episode in range(num_episodes):
        # Print out which episode we're on, useful for debugging.
        if (i_episode + 1) % 100 == 0:
            print("\rEpisode {}/{}.".format(i_episode + 1, num_episodes), end="")
            sys.stdout.flush()

        # Reset the environment and pick the first action
        state = env.reset()
        action_probs = policy(state)
        action = np.random.choice(np.arange(len(action_probs)), p=action_probs)

        # One step in the environment
        for t in itertools.count():
            next_state, reward, done, _ = env.step(action)  ## Taking a step
            next_action_probs = policy(next_state)  ## Picking the next action
            next_action = np.random.choice(np.arange(len(next_action_probs)),
                                           p=next_action_probs)
            ## Collecting the reward received by the agent and the episode length
            stats.episode_rewards[i_episode] += reward
            stats.episode_lengths[i_episode] = t
            ## TD update: bootstrap from the action actually taken next
            td_target = reward + discount_factor * Q[next_state][next_action]
            td_delta = td_target - Q[state][action]
            Q[state][action] += alpha * td_delta

            if done:
                break

            action = next_action
            state = next_state

    return Q, stats
```

Now that the agent's steps, actions, and rewards are being tracked, let us train it for the required number of episodes.

`Q, stats = sarsa(env, 500)`

Now that the SARSA agent has run for the required number of episodes, let us use the plotting module from the repository cloned above to validate and visualize the agent's statistics, such as episode lengths, the rewards received over time, and the steps taken to earn those rewards.

## Analyzing States and Rewards of SARSA through plots

The episodes the agent takes to learn under the chosen policy can be visualized using the plotting library of the lib module, which shows how long the agent spends in each episode and how the rewards it earns evolve over time.

`plotting.plot_episode_stats(stats)`

Let us try to interpret the above plots one by one.

The first plot shows the episode length over time: as the agent learns, episodes become shorter because the agent reaches the goal in fewer steps. The second plot shows the (smoothed) reward per episode, which rises as the agent learns. The third plot shows episodes against cumulative time steps: its slope grows steeper over training, indicating that later episodes each consume fewer time steps.

## Summary

This is how the agent operates in the SARSA algorithm to maximize its reward across states and actions. SARSA is on-policy: it updates its Q-values entirely from the current learning policy, using the action the agent actually chooses next, rather than being biased toward the state-action pair that would yield the maximum reward, as Q-learning is.
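To make the summary concrete, here is a minimal, self-contained SARSA loop on a hypothetical five-cell corridor (state 4 is the goal, so moving right is optimal). All names, the environment, and the reward values here are illustrative, not taken from the article's Windy Gridworld setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.1

def step(state, action):
    """Deterministic corridor: reward -1 per move, episode ends at state 4."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    return next_state, -1.0, next_state == n_states - 1

def choose(state):
    """Epsilon-greedy action selection over the current Q-table."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

for _ in range(200):                   # episodes
    state, action = 0, choose(0)
    done = False
    while not done:
        next_state, reward, done = step(state, action)
        next_action = choose(next_state)
        # SARSA update: bootstrap from the action actually taken next
        Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action]
                                     - Q[state, action])
        state, action = next_state, next_action

# Expected: the learned policy prefers "right" (action 1) in states 0-3
print(np.argmax(Q[:4], axis=1))
```

Because moving left only delays the goal while costing -1 per step, the right-moving action accumulates the higher Q-value in every non-terminal state.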

Darshan holds a Master's degree in Data Science and Machine Learning and is an everyday learner of the latest trends in the field. He is keenly interested in learning and implementing new things, and in curating rich content for Data Science, Machine Learning, NLP, and AI.
