SARSA: A Guide to the On-Policy Reinforcement Learning Algorithm

On Policy Reinforcement Learning SARSA

Reinforcement Learning is a fascinating machine learning technique, which differs from the other techniques of ML. In reinforcement learning, the actor(Agent) has to continuously interact with the environment to make decisions, which results in some sort of feedback. In a way, the agent is self-taught and learns about the environment by exploring.

But to explore the environment, the agent has to follow a strategy(or not), that guides the agent to make optimal decisions and maximize the cumulative reward. Such algorithms are called on-policy algorithms.

SARSA(State-Action-Reward-State-Action) is an on-policy algorithm that works iteratively, to help the agent find the optimal path and maximize the rewards.

In practical scenarios, SARSA is applied in areas like autonomous navigation, where it aids in real-time decision-making, and in-game AI, where it helps in developing more adaptive and challenging gameplay.

Also check: A Beginner’s Guide to Reinforcement Learning

Introduction to SARSA

SARSA and Q Learning are distinct reinforcement learning algorithms, each with a unique approach to handling state-action pairs. SARSA is an on-policy technique that works iteratively with the initial policy, refining it based on the rewards obtained. The abbreviation SARSA stands for State-Action-Reward-State-Action.

In an on-policy algorithm, the behavior policy is the same as the target policy. Behavior policy refers to the policy the agent follows to explore the environment. Target policy refers to the policy the agent follows to reach the target.

SARSA is a model-free learning algorithm, which means it interacts with the environment but doesn’t require knowledge about the transitions in the environment to make decisions. The SARSA learning algorithm is iterative, as it continues the process till the agent has explored the optimal states or has reached the goal state.

It keeps track of the current state and actions, and the expected reward that the agent receives, also keeping track of the next possible state the agent has to explore. Based on the reward obtained, the algorithm keeps updating the policy.

SARSA Algorithm

To understand the algorithm, we first need to familiarize ourselves with a few important terms.

  • Q Values: The estimated cumulative reward awarded to the agent for acting in a particular state
  • Q Table: The Q Table keeps track of the Q Values of the states and actions to take the next optimal decision
  • Exploration-Exploitation dilemma: During training, the agent experiences a dilemma whether to explore new states to gain new knowledge(exploration) or to use the existing knowledge about the environment(exploitation)

The algorithm sets the values in the Q table to arbitrary values(zero) before initialization and uses an epsilon greedy approach to select an initial state and the corresponding action.

SARSA Algorithm
SARSA Algorithm

The Q-values are initialized arbitrarily in the initialization of the algorithm. Then a state is chosen, and the action a is selected from the state as specified by the policy. After the action is taken, reward r is obtained by the agent. Next, the agent has to choose the next action a’ from the next state s’. The learning rule is used to update the Q values.

To balance the exploration of new states and the exploitation of existing states, SARSA uses a strategy called the ε-greedy strategy. The ε-greedy strategy balances exploration, where the agent selects a random action with probability ε, and exploitation, choosing the best-known action with probability 1-ε. And with a probability of 1-ε, it exploits the known states. Given the Q values of all the actions in a state, the agent selects the action with the highest Q value.

SARSA Learning Rule

Let us take a look at the learning/update rule with which the algorithm moves forward and also its components.

The learning rule is given below.

Learning rule
Learning rule

The SARSA learning rule, Q(St, At) = Q(St, At) + α[R + γQ(St+1, At+1) – Q(St, At)], updates the Q-value of the current state-action pair by considering the immediate reward plus the discounted Q-value of the next state-action pair. For instance, if an agent moves to a new state with a reward of 5 and an estimated future reward of 10, the Q-value is adjusted accordingly.

Temporal difference learning gives a relation between the current estimate value function and the next optimal value function. The terms of the learning rule are explained below.

  • Q(St, At): The Q Value of the present state-action pair
  • Alpha(α): The learning rate of the algorithm
  • R: The reward expected at the state
  • γ: The discount factor which is used to provide weightage to the future rewards

Challenges of SARSA

In addition to the exploration-exploitation dilemma, SARSA also faces a few other challenges:

Converging to Suboptimal policies: The algorithm may converge to suboptimal policies after a certain point, and might get stuck in local minima.

Limited Memory: This algorithm faces memory issues when interacting with an environment that has large state spaces.

SARSA’s on-policy nature allows for more stable convergence properties, particularly in avoiding local minima, by continuously updating its policy based on both current and next actions.

Delayed rewards: It may experience delayed rewards as it prioritizes short-term gains.


SARSA is almost similar to Q Learning, with slight differences in how the algorithm deals with future values. It is a model-free learning just like Q Learning, but is an on-policy algorithm. In this post, we have discussed its prerequisites, the working of the algorithm, and the learning rule it uses to update the policy. In addition to that, we have also discussed the challenges related to this algorithm