Q Learning vs SARSA: Key Differences in Reinforcement Learning

Reinforcement Learning is a branch of machine learning in which an agent learns to make decisions by trial and error, taking actions so as to maximize the cumulative reward it collects on the way to a goal state.

Reinforcement Learning is used in many real-world applications such as gaming, recommendation systems, and autonomous vehicles.

The agent is expected to explore its environment and perform an action in each state, receiving a reward as feedback.

Reinforcement learning is a vast topic and can be difficult to grasp. Two of its widely used algorithms are Q Learning and SARSA, which help the agent learn to make optimal decisions.

We have already discussed the concepts of reinforcement learning, Q Learning, and SARSA in the previous posts. The objective of this article is to compare the two algorithms, in terms of similarities and differences between them.

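Q Learning and SARSA are both model-free, temporal difference learning algorithms in reinforcement learning. They differ in how their updates relate to the policy being followed: Q Learning is off-policy and updates toward the greedy (maximum-value) action in the next state, which tends to speed up convergence to the optimal policy, whereas SARSA is on-policy and updates toward the action its current policy actually takes, so it learns more gradually.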

Similarities Between Q Learning and SARSA

Let us discuss the similarities between Q Learning and SARSA one by one.

Model-Free Learning

Both Q Learning and SARSA are model-free learning algorithms, which means that they do not need to know the transitions or the dynamics of the environment they are in. They learn by interacting with the environment directly. The problem can still be described as a Markov decision process (MDP), but the algorithm never needs its transition probabilities or reward function explicitly; it only needs the states, actions, and rewards it actually experiences.

Temporal Difference Learning

Q Learning and SARSA are both temporal difference (TD) learning algorithms. TD learning resembles dynamic programming in that it bootstraps: the current estimate is updated from an estimate of what follows. Both Q Learning and SARSA update their Q values from the observed reward plus the estimated Q value of the next state. In other words, predictions are updated based on the difference between consecutive estimates, rather than by waiting for the final outcome of an episode.
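To make the idea of bootstrapping concrete, here is a minimal sketch of a single TD update on state values. The function name `td0_update`, the two states, and the numbers are made up for illustration and are not taken from the article.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to a table of state-value estimates.

    alpha is the learning rate and gamma the discount factor
    (both illustrative defaults, not values from the article).
    """
    # Bootstrapped target: observed reward plus the discounted
    # current estimate of the next state.
    target = reward + gamma * values[next_state]
    # TD error: the difference between consecutive estimates.
    td_error = target - values[state]
    values[state] += alpha * td_error
    return td_error


# Hypothetical two-state example.
values = {"A": 0.0, "B": 1.0}
error = td0_update(values, "A", reward=0.5, next_state="B")
# error is about 1.4 and values["A"] moves from 0.0 to about 0.14
```

The estimate for state A changes after a single step, without waiting for the episode to finish.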

Bellman Equation

The Bellman equation plays an important role in how Q Learning and SARSA update their Q values. The learning rules of both algorithms are derived from the Bellman equation, which is given by:

[Figure: Bellman equation]
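The original figure is not reproduced in this text. In standard notation, which is an assumption about what the figure showed, the Bellman optimality equation for action values can be written as follows, where r is the immediate reward, γ the discount factor, and s’ the next state:

```latex
Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \,\right]
```

The term inside the expectation has the form "immediate reward plus value of the next step", and both learning rules reuse that form.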

The learning rules of Q Learning and SARSA are shown below:

[Figure: Update rules of Q Learning and SARSA]
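Since the figure is not available here, the two update rules are written out below in their commonly used form. The symbols α (learning rate) and γ (discount factor) are standard notation and may differ from the notation in the original figure.

```latex
\text{Q Learning:}\quad Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]

\text{SARSA:}\quad Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma\, Q(s', a') - Q(s, a) \,\right]
```

In the SARSA rule, a’ is the action the agent actually selects in s’. In the Q Learning rule, a’ ranges over all actions available in s’.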

How do the two algorithms resemble the Bellman equation? If we observe the learning rules, we find the terms max Q(s’,a’) in Q Learning and Q(s’,a’) in SARSA, each added to the immediate reward, which resembles the term r(n)+f(n+1) of the Bellman equation.

Goal

Both Q Learning and SARSA work towards the same goal: maximizing rewards by following an optimal policy.

Exploration-Exploitation Trade-off

Both Q Learning and SARSA address the exploration-exploitation trade-off: the dilemma the agent faces between exploring new states and actions to gain knowledge and exploiting what it already knows.

ε-Greedy Strategy

To handle the exploration-exploitation dilemma, these algorithms commonly use an ε-greedy strategy. With probability ε the agent explores by picking a random action, and with probability 1-ε it exploits by picking the action with the highest known Q value.
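The ε-greedy choice can be sketched in a few lines. This is a minimal illustration, assuming the Q values of one state are held in a plain list indexed by action; the function name `epsilon_greedy` and the sample values are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given the Q values of a single state."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions
        # (this may also happen to pick the greedy one).
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest current estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q values for three actions; action 1 looks best.
rng = random.Random(0)
choices = [epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_share = choices.count(1) / len(choices)
```

With ε = 0.1 the greedy action is picked in the large majority of draws, while the other actions still get tried occasionally.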

Distinguishing Features: Q Learning vs SARSA

Let us take a look at the key differences between Q Learning and SARSA.

Off-Policy vs On-Policy

The main difference between Q Learning and SARSA is how their updates relate to the policy the agent follows. While Q Learning is off-policy, the SARSA learning algorithm is on-policy. Off-policy learning means that the policy being learned (the greedy policy, in Q Learning) differs from the policy used to choose actions while learning (for example, ε-greedy). On-policy means that the algorithm evaluates and improves the same policy it uses to choose its actions, exploratory steps included.

Update/Learning Rule

Q Learning: The Q Learning algorithm updates the Q value of the current state-action pair based on the maximum Q value of the next state. The update is greedy with respect to the next state, regardless of which action the agent actually takes there.

SARSA: Contrary to Q Learning, the SARSA algorithm updates the Q values based on the action actually taken in the next state. It does not look for the action that maximizes the value; it stays on-policy and uses the action that the current policy selects in the next state, whether that action is greedy or exploratory.
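The difference between the two rules comes down to one term in the target. The sketch below assumes a Q table stored as a list of lists indexed as Q[state][action]; the function names, the default α and γ, and the table values are illustrative only.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    # Off-policy target: the best estimated action in the next state,
    # whether or not the agent will actually take it.
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    # On-policy target: the action the agent actually selected
    # for the next state (a_next), greedy or not.
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


# Hypothetical table: two states, two actions.
Q_ql = [[0.0, 0.0], [1.0, 3.0]]
Q_sa = [[0.0, 0.0], [1.0, 3.0]]

# Same transition for both, but SARSA's next action is the non-greedy action 0.
q_learning_update(Q_ql, s=0, a=1, r=1.0, s_next=1)
sarsa_update(Q_sa, s=0, a=1, r=1.0, s_next=1, a_next=0)
# Q_ql[0][1] is about 1.85, Q_sa[0][1] is about 0.95
```

Starting from the same table and the same transition, the two rules produce different estimates only because the next action used in the target differs.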

Back-up Diagrams

Let us take a look at the backup diagrams of Q Learning and SARSA to understand the learning rule better.

As the backup diagrams show, Q Learning backs up from the maximum-valued action in the next state, whereas SARSA backs up from the action actually chosen by the policy in the next state, which need not be the greedy one.

Converging to Optimal Policy

Given their on-policy and off-policy commitments, these algorithms also differ in how they converge to an optimal policy. Since Q Learning is off-policy, its estimates move toward the optimal action values even while the agent is exploring, so it tends to reach an optimal policy more quickly than SARSA. SARSA, by contrast, learns the value of the ε-greedy policy it actually follows, so it approaches the optimal policy more slowly, and fully only as exploration is reduced.

Usage of ε-greedy Strategy

O’Reilly has explained the main difference between the two with a very useful flow chart of the algorithms.

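The flow chart makes the order of operations clear. In Q Learning, the agent chooses an action with the ε-greedy policy, observes the reward and the next state, and updates the Q value right away using the greedy (maximum) value of the next state; the next action is chosen only afterwards. In SARSA, the next action is first chosen with the same ε-greedy policy, and only then is the Q value updated using that action.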
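The ordering can be seen side by side in a small training loop. This is a sketch under stated assumptions, not the flow chart's exact procedure: the five-state corridor environment, the reward of -1 per move, the hyperparameters, and all function names are hypothetical choices made for illustration.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical corridor: states 0..4, goal at the right end
LEFT, RIGHT = 0, 1


def step(state, action):
    """Toy dynamics: every move costs -1, and the episode ends at GOAL."""
    next_state = min(state + 1, GOAL) if action == RIGHT else max(state - 1, 0)
    return next_state, -1.0, next_state == GOAL


def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(2)                               # explore
    return max((LEFT, RIGHT), key=lambda a: Q[state][a])      # exploit


def train(algorithm, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        action = epsilon_greedy(Q, state, epsilon, rng)
        while not done:
            next_state, reward, done = step(state, action)
            if algorithm == "sarsa":
                # SARSA: choose the next action first, then update with it.
                next_action = epsilon_greedy(Q, next_state, epsilon, rng)
                bootstrap = 0.0 if done else Q[next_state][next_action]
                Q[state][action] += alpha * (reward + gamma * bootstrap - Q[state][action])
            else:
                # Q Learning: update with the greedy value, then choose the next action.
                bootstrap = 0.0 if done else max(Q[next_state])
                Q[state][action] += alpha * (reward + gamma * bootstrap - Q[state][action])
                next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            state, action = next_state, next_action
    return Q


def greedy_policy(Q):
    """Greedy action for every non-terminal state."""
    return [max((LEFT, RIGHT), key=lambda a: Q[s][a]) for s in range(GOAL)]


Q_q_learning = train("q_learning")
Q_sarsa = train("sarsa")
```

Both branches select actions with the same ε-greedy policy. The only differences are the value used to bootstrap and whether the next action is chosen before or after the update.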

Expected SARSA – An Extension

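Expected SARSA can be seen as an extension of both algorithms, as it can be used either on-policy or off-policy depending on the requirements. Instead of the maximum or a single sampled next action, its update uses the expected Q value of the next state under the policy.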

The learning rule is given below.
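The original figure is not available in this text, so the rule is written out here in its commonly used form. The notation π(a’|s’) for the probability that the policy picks action a’ in state s’, along with α and γ, is standard notation assumed for this sketch.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \sum_{a'} \pi(a' \mid s')\, Q(s', a') - Q(s, a) \,\right]
```

If π is the greedy policy, the sum reduces to the maximum over a’ and the rule coincides with Q Learning.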

Summary

We have discussed the similarities and differences between the two algorithms in this tutorial.

Even though Q Learning and SARSA are closely related through their model-free approach, their TD learning technique, and their resemblance to the Bellman equation, a few properties set them apart. Q Learning is an off-policy algorithm that updates toward the greedy action, so it tends to converge to an optimal policy faster than SARSA, which is on-policy and updates toward the action it actually takes.
