Introduction to Q-Learning: A Comprehensive Guide

Reinforcement Learning is a machine learning technique in which an agent learns to make decisions by performing actions in an environment and observing the results. For every action it performs, it receives a reward or a penalty. The agent’s goal is to maximize the total reward it collects by making optimal decisions.

Q Learning is a reinforcement learning algorithm that learns, for each state, which action is expected to yield the highest cumulative reward, and uses those estimates to choose the agent’s next action.

In this post, we are going to talk about the concepts of Q Learning.

Pre-requisites to Q Learning

Before we get to the Q-Learning part, we would have to understand a few important terms.

  • Policy: Remember how we discussed that the agent has to explore the environment, perform actions, and collect as much reward as possible? For this whole process, the agent needs a strategy that tells it which action to take in each state. This strategy is called a policy
  • Value Function: The value function represents the cumulative reward the agent can expect from being in a state, or from performing a certain action in a certain state. There are two types of value functions – State Value Function and Action Value Function
  • Exploration-Exploitation: A crucial issue in decision-making problems is the exploration-exploitation dilemma. At some point, the agent has to decide whether to explore new states or exploit the information it already has about the states it has visited
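
For readers who prefer notation, the two value functions are commonly written as below. This is the standard textbook form rather than something taken from this post: π denotes the policy, γ is a discount factor between 0 and 1 that weights future rewards, and r with a subscript is the reward received at that time step.

```latex
% State value: expected discounted reward when starting in state s and following policy \pi
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right]

% Action value: the same quantity, but with the first action fixed to a
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \right]
```

The action value function is the one Q Learning estimates, which is where the "Q" in Q-value comes from.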

What is Q Learning?

In Q-learning, the Q stands for quality: the algorithm focuses on the quality of the actions performed, that is, how useful an action is for gaining future reward, rather than on how many actions have been performed. It determines the next best action the agent has to perform, based on the current state. This process is repeated until the agent reaches its goal state.

Let us imagine an environment in which the agent has to perform certain actions to reach a goal state.

Figure: Q Learning Maze

Imagine a robot that is currently at (5,1). It needs to get the diamond in the cell (1,5). Along the path, there are some obstacles (thorns), safe bushes, and empty cells. For every cell the robot moves into, it gets feedback. If it enters a cell with a bush, it is rewarded with +1; if it enters a cell with thorns, it is penalized (receives a value of -1); and if it enters an empty cell, it receives a reward of 0. The agent’s goal is to reach the goal state by taking optimal actions.
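
The maze described above can be sketched in a few lines of Python. Only the start cell (5,1), the goal cell (1,5), and the reward values come from the text; the 5x5 grid size, the (row, column) reading of the coordinates, the bush and thorn positions, and the names `reward` and `step` are assumptions made for illustration.

```python
START = (5, 1)  # from the text
GOAL = (1, 5)   # from the text: the cell holding the diamond

# Assumed positions; the original figure is not reproduced here.
BUSHES = {(4, 2), (2, 4)}   # entering gives +1
THORNS = {(3, 3), (2, 2)}   # entering gives -1

# (row, column) offsets for the four moves; rows and columns run from 1 to 5.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def reward(cell):
    """Feedback for entering a cell, following the rules in the text."""
    if cell in BUSHES:
        return 1
    if cell in THORNS:
        return -1
    return 0  # empty cell; the text gives no separate reward for the diamond


def step(state, action):
    """Return (next_state, reward, done) for one move of the robot."""
    d_row, d_col = MOVES[action]
    row, col = state[0] + d_row, state[1] + d_col
    if not (1 <= row <= 5 and 1 <= col <= 5):
        # Assumption: a move off the grid leaves the robot in place, reward 0.
        return state, 0, False
    next_state = (row, col)
    return next_state, reward(next_state), next_state == GOAL
```

For example, `step(START, "up")` moves the robot to (4,1), an empty cell under this assumed layout.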

Q Learning helps the agent to maximize the cumulative reward, by keeping track of the previous states and actions of the agent and the corresponding rewards.

How does Q Learning keep tabs on the current state of the agent, the states it can reach, and the actions it can take?

The Magic of Q-Table

The answer is the Q-Table. The Q-Table is essentially a table that stores a value (the Q-value) for each state-action pair. The rows of this table represent states, while the columns represent actions. Each cell contains the Q-value for one state-action pair.

If we take the example of the robot (discussed in the previous section), it can perform four actions – move left, right, up, or down. These actions form the columns.

Initially, before the agent starts to explore the environment, the Q-values are set to zeros, and iteratively these values are updated based on the actions taken by the agent.
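
As a minimal sketch, a zero-initialized Q-Table for the robot example could look like this. The 5x5 grid size is an assumption inferred from the coordinates (5,1) and (1,5), and a nested dictionary is just one convenient representation; an array with one row per state and one column per action would work equally well.

```python
ACTIONS = ["left", "right", "up", "down"]

# One entry (row of the table) per cell of the assumed 5x5 grid,
# one Q-value (column of the table) per action, all starting at zero.
q_table = {
    (row, col): {action: 0.0 for action in ACTIONS}
    for row in range(1, 6)
    for col in range(1, 6)
}

print(len(q_table))     # 25 states
print(q_table[(5, 1)])  # the four Q-values for the start cell, all 0.0
```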

How are the values updated?

Learning Rule

We can say that Q Learning works on a State-Action-Reward-State sequence: the agent is in a state, takes an action, receives a reward, and lands in the next state. The Q Learning rule, or update rule, represents the relationship between the Q-value of the current state-action pair and the Q-values of the possible next states.

The Q Learning update rule is based on the Bellman Equation, which is a fundamental concept in dynamic programming. It expresses the value of a decision as the immediate reward plus the value of the best decisions that can follow it.

The learning rule is given below.

Figure: Learning rule
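
The figure itself is not reproduced here. In its standard textbook form, which is assumed to match what the figure showed, the Q Learning update reads as follows. The symbols α (the learning rate, between 0 and 1) and γ (the discount factor, between 0 and 1) follow the usual convention, and r is the reward received after taking action a_t in state s_t.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```

A small α makes the agent change its estimates slowly, while a γ close to 1 makes it care more about rewards that arrive later.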

The Q-value of the current state-action pair is moved toward the reward just received plus the discounted estimate of the best future value.

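Q Learning also uses temporal difference (TD) learning. If you look at the learning rule, the term r + γ · max_a Q(s_{t+1}, a) is the temporal difference target: the reward just received plus the discounted estimate of the optimal future value. The gap between this target and the current estimate Q(s_t, a_t) is the temporal difference error, and each update moves the current estimate a small step toward the target.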
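
A single update step can be sketched in Python as below. The function name `q_update`, the nested-dictionary Q-table, and the default values for the learning rate `alpha` and discount factor `gamma` are illustrative assumptions, not values from this post. The code computes the target (reward plus discounted best next Q-value), the error (target minus current estimate), and then nudges the current estimate toward the target.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q Learning update in place and return the TD error."""
    # Best Q-value available from the next state, whatever action is taken there.
    best_next = max(q_table[next_state].values())
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state][action]
    # Move the current estimate a fraction alpha of the way toward the target.
    q_table[state][action] += alpha * td_error
    return td_error


# Tiny two-state example with made-up numbers.
q = {
    "A": {"left": 0.0, "right": 0.0},
    "B": {"left": 0.0, "right": 2.0},
}
error = q_update(q, "A", "right", reward=1, next_state="B", alpha=0.5, gamma=0.5)
print(error)            # 2.0  (target 1 + 0.5 * 2.0 = 2.0, current estimate 0.0)
print(q["A"]["right"])  # 1.0  (0.0 + 0.5 * 2.0)
```

Note that the update looks only at the best Q-value in the next state, not at the action the agent will actually take there.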

Characteristics of Q Learning

There are a few characteristics that make this algorithm stand out.

  • Model-free algorithm: Q Learning is a model-free algorithm because it doesn’t need to know the dynamics of the environment (basically, it doesn’t need to know the transition probabilities or the reward function of the underlying Markov Decision Process). It learns directly from the rewards it observes
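  • Off-policy: Q Learning learns the value of the optimal (greedy) policy regardless of the policy the agent actually follows while collecting experience. During training the agent may follow an exploratory policy, or even act randomly, and the update rule still uses the best available action in the next state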

Challenges of Q Learning

  • Exploration vs Exploitation Tradeoff: The agent faces a dilemma between exploring new states to gather new information and exploiting the knowledge it already has about previous states to gain maximum rewards. One way to balance the trade-off is to use an Epsilon-greedy strategy
  • High Dimensional states: Sometimes the environment that the agent is exploring has a high-dimensional state space. The Q-Table needs one row per state, so it can become too large to store, and exploring every state becomes time-consuming and resource-intensive
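
The Epsilon-greedy strategy mentioned in the first challenge can be sketched as follows: with probability epsilon the agent picks a random action (exploration), and otherwise it picks the action with the highest Q-value in the current state (exploitation). The function name `epsilon_greedy` and the example Q-values are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Choose an action from a dict mapping action -> Q-value."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))    # explore: any action, uniformly
    return max(q_values, key=q_values.get)   # exploit: best-known action


q_values = {"left": 0.1, "right": 0.7, "up": 0.0, "down": -0.3}
print(epsilon_greedy(q_values, epsilon=0.0))  # always "right" when epsilon is 0
print(epsilon_greedy(q_values, epsilon=0.2))  # usually "right", sometimes random
```

A common design choice is to start with a large epsilon and reduce it over time, so the agent explores a lot early on and exploits more as its Q-values become reliable.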

Applications of Q Learning

Q Learning is used in areas such as:

  • Air traffic control
  • News Recommendation systems
  • Marketing and Management
  • Gaming


In this short introduction to Q Learning, we have discussed the algorithm, its learning rule, and the essential Q-Table. We also discussed the characteristics, applications, and challenges related to Q Learning.