Teaching Series #7 - Q-Learning

"...Sometimes the second-best immediate action leads to the long-term best outcome..."

In the previous lesson, we covered the Markov Decision Process (MDP). We talked, from scratch, about what it is and why it matters, and how, in a nutshell, it leads us to a policy: a strategy for taking specific actions in an environment based on states, state transition probabilities, rewards associated with actions, and a discount factor that helps us make a trade-off between short-term satisfaction and long-term happiness 🙂 

In any MDP, the agent isn't just trying to get a high reward now. It's trying to maximise the total reward it expects to receive over its entire future, taking into account the discount factor. This total future reward is often called the cumulative reward or return.

The discount factor (γ) comes into play here because future rewards are generally worth a bit less than immediate ones. After all, before you can take future actions and steps, you need to take immediate ones. So a trade-off is required, in the form of a factor that controls how "brave" you are in taking immediate actions, even if they are not the best, versus how much you trust the future 🙂 
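
To make that concrete: if we write r₁, r₂, r₃, … for the rewards the agent will collect from now on (this subscript notation is just for illustration), the discounted return it is trying to maximise is

G = r₁ + γ·r₂ + γ²·r₃ + γ³·r₄ + …

With γ close to 1 the agent is patient and values far-future rewards almost as much as immediate ones; with γ close to 0 it cares almost only about the very next reward.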

The challenge, and the central problem in solving an MDP, is to find that optimal policy (π∗) – the one strategy that, if followed consistently, will lead to the greatest expected cumulative discounted reward from any starting state.

To find that optimal policy that maximises cumulative reward, we need a way to measure how "good" a particular state is, or how "good" it is to take a particular action in a particular state. This is exactly what Value Functions do!

There are two main types of value functions in MDPs:

State-Value Function (Vπ(s)):

  • This function tells us how good it is to be in a particular state s, assuming the agent follows a specific policy π from that state onwards.

  • More formally, Vπ(s) is the expected cumulative discounted reward an agent can expect to receive starting from state s and following policy π.
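
Written out with the discounted return from earlier, this is

Vπ(s) = Eπ[ r₁ + γ·r₂ + γ²·r₃ + … ∣ start in state s ]

where the subscript π means the actions along the way are chosen by the policy π, and the expectation averages over the randomness in both the policy and the environment's transitions.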

Action-Value Function (Qπ(s,a)):

  • This function tells us how good it is to take a particular action a when in a particular state s, again assuming the agent follows a specific policy π after taking that initial action.

  • More formally, Qπ(s,a) is the expected cumulative discounted reward an agent can expect to receive if it starts in state s, takes action a, and then follows policy π afterwards.
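
In the same notation:

Qπ(s,a) = Eπ[ r₁ + γ·r₂ + γ²·r₃ + … ∣ start in state s, take action a first ]

The only difference from Vπ(s) is that the very first action is fixed to a instead of being chosen by the policy.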

The ultimate goal in MDPs is to find the optimal value functions, V∗(s) and Q∗(s,a). These represent the maximum possible expected cumulative rewards when following the optimal policy (π∗). If you know Q∗(s,a), finding the optimal policy is easy: in any state s, you just pick the action a that gives you the highest Q∗(s,a) value!

Why, in that case, do we need the state-value function V∗(s) for finding the optimal policy?

Even though V∗(s) doesn't directly tell you which action to take, it is still crucial. In algorithms for solving MDPs, we often first try to find the optimal state-value function, V∗(s). This function tells us the maximum expected future reward achievable from each state, assuming we act optimally.
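
In symbols, this just restates the sentence above: V∗(s) = max_π Vπ(s), the largest value any policy can achieve starting from state s.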

Once you have V∗(s), you can then use it to derive Q∗(s,a)!

Q∗(s,a) can be thought of as: the immediate reward you get for taking action a in state s, PLUS the discounted maximum future reward you can get from the next state s′ that you transition to, assuming you act optimally from s′ onwards.

So, in other words, if you know V∗(s′), you can calculate Q∗(s,a) for all actions. Once you have all the Q∗(s,a) values for a given state s, you simply pick the action that leads to the highest Q∗(s,a) to define your optimal policy.
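
As an equation, using the transition probabilities P(s′∣s,a) and rewards R(s,a,s′) from the MDP lesson:

Q∗(s,a) = Σ_s′ P(s′∣s,a) · [ R(s,a,s′) + γ·V∗(s′) ]

The sum simply averages over which next state s′ the environment actually sends you to.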

Bellman Equation

The Bellman Equation, named after Richard Bellman, is a fundamental concept in dynamic programming. It recursively defines the value functions. In essence, it states that the value of a state (or state-action pair) can be expressed in terms of the immediate reward and the values of the next possible states.
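
Concretely, for the state-value function under a policy π (writing π(a∣s) for the probability that the policy picks action a in state s), it reads:

Vπ(s) = Σ_a π(a∣s) · Σ_s′ P(s′∣s,a) · [ R(s,a,s′) + γ·Vπ(s′) ]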

So, the Bellman Equation for Vπ(s) is essentially saying: "The value of state s is the expected sum of (immediate reward + discounted value of the next state), averaged over all actions specified by the policy and all possible next states from those actions."

Bellman Optimality Equation

Instead of averaging over actions specified by a policy, we maximise over all possible actions:
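
For the optimal action-value function, this gives:

Q∗(s,a) = Σ_s′ P(s′∣s,a) · [ R(s,a,s′) + γ·max_a′ Q∗(s′,a′) ]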

This equation says the optimal value of taking action a in state s is the immediate reward plus the discounted optimal future value, where that future value is the maximum Q value achievable from the next state s′ (by picking the best action a′ in s′).

The Bellman Equations are a set of simultaneous equations that can be solved to find the value functions and, from there, the optimal policy.

Why is the Q-function (Action-Value Function) extremely important, especially in the context of learning?

The Q-function (Q(s,a)) quantifies the "quality" or "value" of taking a specific action a when in a specific state s. Remember, it represents the expected cumulative discounted reward you will receive if you take action a in state s, and then continue to follow your policy (or the optimal policy) thereafter.

  • Direct Action Choice: If you know Q∗(s,a) (the optimal Q-value), then finding the optimal action in any state s is trivial: just pick the action a that gives the highest Q∗(s,a)! In symbols, π∗(s) = argmax_a Q∗(s,a). This makes it easy for an agent to decide what to do.

  • Model-Free Learning: Many reinforcement learning algorithms, including Q-learning, can learn the optimal Q-function without knowing the transition probabilities P(s′∣ s,a) or the reward function R(s, a, s′) in advance. This means the agent can learn simply by interacting with its environment, trying actions, and observing the immediate rewards and next states. It doesn't need a "map" or "rulebook" of the world; it builds its own understanding through experience.

Think of an agent trying to learn to play a new video game. It doesn't start with the game code (transition probabilities). Instead, it tries different button presses (actions) in different screens (states), observes what happens (next states) and what score it gets (rewards), and from this experience, it starts to figure out which actions are "good" in which situations to maximise its score over time. The Q-function becomes its internal scorecard for every possible move!
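
To make the "internal scorecard" picture concrete, here is a tiny hypothetical example in Python: a Q-table for 5 states and 3 actions filled with made-up numbers, where exploiting the scorecard just means looking up the row for the current state and taking the best entry.

```python
import numpy as np

# A made-up Q-table ("scorecard") for 5 states and 3 actions.
Q = np.array([
    [0.1, 0.5, 0.2],
    [0.0, 0.3, 0.9],
    [0.4, 0.4, 0.1],
    [0.7, 0.2, 0.6],
    [0.2, 0.8, 0.3],
])

state = 1
best_action = int(np.argmax(Q[state]))  # exploit: take the highest-scoring move
print(best_action)                      # -> 2
```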

Q-Learning - here we are - finally 🙂 

Q-learning is a model-free reinforcement learning algorithm. "Model-free" means it doesn't need to know the transition probabilities (P(s′∣s,a)) or the reward function (R(s,a,s′)) of the environment beforehand. Instead, it learns good behaviour directly from experience: by repeatedly interacting with the environment, taking actions, observing rewards, and seeing what new states it lands in.

At its core, Q-learning maintains a table (or some representation) of Q-values for every possible state-action pair, Q(s,a). The goal is to iteratively update these values until they converge to the optimal Q-values, Q∗(s,a).
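
Each time the agent takes action a in state s, observes reward r, and lands in next state s′, it applies the standard Q-learning update (α is the learning rate, a small number such as 0.1):

Q(s,a) ← Q(s,a) + α · [ r + γ·max_a′ Q(s′,a′) − Q(s,a) ]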

In simple terms, the update rule says: "Adjust your current estimate of Q(s,a) by adding a fraction (the learning rate α) of the 'surprise' you just experienced. The surprise is the gap between your old estimate and what you just observed: the immediate reward you got, plus the best discounted future reward you now anticipate from the new state."

Through many iterations of interaction with the environment and applying this update rule, the Q(s,a) values gradually converge to Q∗(s,a), the optimal Q-values!

The final piece of Q-learning is a challenge that every type of learning faces: the Exploration-Exploitation Dilemma of balancing these two needs:

  • Exploiting your current knowledge to maximise immediate reward.

  • Exploring new possibilities to discover more about the environment and potentially find even greater long-term rewards.

A very common and intuitive strategy to balance this is called Epsilon-Greedy (ϵ-greedy). With a small probability, ϵ (epsilon, typically a value like 0.1 or 0.05), you would explore by choosing a random action (from all available actions). With the remaining probability, 1 − ϵ, you would exploit by choosing the action that currently has the highest estimated Q-value for the current state.
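
To tie everything together, here is a minimal Q-learning sketch in Python. The 4×4 grid, the reward scheme (+1 at the goal, a small penalty per step), and the hyperparameter values are my own toy choices for illustration (this is not the Grid-wall codebase mentioned below); the essential parts are the ε-greedy choice and the one-line Q-learning update.

```python
import numpy as np

# Toy 4x4 grid world: start in the top-left cell, +1 for reaching the
# bottom-right goal cell, -0.01 per step otherwise. All values here are
# illustrative choices, not taken from the lesson's codebase.
rng = np.random.default_rng(0)

SIZE = 4                                        # 4x4 grid -> 16 states
N_STATES = SIZE * SIZE
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right
N_ACTIONS = len(ACTIONS)
GOAL = N_STATES - 1                             # bottom-right cell

alpha, gamma, epsilon = 0.1, 0.9, 0.1           # learning rate, discount, exploration
Q = np.zeros((N_STATES, N_ACTIONS))             # the Q-table, one row per state

def step(state, action):
    """Apply an action; bumping into the grid border leaves the position unchanged."""
    row, col = divmod(state, SIZE)
    d_row, d_col = ACTIONS[action]
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    next_state = row * SIZE + col
    if next_state == GOAL:
        return next_state, 1.0, True            # goal reached, episode ends
    return next_state, -0.01, False             # small penalty for every step

for episode in range(2000):
    state = 0                                   # start in the top-left cell
    for _ in range(200):                        # cap on episode length
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if rng.random() < epsilon:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, done = step(state, action)

        # Q-learning update: nudge Q(s,a) toward r + gamma * max_a' Q(s',a').
        best_next = 0.0 if done else np.max(Q[next_state])
        Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])

        state = next_state
        if done:
            break

# The learned greedy policy: the best action index in every cell of the grid.
print(np.argmax(Q, axis=1).reshape(SIZE, SIZE))
```

After enough episodes, the printed grid of action indices points (roughly) toward the goal cell, which is exactly the greedy policy read off from the learned Q-table.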

And that is all, really 🙂 Now, if you want to practice and experience using the Q-learning update rule, check out the codebase that implements a simple Grid-wall path exploration.

Keep learning,

Sincerely, MO