Teaching Series #6 - Markov Decision Process

"...Only present defines your next step..."

Imagine you are watching a game on TV where the outcome of each round is a bit random – maybe it's a card game, darts, or something that involves rolling dice. A stochastic process is simply a sequence of random events, where each event has some probability associated with it (it does not hurt to have this clear 🙂 )

To understand Markov Decision Processes, we also need to understand what the Markov Property is. It is simple, but very important. The property states that the future state of the process depends only on the current state, and not on the sequence of events that led to that current state. Imagine a simple coin flip game. If you are trying to predict the next flip, all you need to know is if the coin is fair – if it is, the previous flips don't change the probability of the next one being heads or tails.

It's like saying, "what happens next in this game only depends on the current board position and the cards I have at the moment, not on how I got to this position." The past is irrelevant once you know the present!

Now, imagine having multiple states or situations like this, chained together, where you can move from one state to another. This chain is called a Markov Chain, and it is a mathematical model that describes a sequence of possible events where the probability of each event depends only on the state reached at the previous event. Think of it as a journey through different "states" (like "Sunny," "Cloudy," "Rainy" for weather) where the transitions between these states are probabilistic and, critically, follow the Markov Property.

We can organise all these transition probabilities into a table called a transition matrix, often denoted by P. Each row represents the current state, and each column represents the next state.

Notice that the probabilities in each row must always sum up to 1 (because you have to go somewhere from the current state!). For example, if the Sunny row is (0.8, 0.2, 0.0), then 0.8 + 0.2 + 0.0 = 1.
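Here is a minimal sketch of what this matrix could look like in Python (using NumPy) for the weather example. Only the Sunny row (0.8, 0.2, 0.0) comes from the text; the Cloudy and Rainy rows are assumed values chosen just so that each row sums to 1.

```python
import numpy as np

# Weather states, in a fixed order: Sunny, Cloudy, Rainy.
STATES = ["Sunny", "Cloudy", "Rainy"]

# One-step transition matrix P: row = current state, column = next state.
# Only the Sunny row comes from the text; the other rows are made up.
P = np.array([
    [0.8, 0.2, 0.0],   # from Sunny
    [0.4, 0.3, 0.3],   # from Cloudy (assumed)
    [0.2, 0.6, 0.2],   # from Rainy  (assumed)
])

# Every row must sum to 1 -- from any state you have to go *somewhere*.
assert np.allclose(P.sum(axis=1), 1.0)
```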

Now we have individual transition probabilities! That's a great start! 🙂 

Now for the really exciting part: we can use the transition matrix to predict the future!

The beauty of the Markov Chain and its transition matrix is that we can use them to answer questions like: "If it is sunny today, what's the probability it will be rainy in two days?" or "What is the probability of a specific sequence of events happening?"

Predicting the State After One Step

If you know the current state, predicting the next state is straightforward: you just look at the relevant row in the transition matrix. If it's Sunny today (S), the probability of being Sunny tomorrow is 0.8, Cloudy is 0.2, and Rainy is 0.0.
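In code, that is just reading a single row of the matrix (same assumed matrix as in the earlier sketch):

```python
import numpy as np

STATES = ["Sunny", "Cloudy", "Rainy"]
P = np.array([[0.8, 0.2, 0.0],   # Sunny row from the text
              [0.4, 0.3, 0.3],   # assumed
              [0.2, 0.6, 0.2]])  # assumed

today = STATES.index("Sunny")             # row index 0
for state, prob in zip(STATES, P[today]):
    print(f"P(tomorrow is {state} | today is Sunny) = {prob}")
```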

Predicting the State After Multiple Steps

This is where matrix multiplication comes in beautifully. If P is the one-step transition matrix, then P² (P multiplied by itself) gives you the two-step transition probabilities.

So, if we want to know the probability of going from state i to state j in two steps, we would look at the element in the i-th row and j-th column of P².

In the same way, Pⁿ would give you the probabilities of transitioning between states in n steps.

So, if it's Sunny today, there's a 6% chance it will be Cloudy tomorrow and then Rainy the day after – you get that by multiplying the one-step probabilities along that specific path: P(Sunny → Cloudy) × P(Cloudy → Rainy) = 0.06.
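As a small sketch of that calculation (using the same assumed matrix as before, where the Cloudy → Rainy entry is taken to be 0.3 so that the path works out to 6%):

```python
import numpy as np

P = np.array([[0.8, 0.2, 0.0],   # Sunny row from the text
              [0.4, 0.3, 0.3],   # assumed (Cloudy -> Rainy = 0.3)
              [0.2, 0.6, 0.2]])  # assumed
S, C, R = 0, 1, 2                # Sunny, Cloudy, Rainy

# Probability of the specific path Sunny -> Cloudy -> Rainy:
# multiply the one-step probabilities along the path.
path_prob = P[S, C] * P[C, R]
print(path_prob)                 # 0.2 * 0.3 = 0.06, i.e. 6%
```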

This process highlights how the Markov Property simplifies things: each step only depends on the immediately preceding state.

The matrix multiplication comes into play when you want to know the overall probability of being in a certain state after 'n' steps, regardless of the specific path taken. This is where Pⁿ becomes super useful.
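A quick sketch of that aggregate calculation, again with the assumed matrix:

```python
import numpy as np

P = np.array([[0.8, 0.2, 0.0],
              [0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2]])   # same assumed matrix as above
S, R = 0, 2                       # Sunny, Rainy

# Two-step transition probabilities: P squared.
P2 = np.linalg.matrix_power(P, 2)

# Overall probability of Rainy two days after a Sunny day, summed over
# every possible intermediate state. Here only the Cloudy route contributes,
# because the direct Sunny -> Rainy probability is 0.
print(P2[S, R])                   # 0.8*0.0 + 0.2*0.3 + 0.0*0.2 = 0.06
```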

Imagine how the weather system keeps evolving day after day, week after week. If we run the Markov Chain for a very, very long time, something interesting often happens: the probability of being in any given state (Sunny, Cloudy, or Rainy) tends to settle down and become constant, regardless of the initial state. This long-term probability distribution is called the steady-state distribution.

The steady-state distribution tells us the long-term behaviour of the system. For the weather example, it would tell us, on average, what percentage of days are Sunny, what percentage are Cloudy, and what percentage are Rainy over a very long period, even if we don't know what today's weather is. It is incredibly useful for predicting the long-run proportions of time spent in each state.
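One simple way to see this numerically is to start from any distribution and keep applying P; the sketch below (with the assumed matrix) converges to the same long-run proportions no matter where you start:

```python
import numpy as np

P = np.array([[0.8, 0.2, 0.0],
              [0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2]])   # same assumed matrix as above

# Start from an arbitrary distribution and keep applying P.
dist = np.array([1.0, 0.0, 0.0])  # "definitely Sunny today"
for _ in range(100):
    dist = dist @ P               # distribution one more day ahead

print(dist)  # long-run fractions of Sunny / Cloudy / Rainy days
```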

Markov Decision Processes (MDPs)

Think of an MDP as a Markov Chain with a super important upgrade: a decision-making agent. In a standard Markov Chain, the transitions between states are fixed probabilities (e.g., the weather just changes on its own). In an MDP, the agent is in a state and gets to choose an action; that action influences which state comes next and earns a reward based on how good the outcome is.

MDPs are formally defined by five key components, often remembered as the tuple (S, A, P, R, γ - gamma); a toy code sketch follows the list:

  1. States (S): This is the set of all possible situations or environments the agent can be in. Just like in Markov Chains (e.g., Sunny, Cloudy, Rainy).

  2. Actions (A): This is the set of all possible actions the agent can take in any given state. For example, { “Take an umbrella”, “Leave an umbrella”}.

  3. Transition Probabilities (P): This is where your chosen action comes in! Now, P(s′ | s, a) is the probability of transitioning to state s′ given that the agent was in state s and took action a. So, it's not just a single matrix entry Pᵢⱼ anymore, but P(s_current → s_next | action).

  4. Reward Function (R): This tells the agent how much immediate reward it gets for taking action a in state s and transitioning to state s′. This is crucial because the agent's goal is to maximise total rewards over time. For example, taking an umbrella on a rainy day might give a +10 reward, while getting caught in unexpected rain with no umbrella might give -5.

  5. Discount Factor (γ - gamma): This is a value between 0 and 1 that determines the present value of future rewards. A high γ (closer to 1) means future rewards are almost as important as immediate ones; a low γ (closer to 0) means the agent cares mostly about immediate rewards. In other words, the discount captures how much we value reward now versus reward we might only gain in the future.
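Here is the toy code sketch promised above, built around the umbrella example. Only the +10 and -5 rewards come from the text; every other number (the states, the transition probabilities, the remaining rewards, and γ = 0.9) is an assumption made purely for illustration.

```python
import numpy as np

states  = ["Sunny", "Rainy"]                       # S
actions = ["take_umbrella", "leave_umbrella"]      # A

# P[a][i, j] = P(s_j | s_i, a): probability of moving from state i to state j
# when taking action a. In this toy example the weather ignores the action,
# but in general each action can have its own transition matrix.
P = {
    "take_umbrella":  np.array([[0.8, 0.2],
                                [0.4, 0.6]]),
    "leave_umbrella": np.array([[0.8, 0.2],
                                [0.4, 0.6]]),
}

# R[(s, a)]: immediate reward for taking action a in state s.
R = {
    ("Rainy", "take_umbrella"):  +10,  # from the text: umbrella in the rain
    ("Rainy", "leave_umbrella"):  -5,  # from the text: caught without one
    ("Sunny", "take_umbrella"):   -1,  # assumed: carrying it is a small hassle
    ("Sunny", "leave_umbrella"):   0,  # assumed
}

gamma = 0.9   # discount factor, a value between 0 and 1
```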

The core idea of an MDP is that an agent is trying to find the best sequence of actions (a policy) to maximise the total, discounted rewards it receives over a long period.

What is a Policy?

In the world of MDPs, a policy, usually denoted by π (pi), is essentially the agent's "strategy" or "rulebook" for making decisions. It tells the agent what action to take in every possible state.

A policy can be:

  • Deterministic: This means for a given state, there's only one action the policy will choose. It's like saying, "If you are in State A, ALWAYS do Action X."

  • Stochastic (or Probabilistic): This means for a given state, the policy specifies a probability distribution over actions. For example, "If you are in State A, take Action X with 80% probability and Action Y with 20% probability." This introduces a bit of randomness into the agent's own decision-making.

The ultimate goal when working with MDPs is to find an optimal policy, denoted π* (pi-star). This is the policy that maximises the agent's expected cumulative reward over the long run, taking into account the discount factor. It's the "best" rulebook for the agent to follow to succeed in its environment.
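As a sketch, reusing the hypothetical umbrella actions from the MDP example above, a deterministic policy is just a state-to-action mapping, while a stochastic policy maps each state to a probability distribution over actions (all the numbers here are assumptions):

```python
import numpy as np

rng = np.random.default_rng()

# Deterministic policy: exactly one action per state.
deterministic_policy = {
    "Rainy": "take_umbrella",
    "Sunny": "leave_umbrella",
}

# Stochastic policy: a probability distribution over actions for each state.
stochastic_policy = {
    "Rainy": {"take_umbrella": 0.8, "leave_umbrella": 0.2},
    "Sunny": {"take_umbrella": 0.1, "leave_umbrella": 0.9},
}

def sample_action(policy, state):
    """Draw an action from a stochastic policy for the given state."""
    acts = list(policy[state])
    probs = [policy[state][a] for a in acts]
    return rng.choice(acts, p=probs)

print(deterministic_policy["Rainy"])              # always the same action
print(sample_action(stochastic_policy, "Rainy"))  # random, 80/20 split
```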

What does the expected cumulative reward mean?

Sounds a bit complex, right? 🙂 Well, it is not. It means that the agent is trying to maximise the total reward it expects to receive over its entire future, taking into account the discount factor. This total future reward is often called the cumulative reward or return.

A smart agent using an MDP framework would choose the action that leads to the highest long-term happiness (total reward), not just the immediate satisfaction. The discount factor (γ) comes into play here because future rewards are generally worth a bit less than immediate ones.

The challenge, and the central problem in solving an MDP, is to find that optimal policy (π*) – the one strategy that, if followed consistently, will lead to the greatest expected cumulative discounted reward from any starting state.
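As a tiny sketch of the "cumulative discounted reward" idea in code (the reward sequence below is made up purely for illustration):

```python
def discounted_return(rewards, gamma):
    """Return G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [10, -5, 10, 0, 10]   # assumed sequence of immediate rewards

print(discounted_return(rewards, gamma=0.9))  # future rewards still matter a lot
print(discounted_return(rewards, gamma=0.1))  # mostly just the first reward counts
```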

So, we covered many concepts in this lesson, and we are opening the door to many more. In essence, we are standing right in front of concepts like Value Functions and the Bellman Equation, which are used to find the optimal policy (the agent's strategy for which actions to take). We will cover them next, but this may be the best moment for a break and a chance to think about what we went through.

In the next lessons, we are going to cover all the mathematical tools we need to calculate how "good" a state or action is in terms of future rewards, and therefore, help us find that optimal policy.

Keep being curious,

Sincerely, MO