Teaching Series #10 - Monte Carlo Tree Search

"...The near optimal solutions are out there, you just need to start searching..."

Monte Carlo Tree Search (MCTS)

In the previous lesson, we learned that Monte Carlo simulations give us a distribution of probabilities over the possible outcomes of an action we aim to take. We also saw that these probabilities come from iterative sampling (also known as repetition, simulation, or experimentation), which is rooted in the law of large numbers: if we repeat our sampling enough times, we get a good approximation of the most likely outcome of our action.

With that clear, this lesson builds on the previous knowledge and turns to the primary use of Monte Carlo simulation: search algorithms.

Search in this context means finding the best sequence of actions to reach a goal, especially in problems with a huge number of possibilities. The traditional search algorithms we covered previously would explore every single possible path.

Monte Carlo Tree Search, by contrast, does not exhaustively search every option; it uses random sampling to explore the most promising paths. It operates in a loop, iteratively building the search tree. Each iteration consists of four main phases:

  • Selection: starting from the current root node, MCTS traverses the existing tree by selecting the most promising child nodes according to a specific strategy, usually one that balances exploring new paths against exploiting known good ones. The traversal continues until it reaches a leaf node, i.e. a node that hasn't been fully explored yet.

  • Expansion: if the selected leaf node is not a terminal state (like "game over" in a gaming context), the algorithm creates one or more new child nodes representing possible next actions from that state.

  • Simulation: from one of the newly expanded (or existing) nodes, MCTS performs a random playout (also known as a rollout) until a terminal state is reached. This is like playing out the rest of the game randomly from that point. The outcome is recorded, and the idea is to run as many trials as possible.

  • Backpropagation: the result of the trial (win, loss, success, failure, good, bad, etc.) is then "backpropagated" up the tree to all the ancestor nodes visited during the selection phase. This updates the statistics stored inside each node, making good nodes more desirable for future selections.

MCTS repeats these steps thousands (or even millions) of times, constantly refining the statistics.
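To make the four phases concrete, here is a minimal sketch of the main loop in Python. The phase functions are passed in as arguments so the skeleton stays generic; `select`, `expand`, `simulate`, `backpropagate`, and the `state.is_terminal()` method are illustrative names, not the actual API of our codebase.

```python
def mcts(root, select, expand, simulate, backpropagate, iterations=10_000):
    """One possible shape of the MCTS loop (a sketch, not the codebase's API)."""
    for _ in range(iterations):
        node = select(root)                  # 1. Selection: descend via UCB1
        if not node.state.is_terminal():     # 2. Expansion: add an untried child
            node = expand(node)
        reward = simulate(node.state)        # 3. Simulation: random rollout
        backpropagate(node, reward)          # 4. Backpropagation: update statistics
    return root
```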

One of the most useful properties of MCTS is that it often requires no prior domain knowledge. It discovers good actions purely through playouts and statistical aggregation.

Implementation

Let’s gamify our example again by using the game of Tic-Tac-Toe.

As we saw in the previous lessons, the game can be represented as a tree, where each node is a state of the game (reached by a player's action, X or O).
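Before walking through the phases, here is a minimal sketch of what such a tree node might look like in Python. The names (`Node`, `TicTacToeState`, and the field names) are illustrative assumptions, not the actual types in our codebase.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node of the search tree: a game state plus MCTS statistics (a sketch)."""
    state: "TicTacToeState"                        # hypothetical game-state type
    parent: "Node | None" = None                   # None for the root
    action: "object | None" = None                 # move that led from the parent to this state
    children: list = field(default_factory=list)   # expanded child nodes
    untried_actions: list = field(default_factory=list)  # moves not yet expanded
    visits: int = 0                                # how many times this node was traversed
    total_reward: float = 0.0                      # sum of rollout rewards seen through it
```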

So here is what we are going to implement, and what can be found in our codebase. In essence, we start traversing the tree from the root node, which represents the current game state, down to a "promising" leaf node that is either a terminal node (game over) or a node that is not yet fully expanded (meaning there are still untried moves/actions from this state for which we haven't created child nodes in the tree yet). The core idea of the selection phase is to use the UCB1 (Upper Confidence Bound 1) formula to decide which path to follow. UCB1 balances two crucial aspects (see the sketch after this list):

  • Exploitation: Choosing moves/paths that have historically led to good outcomes (high average rewards).

  • Exploration: Choosing moves/paths that haven't been tried very often, to discover potentially even better paths.
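To make that balance concrete, here is a small sketch of the UCB1 score in Python. The `visits` and `total_reward` attributes follow the node sketch above, and the exploration constant c = sqrt(2) is just a common default.

```python
import math

def ucb1(child, parent_visits, c=math.sqrt(2)):
    """UCB1 = average reward (exploitation) + exploration bonus (a sketch)."""
    if child.visits == 0:
        return float("inf")  # unvisited children are always tried first
    exploitation = child.total_reward / child.visits
    exploration = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploitation + exploration
```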

Next, based on the selection, we expand nodes until all possible actions (children) have been explored. If the selected node (state) is terminal, we can read off a reward directly; if it can still be expanded, we continue to the expansion phase and create a new node (state) by picking one of the untried actions. Once a new node is created, we run a simulation from it: we keep applying actions to the state to observe what may happen from this node in the future, until we reach a terminal state and obtain a reward. In the last step, backpropagation, we update the simulated node and all of its parent nodes, propagating the reward obtained from the simulation upwards. For single-player optimisation, or problems where a higher value is always better, the same reward value is typically added to each node. For a two-player zero-sum game like Tic-Tac-Toe, the reward needs to be inverted as it propagates between nodes belonging to opposing players: if Player X wins (reward 1.0 for X), then the node where Player O made the move leading to X's win should see this as a loss (reward 0.0 for O). A rough sketch of these two steps is shown below.
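The sketch below assumes a hypothetical immutable game-state object with `is_terminal`, `get_legal_actions`, `apply`, and `result` methods, and a `result()` that returns 1.0 / 0.5 / 0.0 (win / draw / loss) from the perspective of the player who just moved; these names are illustrative, not the codebase's actual API.

```python
import random

def simulate(state):
    """Random rollout (playout) from `state` until the game ends (a sketch)."""
    while not state.is_terminal():
        action = random.choice(state.get_legal_actions())
        state = state.apply(action)
    return state.result()  # reward for the player who just moved (assumption)

def backpropagate(node, reward):
    """Walk back up to the root, updating visit counts and rewards (a sketch)."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        reward = 1.0 - reward  # a win for one player is a loss for the other
        node = node.parent
```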

Final Decision: Selecting the Best Next Move from the Root

After thousands of MCTS iterations, the tree contains a wealth of statistical information. The final step is to leverage this information to choose the best actual move to make from the current game state (the root node).
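A common and simple choice, sketched below, is to pick the child of the root with the highest visit count (the so-called "robust child"), since visit counts tend to be a more stable signal than raw average reward. The `action` attribute follows the node sketch above and is an assumption about how moves are stored.

```python
def best_move(root):
    """Return the action of the root's most-visited child (a sketch)."""
    best_child = max(root.children, key=lambda child: child.visits)
    return best_child.action
```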

For the full implementation details, and to get your hands dirty and your brain tired, check the codebase and don’t hesitate to play with it.

Final note: Is MCTS Optimal?

MCTS is not inherently optimal in the way exhaustive search algorithms are. It is a heuristic search algorithm that uses random sampling to explore a search space. It strives to find good, often near-optimal, solutions, but it doesn't guarantee finding the absolute best one, especially within a finite number of iterations. The simulation (rollout) phase is purely random: while it provides an unbiased estimate of a node's value, it doesn't deterministically find the best outcome from that point, and a lucky or unlucky rollout can skew a node's statistics.

Happy learning,

Sincerely, MO