The Animated Monte-Carlo Tree Search (MCTS)
The algorithm at the heart of AlphaGo, AlphaGo Zero, AlphaZero and MuZero
Monte-Carlo Tree Search (MCTS) is a heuristic search method whose main purpose is to choose the most promising move given the current state of a game. It does so by repeatedly playing the game from the current state onwards, thereby identifying promising moves and quickly abandoning the unpromising ones. As we will see shortly, the challenge lies in performing as few play-outs as possible while looking as far ahead into the future as possible. MCTS is the key algorithm behind many major achievements in recent AI, such as the impressive AlphaGo showdown in 2016, where AlphaGo outplayed the second-highest ranked Go player.
At first glance MCTS may appear to be a simple algorithm, but it hides a considerable amount of complexity, such as the inherent “exploitation-exploration trade-off”. Due to the highly dynamic nature of the algorithm (the search tree grows and is updated on each iteration), it is best explained using animations, a lot of animations. These will help you quickly develop a solid mental model of MCTS, making it much easier to understand groundbreaking advances like the famous AlphaGo, where MCTS is combined with neural networks.
In this article, we start with the motivation behind MCTS. Next, we introduce the individual steps the algorithm consists of: selection, expansion, simulation and backpropagation. We take a particularly close look at backpropagation, as it is neglected in many tutorials. Finally, we take a very close look at the mechanism behind the “exploitation-exploration trade-off”.
Environment
We are going to use a deterministic version of the well-known grid world environment to get familiar with MCTS. In this version the agent obeys our commands exactly, without randomly slipping to the side:
As you can see from the animation, we can choose from four different compass directions (North, West, South, East). There are two terminal states: a winner state with a reward of +1 and a loser state with a reward of -1. All other states return a small negative reward of -0.02, which discourages wandering around aimlessly and encourages taking the shortest route. The black cell is an obstacle that the agent cannot pass through.
If the agent bumps into a wall, it simply stays where it is, but it still collects the negative reward of -0.02.
At the end of this post we will extend the grid world to have stochastic dynamics: under these conditions, if we command the agent to go north, there is, for example, a 10% chance that it accidentally veers off to the left, a 10% chance that it veers off to the right, and only an 80% chance that it actually moves in the commanded direction.
Last but not least, as a planning algorithm MCTS requires the environment to be able to store and reload a state, so that we can perform multiple play-outs starting from the same state.
As a side note: the grid world is a small example of what is called a Markov decision process (MDP).
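To make the setup concrete, below is a minimal Python sketch of such a deterministic grid world. The class name, the 3×4 layout, the start position and the save_state/load_state methods are illustrative assumptions rather than an interface prescribed by this article:

```python
import copy

class GridWorld:
    """Minimal deterministic grid world: four actions, +1/-1 terminal rewards,
    a -0.02 step penalty, an obstacle cell and bouncing off walls."""

    ACTIONS = {0: (-1, 0), 1: (0, -1), 2: (1, 0), 3: (0, 1)}  # North, West, South, East

    def __init__(self, rows=3, cols=4, obstacle=(1, 1),
                 win=(0, 3), lose=(1, 3), start=(2, 0)):
        self.rows, self.cols = rows, cols
        self.obstacle, self.win, self.lose, self.start = obstacle, win, lose, start
        self.reset()

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        # Bouncing off a wall or the obstacle: the agent stays where it is.
        if not (0 <= r < self.rows and 0 <= c < self.cols) or (r, c) == self.obstacle:
            r, c = self.pos
        self.pos = (r, c)
        if self.pos == self.win:
            return self.pos, 1.0, True    # winner state
        if self.pos == self.lose:
            return self.pos, -1.0, True   # loser state
        return self.pos, -0.02, False     # small negative reward everywhere else

    # MCTS needs to store a state and load it again for repeated play-outs.
    def save_state(self):
        return copy.deepcopy(self.pos)

    def load_state(self, state):
        self.pos = copy.deepcopy(state)
```

The save_state/load_state pair is all the planning interface we need: every play-out can be started from an identical checkpoint.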
Motivation
Our goal is to find the quickest path to the positive reward in the upper right corner of the grid world. We will first devise a simple brute-force algorithm, which systematically tries out all possible combinations of actions.
Search tree, numbers below each node indicate total rewards (Image by author)
First, notice how the representation of all possible future states naturally leads to a tree-like structure. Each node of the tree represents a particular state in the game, whereas edges represent different actions. The child nodes of each node hence represent all possible next states a player can reach from there. The terminal states of the game are represented by leaf nodes and do not possess any children. The node at the top is called the root node.
Please notice the hierarchical organization of the tree: the root node belongs to level 0, its child nodes belong to level 1, and their children in turn belong to level 2, and so on. Moreover, each level represents a different future point in time: level 0 represents the current step, level 1 represents all possible future states after taking a single action, level 2 represents all possible future states after taking two actions, and so on.
Now, we notice an interesting thing: the total rewards (the numbers underneath each node) are identical across all levels explored so far. In particular, this means that even after reaching level 2, we still cannot decide which action to take next. Thus, we need to expand the tree much deeper:
We see in the left part of the figure above that by taking either the “up” or the “right” action in the root state, we can reach the winner state in 5 steps. The maximal possible reward amounts to 0.92. Taking the “down” or “left” action, depicted on the right-hand side of the figure, involves at least one bump against a wall, so the agent can only reach the winner state at level 6, with the final reward reduced to 0.9.
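Both totals follow directly from the reward scheme: every intermediate step costs 0.02 and entering the winner state pays +1.

```latex
\begin{aligned}
4 \times (-0.02) + 1 &= 0.92 && \text{(winner reached in 5 steps)}\\
5 \times (-0.02) + 1 &= 0.90 && \text{(winner reached in 6 steps, including one bump)}
\end{aligned}
```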
We can conclude that expanding the tree in depth is crucial for proper decision-making. The flip side of the coin, however, is that the number of states we need to visit grows exponentially with each additional level:
- At level 1: 4
- At level 2: 4² = 16
- At level 3: 4³ = 64
- At level 4: 4⁴ = 256
- At level 5: 4⁵ = 1024
- …
- At level 10: 4¹⁰ = 1,048,576
The 4 used above is the so-called branching factor and corresponds to the number of actions we can take in each state. Go has a branching factor of approximately 250 (!!!), whereas chess typically offers around 30 actions to choose from. Expanding just 5 steps ahead in Go leads us to an astronomical number of possible states: 250⁵ = 976,562,500,000.
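For reference, these counts can be reproduced with a tiny helper (a throwaway illustration, not part of the MCTS code):

```python
def num_states(branching_factor: int, depth: int) -> int:
    """Number of states at the given tree depth."""
    return branching_factor ** depth

print(num_states(4, 10))   # grid world, 10 levels deep: 1048576
print(num_states(250, 5))  # Go, only 5 levels deep:     976562500000
```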
Visiting each and every node of the search tree is impractical, especially since for proper decision making we must expand the tree in depth to look further ahead into the future. But how can we make the brute-force tree-search algorithm above feasible for more complex tasks?
The solution that MCTS pursues is to run repeated simulations called roll-outs or play-outs. In each play-out, the game is played out to the very end by selecting moves at random. The final game result is used to quickly identify and abandon unpromising paths and to concentrate on the more promising ones.
The total reward we obtain in the roll-out depicted above amounts to 0.62. Remember that the shortest paths we explored before yielded a total reward of 0.92. We hence see a high variance in the outcomes of different simulations. In order to gain confidence in the statistics, we need to repeatedly run simulations in the environment. This repeated random sampling of the search tree is what the name Monte-Carlo refers to.
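A single play-out could look roughly like the sketch below, assuming the hypothetical GridWorld interface from above; the step cap is merely a practical safeguard and not part of the algorithm:

```python
import random

def random_rollout(env, state, num_actions=4, max_steps=200):
    """Play the game to the very end with random actions, starting from
    the given checkpoint, and return the sum of collected rewards."""
    env.load_state(state)
    total, done, steps = 0.0, False, 0
    while not done and steps < max_steps:
        action = random.randrange(num_actions)   # random roll-out policy
        _, reward, done = env.step(action)
        total += reward
        steps += 1
    return total
```

Running many such roll-outs from the same saved state and averaging their outcomes is exactly the repeated random sampling that gives Monte-Carlo its name.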
One last remark regarding the second part of the name, “tree-search”: remember that the systematic traversal of a tree in order to find an optimal path is called tree-search.
Initialize the Root Node
The first thing we do is reset the environment in order to start from a controlled state. We then save the state (it is actually a checkpoint which fully describes the initial state of the environment). Subsequently, we create a new search tree node and attach the just obtained state to it:
As we have seen in the introduction, a search tree is a hierarchical structure in which each node can be connected to many children, but must be connected to exactly one parent, except for the root node, which has no parent. An abstract data type representing a node must hence keep a list of pointers to its child nodes and a pointer to its parent node. Other important member variables of the node structure are:
- reward: reward for taking action a in state s
- value_sum: sum of all obtained value estimates (see definition of value below)
- visit_count: how many play-outs contribute to value_sum
- state: checkpoint describing the full state of the environment
In addition we will keep a list of all yet unvisited actions and a boolean variable indicating whether the state is terminal or not.
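Putting the member variables above together, a node might look roughly as follows in Python; the exact class and attribute names are illustrative choices rather than anything prescribed by MCTS itself:

```python
class Node:
    """A single node of the MCTS search tree."""

    def __init__(self, state, reward=0.0, done=False, parent=None, num_actions=4):
        self.state = state          # checkpoint describing the full state of the environment
        self.reward = reward        # reward for taking action a in state s
        self.done = done            # is the attached state terminal?
        self.parent = parent        # exactly one parent; None for the root node
        self.children = []          # pointers to the child nodes
        self.value_sum = 0.0        # sum of all obtained value estimates
        self.visit_count = 0        # how many play-outs contribute to value_sum
        self.unvisited_actions = list(range(num_actions))  # actions not yet expanded


# Initializing the root node as described above: reset the environment,
# save a checkpoint of the initial state and attach it to a fresh node.
env = GridWorld()
env.reset()
root = Node(state=env.save_state())
```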
Expansion Step
We start by loading the state attached to the root node into the environment. Next, we pick an unexplored action from the set of unvisited actions. We step the environment with the just picked action and note down the returned values such as reward, done, etc.
Similar to the initialization step, we then save the new state of the environment, create a new search tree node and attach the state to it. Next, we populate the reward member variable (and possibly other variables) with the just obtained values. Last but not least, we add a pointer from the root to the newly created child node and a pointer from the child node back to its parent.
Please note that in the MCTS version presented in this tutorial, we expand only a single node before continuing with the simulation step. Other flavors of MCTS might expand all child nodes first before continuing with the simulation step.
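In code, the expansion step described above might look like this (reusing the hypothetical GridWorld and Node sketches from earlier):

```python
def expand(node, env):
    """Expand a single child of `node` by trying one of its unvisited actions."""
    env.load_state(node.state)              # restore the environment to this node's state
    action = node.unvisited_actions.pop()   # pick (and remove) one unexplored action
    _, reward, done = env.step(action)      # step the environment, note down reward and done
    child = Node(state=env.save_state(),    # attach a checkpoint of the new state
                 reward=reward, done=done, parent=node)
    node.children.append(child)             # pointer from the parent to the new child
    return child
```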
Simulation Step
The main purpose of the simulation step is to obtain an estimate of how promising the currently visited state is. Remember that we define the notion of “how good” a state is in terms of the cumulative reward that is expected to be collected from that state onwards, going into the future.
The cumulative reward, also called the return, is calculated for the trajectory obtained during a random roll-out.
This trajectory starts at the current state at timestep t, is played out to the very end of the game, i.e. until a terminal state is reached, and its actions are chosen according to a random roll-out policy.
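Using the common notation G_t for the return and R_k for the reward received at timestep k, the undiscounted return of such a roll-out is simply the sum of all rewards collected until the terminal timestep T:

```latex
G_t = R_{t+1} + R_{t+2} + \dots + R_T = \sum_{k=t+1}^{T} R_k
```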
It is common to use a discount factor