A reinforcement-learning agent chooses its actions based on its internal memory state. The memory doesn't have to include an exact account of all past observations - it's sufficient for the agent to keep track of a belief of the current state of the world.

This mitigates the problem of having the size of the memory state grow indefinitely. The memory-state space is now the set of all distributions over the world state. Importantly, this space doesn't grow exponentially with time like the space of observable histories. Unfortunately, it still has volume exponential in the number of world states.

So we moved away from specifying the action to take after each observable history, in favor of specifying the action to take in each belief state. This was justified by the belief being a sufficient statistic for the world state, and motivated by the belief space being much smaller (for long processes) than the history space. But for the purpose of understanding this algorithm, we should go back to describing the action policy in terms of the action to take after each observable history, π(At|O1,...,Ot).

Now join us as we're watching the dynamics of the process unfold. We're at time t, and the agent has reached Bayesian belief Bt, the distribution of the world state to the best of the agent's knowledge. What prospects does the agent still have for future rewards? What is its value-to-go of being in this belief state?

If we know what action the agent will take after each observable history (if and when it comes to pass), it turns out that the expected total of future rewards is linear in Bt. To see why this is so, consider a specific future (and present):

We arrive at this by starting with the agent's belief Bt, which summarizes the distribution of wt given the observable history; then going on to multiply the probability for each new variable, given those before it that have part in causing it.

Note how Bt linearly determines the probability for wt, while the dynamics of the world and the agent determine the probability for the rest of the process. Now, to find the expected reward we sum, over all possible futures, the product of the future's reward and probability. This is all linear in the belief Bt at time t.

If the value of choosing a specific action policy is a linear function of Bt, the value of choosing the best action policy is the maximum of many linear functions of Bt. How many? If there are 2 possible observations, the observable history that will have unfolded in the next n-t steps can be any of 2n-t possible "future histories". If there are 2 possible actions to choose for each one, there are 22n-t possible policies. A scary number if the end is still some time ahead of us.

Of course, we're not interested in the value of just any belief, only of those that can actually be reached at time t. Since the belief Bt is determined by the observable history o1,...,ot, there are at most 2t such values. Less scary, but still not practical for long processes.

The secret of the algorithm is that it limits both of the figures above to a fixed, manageable number. Suppose that we have a list of 100 belief states which are the most interesting to us, perhaps because they are the most likely to be reached at time t. Suppose that we also have a list Πt of 100 policies: for each belief on the first list, we have in the second list a policy that gets a good reward if indeed that belief is reached at time t. Now what is the value, at time t, of choosing the best action policy of these 100? It's the maximum of, not many, but |Πt|=100 linear functions of Bt. Much more manageable.

Ok, we're done with time t. What about time t-1? We are going to consider 100 belief states which can be reached at time t-1. For each such belief Bt-1, we are going to consider each possible action At-1. Then we are going to consider each possible observation Ot that can come next - we can compute its probability from Bt-1 and the dynamics of the system. Given the observation, we will have a new belief Bt. And the previous paragraph tells us what to do in this case: choose one of the policies in Πt which is best for Bt, to use for the rest of the process.

(Note that this new Bt doesn't have to be one of the 100 beliefs that were used to construct the 100 policies in Πt.)

So this is how the recursive step goes:

For each of 100 possible values for Bt-1

For each possible At-1

For each possible Ot

We compute Bt

We choose for Bt the best policy in Πt, and find its value (it's a linear function of Bt)

We go back and ask: how good is it really to choose At-1? What is the expected value before knowing Ot?

We choose the best At-1

This gives us the best policy for Bt-1, under the restriction that we are only willing to use one of the policies in Πt from time t onward

After going over the 100 beliefs for time t-1, we now have a new stored list, Πt-1, of 100 policies for time t-1.

We continue backward in time, until we reach the starting time t0, where we choose the best policy in Πt0 for the initial (prior) belief Bt0.

We still need to say how to choose those special 100 beliefs to optimize for. Actually, this is not a fixed set. Most point-based algorithms start with only one or a few beliefs, and gradually expand the set to find better and better solutions.

The original algorithm chooses beliefs in a similar way to an internal availability mechanism that may be responsible for conjuring up images of likely futures in human brains. It starts with only one belief (the prior Bt0) and finds a solution where only one policy is allowed in each Πt. Then it adds some beliefs which seem likely to be reached in this solution, and finds a new and improved solution. It repeats this computation until a good-enough solution is reached, or until it runs out of time.

Point-based algorithms are today some of the best ones for finding good policies for Bayesian agents. They drive some real-world applications, and contribute to our understanding of how intelligent agents can act and react.