Sign in to your account

Introduction

Heuristic search algorithms rely on a heuristic function, h, to guide search for planning. The aim of such a heuristic function is to produce a quick-to-compute estimate of the true cost-to-goal, h*, for any given state s.

A well-known property of heuristic search based algorithms like A* or IDA* is that if the heuristic never overestimates the true cost-to-goal - that is, h(s)≤h*(s) - then the plans produced by these algorithms is guaranteed to be optimal. Such a heuristic is called an admissible heuristic.

Learning Heuristics

An alternative approach is to learn heuristics from data using machine learning algorithms. For example, consider the popular 15-puzzle. The aim of the 15-puzzle is to reach a goal state from some start state by sliding a blank tile in a direction and swapping that blank tile with the adjacent number in that direction.

Now, suppose we have a set of optimal plans for a set of 15-puzzle tasks, where each plan is for a 15-puzzle task with a different start state. Then it is possible to use these plans as training data for a supervised learning algorithm, such as a neural network, to learn a heuristic. Since supervised learning algorithms generalise to unseen data, such heuristics can then be applied to new, previously unseen tasks i.e. 15-puzzle tasks with different start states to what was observed in training.

Previous research has shown that such learned heuristics perform well [1]. However, in general such heuristics are not guaranteed to be admissible, and so the plans they produce are not guaranteed to be optimal.

Bootstrap Learning

Producing plans for training data can be prohibitive. This is especially true for domains with large-state spaces for which producing optimal plans is computationally expensive.

An alternative approach is to apply a bootstrap learning procedure that interweaves planning and learning [2].

The bootstrap procedure works as follows:

Start with some initial weak heuristic h:=h0 and some integer k>0. The heuristic h0 can be, for example, a neural network with random weights.

Generate some training tasks, where each task is generated by taking k random steps back from the goal.

Solve the tasks with h0 to obtain a plan for each task.

Train h0 on the plans from step 2. to obtain a stronger heuristic h1.

Repeat steps 1-3 with heuristic h:=h1 and k := k+l where l is a length increment.

The basic idea of the bootstrap procedure is to ensure that at all times the tasks generated are easily solvable by the current heuristic. This is achieved by generating incrementally more difficult tasks, while at all times keeping the current heuristic strong enough by solving and then learning from the previous batch of generated tasks.

Utilising Uncertainty in Bootstrap Learning

The bootstrap procedure described above has two major limitations.

It is sample inefficient. This because the procedure works by taking random steps back from the goal rather than having more targeted approach.

It outputs a final heuristic that produces plans with high sub-optimality on test tasks. This is because errors compound with each iteration of the procedure.

In our paper we propose to overcome both of these limitations by leveraging epistemic and aleatoric and uncertainty. Epistemic uncertainty accounts for uncertainty in the model of a process and can be explained away given enough data while aleatoric uncertainty accounts for uncertainty that the model cannot explain away due to probabilistic variation.

As an illustrative example of these uncertainties, consider a biased coin that lands on heads with probability p, and assume we do not know p. Suppose we have never flipped the coin before. Since there is no data to estimate p and we would have high epistemic uncertain over p. As more data is obtained by flipping the coin, we can estimate p more accurately and thus reduce our epistemic uncertainty.

Coin tossing is modelled as a Bernoulli distribution which is known to have variance p(1-p). Therefore, even if we know p, the process of coin tossing still has probabilistic variation that cannot be explained away no matter how much data we observe, and this is called aleatoric uncertainty.

While we have explained the concepts of epistemic and aleatoric uncertainty in the context of modelling a coin toss, the idea can be generalised to models of other processes. In particular, Bayesian neural networks are neural network architectures that, given some input, produce measures of epistemic and aleatoric uncertainty in addition to making a mean prediction [3,4].

We therefore propose two enhancements to the bootstrap learning procedure:

Rather than taking random steps back from the goal, we use epistemic uncertainty for a more targeted approach.

Rather than planning with the mean, we use a more conservative statistic.

For the first enhancement, consider the 15-puzzle domain. To generate a task we start at the goal state, as seen in the picture below (first image), and compute the epistemic uncertainty for every possible next state (top right of the tiles). These epistemic uncertainties are converted to a softmax distribution (bottom right of the tiles) and a particular next state is sampled from this distribution. These steps repeat until we observe a next state with epistemic uncertainty above some threshold (in this example the threshold is 1). At that point we generate a task with start state equal to that next state (last image).

The key idea behind this approach is that we seek uncertainty when generating tasks because we are more likely to sample states with high epistemic uncertainty. These tasks are then solved by the current heuristic to produce plans. In the next training iteration, the heuristic learns from these plans, thereby reducing the epistemic uncertainty on these tasks. Consequently, In the next run of task generation the procedure will generate tasks that are different from what was previously observed in training. This approach to task generation is more sample efficient that taking random steps back from the goal because the steps are directed to areas of the state-space that are still unexplored.

For the second enhancement, we propose to use a more conservative statistic than the mean. In particular a standard feed-forward neural network only produces a single output which predicts the mean of the process as inferred from the training data.

In our paper we assume that the output follows a normal distribution, and so with Bayesian neural networks the output is an entire distribution parameterised by a mean output and a variance output. The variance output is comprised from a function of the epistemic and aleatoric uncertainty measures produced by the network. Therefore, rather than using the mean we use a more conservative statistic, such as the 25th percentile, during planning.

More formally, this produces a heuristic that is likely-admissible [1] (admissible with some probability) and therefore the heuristic produces plans that are likely-optimal. This approach overcomes the compounding error problem that exists with previous approaches to bootstrap learning that rely only on the mean because the plans that the heuristic learns from remain likely-optimal in each iteration.

In our paper we run experiments on the 15-puzzle, 24-puzzle, 24-pancake and 15-blocksworld domains illustrating that our approach is able to efficiently learn likely-admissible heuristics for these domains.