Early, incremental evidence

Many models that are important in applications have state spaces so large that, when the model is naively written, information only becomes available to guide inference at the very end of the computation. This makes it difficult for sequential exploration strategies (such as enumeration and particle filtering) to work. Two common examples are the hidden Markov model (HMM) and the probabilistic context-free grammar (PCFG). We first introduce these models, then describe techniques to transform them into a form that makes sequential inference more efficient. Finally, we will consider a harder class of models with ‘global’ conditions.

Unfolding data structures

The HMM

Below, we assume that transition is a stochastic transition function from hidden to hidden states, observeState is an observation function from hidden to observed states, and init is an initial distribution.
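The shape of such a model can be sketched in plain JavaScript (the document's examples are in WebPPL) by exhaustively enumerating executions. The transition and observation parameters below are made up for illustration, as is the choice of trueObs; the point is only that the evidence is checked at the very end, after every complete execution has been generated:

```javascript
// Sketch (plain JavaScript, illustrative parameters): a boolean HMM in the
// naive form, where the evidence is only applied at the very end.

// Distributions are arrays of [value, probability] pairs.
const init = [[true, 0.5], [false, 0.5]];
const transition = (s) => (s ? [[true, 0.7], [false, 0.3]]
                             : [[true, 0.3], [false, 0.7]]);
const observeState = (s) => (s ? [[true, 0.9], [false, 0.1]]
                               : [[true, 0.1], [false, 0.9]]);

const trueObs = [true, true, true];   // made-up observed sequence

// Enumerate every complete execution: an initial state, then n
// transition/observation steps. Only at the end do we keep the paths whose
// observations match trueObs (the final factor).
function enumerateHMM(n) {
  let paths = init.map(([s, p]) => ({ states: [s], obs: [], p }));
  for (let i = 0; i < n; i++) {
    const next = [];
    for (const path of paths) {
      const s = path.states[path.states.length - 1];
      for (const [s2, pT] of transition(s)) {
        for (const [o, pO] of observeState(s2)) {
          next.push({
            states: path.states.concat([s2]),
            obs: path.obs.concat([o]),
            p: path.p * pT * pO,
          });
        }
      }
    }
    paths = next;
  }
  // The late factor: zero weight unless the observations match exactly.
  const kept = paths.filter((path) =>
    path.obs.every((o, i) => o === trueObs[i]));
  const z = kept.reduce((a, path) => a + path.p, 0);
  return kept.map((path) => ({ states: path.states, p: path.p / z }));
}
```

Note that every one of the 2 × 4^n complete executions must be built before a single one can be discarded; nothing about the observations constrains the search until the last step.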

Notice that if we allow enumeration only a few executions, it will not find the correct state sequence. It doesn’t realize until ‘the end’ that the observations must match trueObs, so it first explores the executions that are most likely a priori; for these, the hidden state is likely to have been [false, false, false].

The PCFG

The PCFG is very similar to the HMM, except it has an underlying tree structure instead of a linear one.

This program computes the probability distribution over the next word of a sentence that starts ‘tall John….’ It finds a few parses that start this way. However, this grammar was specially chosen to place the highest probability on such sentences. Try looking for completions of ‘salty soup…’ and you will be less happy.
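The next-word computation can be sketched in plain JavaScript with a toy grammar. The grammar and probabilities below are made up for illustration and are not the ones used in the text; the structure (enumerate derivations to a bounded depth, condition on a prefix, marginalize the following word) is the point:

```javascript
// Sketch (plain JavaScript): a toy PCFG, enumerated to a fixed depth, used to
// compute the distribution on the next word after a given prefix.
// Hypothetical grammar: rules map a symbol to [right-hand side, probability].
const grammar = {
  S:  [[["NP", "VP"], 1.0]],
  NP: [[["N"], 0.7], [["A", "NP"], 0.3]],
  VP: [[["V"], 1.0]],
  N:  [[["John"], 0.5], [["soup"], 0.5]],
  A:  [[["tall"], 0.5], [["salty"], 0.5]],
  V:  [[["runs"], 0.6], [["spills"], 0.4]],
};

// Expand a symbol into all (yield, probability) pairs, up to maxDepth.
function expand(sym, maxDepth) {
  if (!(sym in grammar)) return [[[sym], 1]];   // terminal word
  if (maxDepth === 0) return [];                // prune unbounded recursion
  const out = [];
  for (const [rhs, p] of grammar[sym]) {
    let partial = [[[], p]];
    for (const child of rhs) {
      const next = [];
      for (const [words, q] of partial) {
        for (const [w2, q2] of expand(child, maxDepth - 1)) {
          next.push([words.concat(w2), q * q2]);
        }
      }
      partial = next;
    }
    out.push(...partial);
  }
  return out;
}

// Distribution on the word that follows a given prefix.
function nextWord(prefix) {
  const dist = {};
  for (const [words, p] of expand("S", 8)) {
    const matches = prefix.every((w, i) => words[i] === w);
    if (matches && words.length > prefix.length) {
      const w = words[prefix.length];
      dist[w] = (dist[w] || 0) + p;
    }
  }
  const z = Object.values(dist).reduce((a, b) => a + b, 0);
  for (const w of Object.keys(dist)) dist[w] /= z;
  return dist;
}
```

As in the HMM, the condition (the sentence must start with the prefix) only rules a derivation out once enough of its yield has been generated, so naive enumeration builds many derivations that are doomed from their first word.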

Decomposing and interleaving factors

To see how we can provide evidence earlier in the execution for models such as the above, first consider a simpler model:

First of all, we can clearly move the factor up to the point where its dependencies are first bound. In general, factor statements can be moved anywhere within the same control scope in which they started (i.e., they must be reached in the same program executions and must not cross a marginalization boundary). In this case:
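To make the payoff concrete, here is a sketch in plain JavaScript of a hypothetical model of this kind: three fair flips conditioned on (a || b), enumerated once with the factor in its original late position and once with the factor moved up to where a and b are bound. Both yield the same marginal, but the moved-up version considers fewer partial executions:

```javascript
// Sketch (plain JavaScript, made-up model): moving a factor up prunes
// doomed branches early without changing the marginal distribution.
function run(moveFactorUp) {
  let explored = 0;          // partial executions considered
  const results = {};        // marginal on a + b + c
  for (const a of [true, false]) {
    explored++;
    for (const b of [true, false]) {
      explored++;
      // Moved-up factor: (a || b) depends only on a and b, so it can be
      // checked as soon as b is bound.
      if (moveFactorUp && !(a || b)) continue;
      for (const c of [true, false]) {
        explored++;
        // Late factor: the same condition, but only checked after c.
        if (!moveFactorUp && !(a || b)) continue;
        const v = (a ? 1 : 0) + (b ? 1 : 0) + (c ? 1 : 0);
        results[v] = (results[v] || 0) + 0.125;
      }
    }
  }
  const z = Object.values(results).reduce((x, y) => x + y, 0);
  for (const k of Object.keys(results)) results[k] /= z;
  return { results, explored };
}
```

The moved-up version never extends the (false, false) branch, while the late version samples c twice on that branch before discovering it has probability zero.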

Exposing the intermediate state for HMM and PCFG

In order to apply the above tricks (decomposing factors and moving them earlier) to more complex models, it helps to put the model into a form that explicitly constructs the intermediate states.
This version of the HMM is equivalent to the earlier one, but recurses the other way, passing along the partial state sequences:
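A sketch of this forward style in plain JavaScript (with the same made-up parameters as before, and the initial state fixed to false for brevity): the recursion now threads the partial {states, observations} record forward, and the evidence is still applied only at the end, so the model is unchanged in meaning:

```javascript
// Sketch (plain JavaScript, illustrative parameters): the HMM rewritten to
// recurse forward, passing along the partial state and observation sequences.
const transition = (s) => (s ? [[true, 0.7], [false, 0.3]]
                             : [[true, 0.3], [false, 0.7]]);
const observeState = (s) => (s ? [[true, 0.9], [false, 0.1]]
                               : [[true, 0.1], [false, 0.9]]);

// Returns all weighted complete runs: [{states, observations, p}].
function hmmRecur(n, partial) {
  if (n === 0) return [partial];
  const s = partial.states[partial.states.length - 1];
  const out = [];
  for (const [s2, pT] of transition(s)) {
    for (const [o, pO] of observeState(s2)) {
      out.push(...hmmRecur(n - 1, {
        states: partial.states.concat([s2]),
        observations: partial.observations.concat([o]),
        p: partial.p * pT * pO,
      }));
    }
  }
  return out;
}

const trueObs = [false, false, false];   // made-up observed sequence
const runs = hmmRecur(3, { states: [false], observations: [], p: 1 });
// The final factor, exactly as before: keep runs whose observations match.
const kept = runs.filter((r) =>
  r.observations.every((o, i) => o === trueObs[i]));
```

Nothing has been gained yet; the gain comes next, when the explicit intermediate state lets us check each observation inside the recursion.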

Incrementalizing the HMM and PCFG

We can now decompose and move factors. In the HMM, the single factor factor( arrayEq(r.observations, trueObs) ? 0 : -Infinity ) can be decomposed into factor(r.observations[0]==trueObs[0] ? 0 : -Infinity); factor(r.observations[1]==trueObs[1] ? 0 : -Infinity); .... These per-observation factors can then be moved ‘up’ into the recursion to give:
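The effect of this transformation can be sketched in plain JavaScript (illustrative parameters, initial state fixed to false). The same HMM is run twice: once with the single factor at the end, and once with the decomposed factors applied as each observation is generated:

```javascript
// Sketch (plain JavaScript, illustrative parameters): per-step factors moved
// into the recursion. Mismatching partial runs are pruned immediately instead
// of surviving to the end.
const transition = (s) => (s ? [[true, 0.7], [false, 0.3]]
                             : [[true, 0.3], [false, 0.7]]);
const observeState = (s) => (s ? [[true, 0.9], [false, 0.1]]
                               : [[true, 0.1], [false, 0.9]]);
const trueObs = [false, false, false];   // made-up observed sequence

// incremental=false: check all observations at the end (the original model).
// incremental=true: factor(obs[i]==trueObs[i] ? 0 : -Infinity) at each step.
function hmm(n, incremental) {
  let explored = 0;
  let paths = [{ states: [false], obs: [], p: 1 }];
  for (let i = 0; i < n; i++) {
    const next = [];
    for (const path of paths) {
      const s = path.states[path.states.length - 1];
      for (const [s2, pT] of transition(s)) {
        for (const [o, pO] of observeState(s2)) {
          explored++;
          if (incremental && o !== trueObs[i]) continue;  // early factor
          next.push({ states: path.states.concat([s2]),
                      obs: path.obs.concat([o]),
                      p: path.p * pT * pO });
        }
      }
    }
    paths = next;
  }
  const kept = paths.filter((r) => r.obs.every((o, i) => o === trueObs[i]));
  const z = kept.reduce((a, r) => a + r.p, 0);
  return { explored,
           posterior: kept.map((r) => ({ states: r.states, p: r.p / z })) };
}
```

The incremental version explores far fewer partial executions (28 versus 84 for n = 3 here) while computing exactly the same posterior, because the decomposed factors kill each mismatching branch the moment it appears.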

sampleWithFactor

It is fairly common to end up with a factor that provides evidence just after the sampled value it depends on. If we separate sample and factor, we will often explore sample paths that the factor will shortly tell us are very bad. To account for this, we introduce a compound operator sampleWithFactor, which takes a distribution (like sample) and also a function that is applied to the sampled value to compute a score for factor. By default, marginalization functions simply treat sampleWithFactor(dist,params,scoreFn) as var v = sample(dist,params); factor(scoreFn(v)); however, some implementations use this information more efficiently. The WebPPL enumerate inference method adds the additional score to a state’s score as the state is added to the queue, which means the additional score is included when prioritizing which states to explore next.
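The default semantics can be sketched in plain JavaScript: over a discrete distribution (a list of [value, probability] pairs), sampleWithFactor yields the same weighted support as sampling and then factoring, i.e., each value keeps its sampling probability multiplied by exp(score). The distribution and target below are made up:

```javascript
// Sketch (plain JavaScript): the default meaning of sampleWithFactor.
// sampleWithFactor(dist, scoreFn) behaves exactly like
//   var v = sample(dist); factor(scoreFn(v));
// The difference is operational: the score is known at enqueue time, so a
// best-first enumerator can use it when ranking partial executions.
function sampleWithFactor(dist, scoreFn) {
  return dist.map(([v, p]) => [v, p * Math.exp(scoreFn(v))]);
}

// A hard constraint as the score function: only values equal to `target`
// survive (exp(-Infinity) is 0).
const target = true;   // made-up constraint
const weighted = sampleWithFactor(
  [[true, 0.1], [false, 0.9]],
  (v) => (v === target ? 0 : -Infinity));
```

Because the zero weight is attached at the moment of sampling, an enumerator never needs to dequeue and extend the doomed branch at all.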

More usefully, for the HMM, this trick allows us to ensure that each newObs will be equal to the observed trueObs. We first marginalize out observeState(..) to get an immediate distribution from which to sample, and then use sampleWithFactor(..) to simultaneously sample and incorporate the factor:
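One extension step of the HMM done both ways can be sketched in plain JavaScript (illustrative parameters). On the left, the observation is sampled and then hit with a hard factor; on the right, the observation is marginalized out and a sampleWithFactor-style step scores each next state by the probability of emitting the required observation. The weighted next states are identical, but the second version never creates doomed branches:

```javascript
// Sketch (plain JavaScript, illustrative parameters): sample-then-factor
// versus sampleWithFactor for a single HMM step.
const transition = (s) => (s ? [[true, 0.7], [false, 0.3]]
                             : [[true, 0.3], [false, 0.7]]);
const pObs = (s, o) => {          // P(observation = o | state = s)
  const pTrue = s ? 0.9 : 0.1;
  return o ? pTrue : 1 - pTrue;
};

// Version 1: sample obs, then factor(obs == required ? 0 : -Infinity).
function stepSampleThenFactor(s, required) {
  const out = [];
  for (const [s2, pT] of transition(s)) {
    for (const o of [true, false]) {
      const w = pT * pObs(s2, o) * (o === required ? 1 : 0);
      if (w > 0) out.push([s2, w]);
    }
  }
  return out;
}

// Version 2: sampleWithFactor over next states, scoring each by the
// log-probability that it emits the required observation.
function stepSampleWithFactor(s, required) {
  return transition(s).map(([s2, pT]) =>
    [s2, pT * Math.exp(Math.log(pObs(s2, required)))]);
}
```

In version 2 every enqueued state already carries the evidence from its observation, which is exactly what lets enumerate prioritize correctly.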

We can still insert ‘heuristic’ factors that will help the inference algorithm explore more effectively, as long as they cancel by the end. That is, factor(s); factor(-s) has no effect on the meaning of the model, and so is always allowed (even if the two factors are separated, as long as they aren’t separated by a marginalization operator). For instance:
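The cancellation property can be sketched in plain JavaScript on a made-up two-flip model: a heuristic score h is added after the first flip and subtracted before the real final factor. Since factor(s); factor(-s) contributes exp(s) * exp(-s) = 1 to every path's weight, the final distribution is unchanged:

```javascript
// Sketch (plain JavaScript, made-up model): canceling heuristic factors
// leave the distribution untouched, while giving intermediate executions
// more informative scores for a best-first enumerator to rank by.
const flips = [[true, 0.5], [false, 0.5]];
const heuristic = (a) => (a ? 1 : -1);   // hypothetical guess at final score

function enumerateModel(useHeuristic) {
  const out = [];
  for (const [a, pa] of flips) {
    let logW = Math.log(pa);
    if (useHeuristic) logW += heuristic(a);       // factor(h)
    for (const [b, pb] of flips) {
      let w = logW + Math.log(pb);
      if (useHeuristic) w -= heuristic(a);        // factor(-h): cancels
      w += (a ? 1 : 0) + (b ? 1 : 0);             // the real final factor
      out.push([[a, b], Math.exp(w)]);
    }
  }
  const z = out.reduce((acc, [, p]) => acc + p, 0);
  return out.map(([v, p]) => [v, p / z]);
}
```

Between the two factor statements, paths with a = true carry a higher score, so an enumerator that prioritizes by score would extend them first; once the canceling factor fires, the weights are exactly what they would have been anyway.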

This will work pretty much any time you have ‘guesses’ about what the final factor will be while you are executing your program, especially if these guesses improve incrementally and steadily. For examples of this technique, see the incremental semantic parsing example and the vision example.

There is no reason not to learn heuristic factors that help guide search: as long as they cancel by the end, they won’t compromise the correctness of the computed distribution (in the limit). While it wouldn’t be worth the expense to learn heuristic factors for a single marginalization, it may be very useful to do so across multiple related marginal distributions; this is an example of amortized or meta-inference. (This is a topic of ongoing research by the authors.)