Nesting Probabilistic Programs

Tom Rainforth
Department of Statistics
University of Oxford
rainforth@stats.ox.ac.uk

Abstract

We formalize the notion of nesting probabilistic programming queries
and investigate the resulting statistical implications. We demonstrate
that query nesting allows the definition of models which could not
otherwise be expressed, such as those involving agents reasoning about other agents,
but that existing systems take approaches that lead to
inconsistent estimates.
We show how to correct this by delineating possible ways one might want to nest
queries and asserting the respective conditions required for convergence.
We further introduce, and prove the correctness of,
a new online nested Monte Carlo estimation method that
makes it substantially easier to ensure these conditions are met, thereby
providing a simple framework for designing statistically correct inference
engines.

1 Introduction

Probabilistic programming systems (PPSs) allow probabilistic models to be represented in the
form of a generative model and statements for conditioning on data (Goodman et al., 2008; Gordon et al., 2014).
Informally, one can think of the generative model as the definition of a prior, the conditioning
statements as the definition of a likelihood, and the output of the program as samples from a posterior distribution.
Their core philosophy is to decouple model specification and inference, the former
corresponding to the user-specified program code and the latter to an inference engine
capable of operating on arbitrary programs. Removing the need for users to write
inference algorithms significantly reduces the burden of developing new models and
makes effective statistical methods accessible to non-experts.

Some, so-called universal, systems (Goodman et al., 2008; Goodman and Stuhlmüller, 2014; Mansinghka et al., 2014; Wood et al., 2014)
further allow the definition of models that would be hard, or
even impossible, to convey using conventional frameworks such as graphical models.
One enticing manner they do this is by allowing
arbitrary nesting of models, known in the probabilistic programming literature as queries (Goodman et al., 2008),
such that it is easy to define and run problems that fall outside the standard inference
framework (Goodman et al., 2008; Mantadelis and Janssens, 2011; Stuhlmüller and Goodman, 2014; Le et al., 2016).
This allows the definition of models that could not be encoded without nesting, such as experimental
design problems (Ouyang et al., 2016) and various models for theory-of-mind (Stuhlmüller and Goodman, 2014).
In particular, models that involve
agents reasoning about other agents require, in general,
some form of nesting.
For example, one might use such nesting to model a poker player reasoning
about another player as shown in Appendix A.
Here the nesting is essential
because each player has access to information the other player does not and must
carry out their own inference to try and unearth this information.
As machine learning increasingly tackles problem
domains that require interaction with humans or other external systems, such as
the need for self-driving cars to account for the behavior of pedestrians, we believe
that such nested problems are likely to become increasingly common
and that PPSs will form a powerful tool for encoding them.

However, previous work in the probabilistic-programming literature has, in general,
implicitly, and incorrectly, assumed that the convergence results from standard inference
schemes carry over directly to the nested setting. In practice,
performing inference for nested queries falls outside the
scope of conventional convergence proofs and so additional
work is required to prove the validity of PPS inference engines for nested queries.
Such problems constitute special cases of nested estimation.
In particular, the use of Monte Carlo (MC) methods by most PPSs means they form
particular instances
of nested Monte Carlo (NMC) estimation (Hong and Juneja, 2009).
Recent work (Rainforth et al., 2016a, 2017a; Fort et al., 2017)
has demonstrated the
consistency and convergence rates for NMC estimation for general classes of models.
These results show that NMC is consistent, but
that it entails a convergence rate in the total computational cost
which decreases exponentially
with the depth of the nesting. Furthermore, additional assumptions are
required to achieve this convergence, most noticeably that, except in a few special cases,
one needs to drive not only the total number of samples used to infinity, but also
the number of samples used at each layer of the estimator,
a requirement generally flouted by existing PPSs.

The aim of this work is to formalize the notion of query nesting
and use these recent NMC results
to investigate the statistical correctness of the resulting procedures
carried out by PPS inference engines. To do this,
we postulate that there are three distinct ways one might nest one query within another:
sampling from the conditional distribution of another query (which
we refer to as nested inference),
factoring the
trace probability of one query with the partition function estimate of another
(which we refer to as nested observation),
and using expectation estimates calculated using one query as
first-class variables in another. We use the aforementioned NMC results
to assess the relative correctness of each of these categories of nesting.
In the interest of exposition, we will mostly focus on the PPS
Anglican (Tolpin et al., 2016; Wood et al., 2014)
(and also occasionally Church (Goodman et al., 2008))
as a basis for our discussion,
but note that our results apply more generally. For example, our
nested inference case covers the problem of sampling from
cut distributions in OpenBugs (Plummer, 2015).

We find that nested inference is statistically challenging and incorrectly
handled by existing
systems, while nested observation is
statistically straightforward and done correctly.
Using estimates as variables turns out
to be exactly equivalent to generic NMC estimation and must thus be dealt with on
a case-by-case basis.
Consequently, we will focus more on nested inference than the other cases.

To assist in the development of consistent approaches, we further
introduce a new online NMC scheme and show its convergence rate
is only nominally worse than conventional NMC.
This scheme obviates the need to revisit previous samples when refining estimates, thereby substantially simplifying the
process of writing a consistent online nested inference scheme,
as required by most PPSs.

3.1 Nested Monte Carlo

We start by providing a brief introduction to NMC, using similar notation to
that of Rainforth et al. (2017a).
Conventional Monte Carlo (MC) estimation approximates an intractable expectation $\gamma_0$ of a
function $\lambda$ using

\[
\gamma_0 = \mathbb{E}\left[\lambda(y^{(0)})\right] \approx I_0 := \frac{1}{N_0}\sum_{n=1}^{N_0} \lambda\left(y_n^{(0)}\right) \tag{1}
\]

where $y_n^{(0)} \overset{\mathrm{i.i.d.}}{\sim} p(y^{(0)})$, resulting in a mean squared error (MSE) that decreases at a rate $O(1/N_0)$.
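As a minimal illustrative sketch (our own construction, not code from any PPS), plain MC estimation of $\mathbb{E}[\lambda(y^{(0)})]$ can be written as follows; the function names are assumptions for illustration only:

```python
import random

def mc_estimate(lam, sample_y, n):
    """Plain Monte Carlo: average lam over n i.i.d. draws from sample_y."""
    return sum(lam(sample_y()) for _ in range(n)) / n

# Example: gamma_0 = E[y^2] for y ~ N(0, 1), which is exactly 1.
random.seed(0)
est = mc_estimate(lambda y: y * y, lambda: random.gauss(0.0, 1.0), 100_000)
```

With $N_0 = 10^5$ samples the estimate lands close to the true value of 1, with error shrinking at the usual $O(1/\sqrt{N_0})$ rate in root MSE.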
For nested estimation
problems, $\lambda(y^{(0)})$ is itself intractable, corresponding to a
nonlinear mapping of a (nested) estimate. Thus, in the singly nested case,
$\lambda(y^{(0)}) = f_0\big(y^{(0)}, \mathbb{E}\big[f_1(y^{(0)}, y^{(1)}) \,\big|\, y^{(0)}\big]\big)$, giving

\[
\gamma_0 = \mathbb{E}\Big[f_0\Big(y^{(0)}, \mathbb{E}\big[f_1(y^{(0)}, y^{(1)}) \,\big|\, y^{(0)}\big]\Big)\Big]
\approx I_0 := \frac{1}{N_0}\sum_{n=1}^{N_0} f_0\left(y_n^{(0)}, \frac{1}{N_1}\sum_{m=1}^{N_1} f_1\left(y_n^{(0)}, y_{n,m}^{(1)}\right)\right)
\]

where each $y_{n,m}^{(1)} \sim p(y^{(1)} | y_n^{(0)})$ is drawn independently and $I_0$ is now
a NMC estimate using $T = N_0 N_1$ samples.
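To make the singly nested case concrete, here is a hypothetical Python sketch (our own toy construction, not from the paper) for a model where the inner expectation is analytically $\mathbb{E}[y^{(1)} | y^{(0)}] = y^{(0)}$ and $f_0$ squares it, so $\gamma_0 = \mathbb{E}[(y^{(0)})^2] = 1$:

```python
import random

def nmc_single(f0, f1, sample_y0, sample_y1_given, n0, n1):
    """Singly nested Monte Carlo: I0 = (1/N0) sum of f0(y0, inner MC estimate)."""
    total = 0.0
    for _ in range(n0):
        y0 = sample_y0()
        inner = sum(f1(y0, sample_y1_given(y0)) for _ in range(n1)) / n1
        total += f0(y0, inner)
    return total / n0

# y0 ~ N(0,1), y1 | y0 ~ N(y0,1), f1(y0,y1) = y1, f0(y0, g) = g**2.
# The inner expectation is y0, so gamma_0 = E[y0^2] = 1, with O(1/N1) bias
# from pushing the noisy inner estimate through the nonlinear f0.
random.seed(1)
est = nmc_single(lambda y0, g: g * g,
                 lambda y0, y1: y1,
                 lambda: random.gauss(0.0, 1.0),
                 lambda y0: random.gauss(y0, 1.0),
                 n0=4000, n1=100)
```

Note the bias term: $\mathbb{E}[\text{inner}^2] = (y^{(0)})^2 + 1/N_1$, so the estimator is only consistent if $N_1 \to \infty$ as well as $N_0 \to \infty$.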

More generally, one may have multiple layers of nesting. To notate this, we first presume some fixed
integral depth $D \ge 0$ (with $D=0$ corresponding to conventional estimation), and real-valued functions $f_0,\dots,f_D$.
We then recursively define $y^{(k)} \sim p(y^{(k)} | y^{(0:k-1)})$,

\[
\gamma_D(y^{(0:D-1)}) = \mathbb{E}\left[f_D(y^{(0:D)}) \,\middle|\, y^{(0:D-1)}\right], \quad\text{and}
\]
\[
\gamma_k(y^{(0:k-1)}) = \mathbb{E}\left[f_k\big(y^{(0:k)}, \gamma_{k+1}(y^{(0:k)})\big) \,\middle|\, y^{(0:k-1)}\right]
\]

for $0 \le k < D$. Our goal is to estimate $\gamma_0 = \mathbb{E}[f_0(y^{(0)}, \gamma_1(y^{(0)}))]$, for which the NMC estimate is $I_0$, defined
recursively using

\[
I_D(y^{(0:D-1)}) = \frac{1}{N_D}\sum_{n_D=1}^{N_D} f_D\left(y^{(0:D-1)}, y_{n_D}^{(D)}\right) \quad\text{and}
\]
\[
I_k(y^{(0:k-1)}) = \frac{1}{N_k}\sum_{n_k=1}^{N_k} f_k\left(y^{(0:k-1)}, y_{n_k}^{(k)}, I_{k+1}\big(y^{(0:k-1)}, y_{n_k}^{(k)}\big)\right)
\]

for $0 \le k < D$, where each $y_{n_k}^{(k)} \sim p(y^{(k)} | y^{(0:k-1)})$ is drawn
independently. Note that there are multiple values of $y_{n_k}^{(k)}$ for each
associated $y^{(0:k-1)}$
and that $I_k(y^{(0:k-1)})$ is still a random variable given $y^{(0:k-1)}$, while
the overall cost of the NMC estimator is $T = N_0 N_1 \cdots N_D$.
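The recursive definition above can be sketched directly as a recursive function. This is our own hypothetical implementation (the call signatures are assumptions, not an API from any system):

```python
import random

def nmc(fs, samplers, ns, ys=None, k=0):
    """Recursive NMC estimator I_k.

    fs[k] takes (ys_prefix, y_k, inner_estimate) for k < D and (ys_prefix, y_D)
    at the final depth D; samplers[k] draws y^(k) given the prefix y^(0:k-1);
    ns[k] is the number of samples N_k used at depth k.
    """
    ys = ys or []
    D = len(fs) - 1
    total = 0.0
    for _ in range(ns[k]):
        y_k = samplers[k](ys)
        if k == D:
            total += fs[k](ys, y_k)
        else:
            total += fs[k](ys, y_k, nmc(fs, samplers, ns, ys + [y_k], k + 1))
    return total / ns[k]

# Depth D = 1 toy example with gamma_0 = E[y0^2] = 1 (the inner mean is y0).
random.seed(2)
est = nmc(
    fs=[lambda ys, y0, inner: inner ** 2,       # f0: square the inner estimate
        lambda ys, y1: y1],                     # f1: identity in y1
    samplers=[lambda ys: random.gauss(0.0, 1.0),
              lambda ys: random.gauss(ys[0], 1.0)],
    ns=[4000, 100],
)
```

The total cost is the product of the `ns` entries, matching $T = N_0 N_1 \cdots N_D$.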

As shown by Rainforth et al. (2017a), if each $f_k$ is continuously differentiable
and

\[
\varsigma_k^2 := \mathbb{E}\left[\Big(f_k\big(y^{(0:k)}, \gamma_{k+1}(y^{(0:k)})\big) - \gamma_k(y^{(0:k-1)})\Big)^2\right] < \infty \quad \forall k \in \{0,\dots,D\},
\]

then the MSE converges at rate

\[
\mathbb{E}\left[(I_0 - \gamma_0)^2\right] \le \frac{\varsigma_0^2}{N_0} + \left(\frac{C_0 \varsigma_1^2}{2 N_1} + \sum_{k=0}^{D-2}\left(\prod_{d=0}^{k} K_d\right)\frac{C_{k+1}\varsigma_{k+2}^2}{2 N_{k+2}}\right)^2 + O(\epsilon) \tag{2}
\]

where $K_k$ and $C_k$ are bounds on the magnitude of the first and
second derivatives of $f_k$ respectively, and $O(\epsilon)$ represents asymptotically
dominated terms.
Theorem 2 of Rainforth et al. (2017a) further tells us that the continuous differentiability
assumption need only hold almost surely (rather than everywhere) for convergence more generally,
ensuring that functions with measure-zero discontinuity points still, in general, converge.

We see from (2) that if any of the $N_k$ remain fixed, there is a minimum
error that can be achieved: convergence requires each $N_k \to \infty$.
As we will later show,
many of the shortfalls in dealing with nested queries by existing PPSs revolve around
implicitly fixing $N_k$ for all $k \ge 1$ and taking only
$N_0 \to \infty$.

For a given total sample budget $T = N_0 N_1 \cdots N_D$, the bound is tightest when
$\sqrt{N_0} \propto N_1 \propto \dots \propto N_D$, giving a convergence rate of $O(1/T^{2/(D+2)})$.
The intuition behind this potentially surprising optimal setting is that the variance is
mostly dictated by $N_0$ and the bias by the other $N_k$.
We see the unfortunate result that the convergence
rate diminishes exponentially with $D$. However, this optimal setting of the $N_k$
still gives a substantially faster rate than the $O(1/T^{1/(D+1)})$ resulting from
naïvely setting $N_0 \propto N_1 \propto \dots \propto N_D$.
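The optimal allocation can be computed directly: with $\sqrt{N_0} \propto N_k$ for $k \ge 1$, we get $N_0 = T^{2/(D+2)}$ and $N_k = T^{1/(D+2)}$. A small helper illustrating this split (our own sketch, and it ignores rounding slack when $T$ is not a perfect power):

```python
def allocate_budget(T, D):
    """Split a total budget T = N0 * N1 * ... * ND using the optimal
    scaling sqrt(N0) ∝ N1 ∝ ... ∝ ND, i.e. N0 = T^(2/(D+2)), Nk = T^(1/(D+2))."""
    n0 = round(T ** (2 / (D + 2)))
    nk = round(T ** (1 / (D + 2)))
    return n0, [nk] * D

# For D = 1 and T = 10^6 this gives N0 = 10^4 and N1 = 10^2, so N0 * N1 = T.
```

This makes the exponential penalty in $D$ vivid: each extra layer of nesting shrinks the exponent on $T$ that any single layer receives.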

3.2 The Anglican PPS

Anglican is a universal probabilistic programming language integrated into Clojure (Hickey, 2008), a dialect of Lisp.
There are two important ideas to understand for reading Clojure:
almost everything is a function and parentheses cause evaluation.
For example, a + b is coded as
(+ a b), where + is a function taking two arguments
and the parentheses cause the
function to evaluate.

Anglican inherits most of the syntax of Clojure, but extends it with the key
special forms sample and observe (Tolpin et al., 2015, 2016), between which the distribution of the
query is defined.
Informally, sample specifies terms in the prior and observe terms in the
likelihood. More precisely, sample is used to make random draws $x_t \sim f_t(x_t | \Xi_t)$,
where $\Xi_t$ is a subset of the variables in scope at the point of sampling, and observe is used to condition on
data $g_s(y_s | \Lambda_s)$, with $\Lambda_s$ defined in the same way as $\Xi_t$.
The syntax of sample is to take a distribution object as its only input and
return a sample, while observe takes a distribution object and an observation and returns nil, while changing
the program trace probability in Anglican's back-end.
Anglican provides a number of elementary random procedures in
the form of distribution classes for common sampling distributions,
but also allows users to define their own distribution classes using the defdist macro.
Distribution objects are generated by calling the class constructor
with the required parameters, e.g. (normal 0 1).

Anglican queries are written using the macro defquery. This allows users to define a model using a mixture
of sample and observe statements and deterministic code, and bind that model to a variable.
As a simple example,

(defquery my-query [data]
  (let [mu (sample (normal 0 1))
        sigma (sample (gamma 2 2))
        lik (normal mu sigma)]
    (map (fn [obs] (observe lik obs)) data)
    [mu sigma]))

corresponds to a model where we are trying to infer the mean and standard deviation
of a Gaussian given some data. The syntax of defquery is (defquery name [args] body), such
that we are binding the query to the variable my-query here. The query starts by sampling $\mu \sim \mathcal{N}(0,1)$
and $\sigma \sim \Gamma(2,2)$, before constructing a distribution object
lik
to use for the observations. It then maps over each datapoint and observes it under the distribution
lik.
After the observations are made, mu and sigma
are returned from the variable-binding let block and then by proxy the query itself.
Denoting the data as $y_{1:S}$, this particular query defines the correctly normalized joint distribution (note some queries are not normalized)

\[
p(\mu, \sigma, y_{1:S}) = \mathcal{N}(\mu; 0, 1)\, \Gamma(\sigma; 2, 2) \prod_{s=1}^{S} \mathcal{N}(y_s; \mu, \sigma).
\]

Inference on a query is performed using the macro doquery, which produces a lazy infinite sequence of
approximate samples from the conditional distribution and, for appropriate inference algorithms,
an estimate of the partition function.
Its calling syntax is (doquery inf-alg model inputs & options).

Key to our purposes is Anglican’s ability to nest queries within one another,
as done by, for example, Le et al. (2016).
Though doquery is not allowed directly within a defquery block, one can still nest queries by
using the special form conditional, which
takes a query and returns a distribution object constructor, ostensibly corresponding to the conditional distribution
(informally the posterior) defined by the query, for which the inputs to the query become the parameters.
We will introduce conditional in more depth in the next section, where we will show that its true behavior
deviates from this, thereby leading to inconsistent nested inference schemes.

4 Nested Inference

One of the clearest ways one might want to nest queries is by sampling from the conditional
distribution of one query inside another. A number of examples of this are provided
for Church
in Stuhlmüller and Goodman (2014).

Consider the following unnested model using
the Anglican function declaration defm

(defm inner [y D]
  (let [z (sample (gamma y 1))]
    (observe (normal z y) D)
    z))

(defquery outer [D]
  (let [y (sample (beta 2 3))
        z (inner y D)]
    (* y z)))

compared to the following nested query model

(defquery inner [y D]
  (let [z (sample (gamma y 1))]
    (observe (normal z y) D)
    z))

(defquery outer [D]
  (let [y (sample (beta 2 3))
        dist (conditional inner)
        z (sample (dist y D))]
    (* y z)))

where we have replaced defm with a query nesting using
conditional. The unnormalized distribution on
traces for the first model is straightforwardly given by

\[
\pi_1(y, z, D) = p(y)\, p(z|y)\, p(D|y,z) = \textrm{Beta}(y; 2, 3)\, \Gamma(z; y, 1)\, \mathcal{N}(D; z, y^2),
\]

for which we can use conventional Monte Carlo inference schemes. The second model
differs in that $p(z|y)\,p(D|y,z)$ is conditionally normalized
before being used in the outer query:

\[
\pi_2(y, z, D) = p(y)\, p(z|y, D) = p(y)\, \frac{p(z|y)\, p(D|y,z)}{\int p(z|y)\, p(D|y,z)\, dz}
= \frac{p(y)\, p(z|y)\, p(D|y,z)}{p(D|y)} \ne \pi_1(y, z, D).
\]

The key point here is that the partial normalization constant p(D|y) depends on y and so
π2(y,z,D) is doubly intractable because we cannot evaluate our unnormalized target
distribution exactly. By normalizing p(z|y)p(D|y,z) within the outer query,
\conditionalhas changed the distribution defined by the program.
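To see the distinction numerically, the following hypothetical Python sketch (our own construction; the choice $D = 2$ and all sample sizes are assumptions) targets $\pi_2$ by nested importance sampling, using the gamma prior as the inner proposal and returning a single $z$ per outer sample in proportion to the inner weights:

```python
import math
import random

def nested_sample(D, n1):
    """One outer draw targeting pi_2: sample y from its prior, then draw a
    single z approximately from p(z | y, D) by self-normalized importance
    sampling with n1 inner proposals z ~ Gamma(y, 1) (the prior, so the
    inner weights are just the likelihood N(D; z, y^2))."""
    y = random.betavariate(2, 3)
    zs = [random.gammavariate(y, 1.0) for _ in range(n1)]
    # Log-weights for numerical stability; the constant 1/(y*sqrt(2*pi)) is
    # shared across all inner samples for a given y, so it cancels here.
    logws = [-0.5 * ((D - z) / y) ** 2 for z in zs]
    m = max(logws)
    ws = [math.exp(lw - m) for lw in logws]
    z = random.choices(zs, weights=ws, k=1)[0]  # resample one z by weight
    return y, z

# Estimate E[y * z] under pi_2 with D = 2 (biased for any finite n1).
random.seed(3)
est = sum(y * z for y, z in (nested_sample(2.0, 100) for _ in range(2000))) / 2000
```

Setting `n1 = 1` here would recover the "ignore the observe" behavior discussed in the text, since a single unweighted prior draw is returned regardless of its likelihood.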

Another interesting way of looking at this
is that wrapping inner in conditional has “protected” $y$ from the conditioning
in inner (noting
that $\pi_1(y,z,D) \propto p(y|D)\, p(z|y,D)$) such that the observe statement only affects the probability
of $z$ given $y$ and not the probability of $y$. This insight shows us why
the nested inference framework is equivalent to that of sampling from a
cut distribution when there is only a single layer of nesting.

Such nested inference problems fall under a more general framework of inference for so-called
doubly (or multiply) intractable distributions (Murray et al., 2006; Stuhlmüller and Goodman, 2014).
The key feature of these problems is that they include
terms with unknown, parameter dependent, normalization constants. For nested probabilistic programming
queries, this is manifested in conditional normalization within a query.
When there is only a single layer of nesting, these problems
are equivalent to the notion of performing inference in
“cut models” (Plummer, 2015), whereby the sampling of
subsets of the variables is made with factors of the overall likelihood
omitted.

A more realistic demonstration of the utility of defining nested inference problems
is given in Appendix A where we consider the modeling of two poker players.

4.1 Formalization

To formalize the nested inference problem more generally, let y and x denote
all the random variables
of an outer query that are respectively passed and not passed to the inner query.
Further, let z denote all random variables generated in the inner query – for simplicity,
we will assume, without loss of generality, that these are all returned to the outer
query, but that
some may not be used. The unnormalized density for our outer query
can now be written in the form

\[
\pi_o(x, y, z) = \psi(x, y, z)\, p_i(z | y) \tag{3}
\]

where $p_i(z|y)$ is the normalized density of the outputs of the inner query
and $\psi(x,y,z)$ encapsulates all other terms influencing the trace probability
of the outer query. Now the inner query defines an unnormalized density
$\pi_i(y,z)$ that can be evaluated pointwise, and we have

\[
p_i(z | y) = \frac{\pi_i(y, z)}{\int \pi_i(y, z)\, dz} \tag{4}
\]

giving

\[
p_o(x, y, z) \propto \pi_o(x, y, z) = \psi(x, y, z)\, \frac{\pi_i(y, z)}{\int \pi_i(y, z)\, dz} \tag{5}
\]

where $p_o(x,y,z)$ is our target distribution; we can directly evaluate the
numerator, but the denominator is intractable and must be evaluated separately
for each possible value of $y$. Our previous example is recovered by fixing
$\psi(x,y,z) = p(y)$ and $\pi_i(y,z) = p(z|y)\, p(D|y,z)$. We can further
extend straightforwardly to multiple layers of nesting by
recursively defining $\pi_i(y,z)$ in the same way as $\pi_o(x,y,z)$,
rather than presuming that it is available in closed form.

4.2 Relationship to Nested Estimation

To relate the nested inference problem back to
the nested estimation formulation, we consider using a proposal
$q(x,y,z) = q(x,y)\, q(z|y)$ to calculate the expectation of some arbitrary function
$g(x,y,z)$ under $p_o(x,y,z)$ as per self-normalized
importance sampling:

\[
\mathbb{E}_{p_o(x,y,z)}\left[g(x,y,z)\right]
= \frac{\mathbb{E}_{q(x,y)q(z|y)}\left[\dfrac{g(x,y,z)\,\psi(x,y,z)\,\pi_i(y,z)}{q(x,y)\,q(z|y)\,\mathbb{E}_{q(z'|y)}\left[\pi_i(y,z')/q(z'|y)\right]}\right]}{\mathbb{E}_{q(x,y)q(z|y)}\left[\dfrac{\psi(x,y,z)\,\pi_i(y,z)}{q(x,y)\,q(z|y)\,\mathbb{E}_{q(z'|y)}\left[\pi_i(y,z')/q(z'|y)\right]}\right]} \tag{6}
\]

Both the numerator and the denominator are now nested expectations, where
the nonlinearity comes from the fact that we are using the reciprocal of
an expectation, namely $\int \pi_i(y,z)\, dz = \mathbb{E}_{q(z'|y)}\left[\pi_i(y,z')/q(z'|y)\right]$.
A similar reformulation could also be applied in cases with
multiple layers of nesting, i.e. where \cljinner itself makes use of
another query. The result can also be directly extended to the sequential Monte
Carlo (SMC) setting by invoking extended space arguments (Andrieu et al., 2010).

Typically $g(x,y,z)$ is not known upfront and we instead return an
empirical measure from the program in the form of weighted
samples, which can later be used to estimate an expectation. Namely, if we sample $(x_n, y_n) \sim q(x,y)$ and
$z_{n,m} \sim q(z|y_n)$ and return all samples $(x_n, y_n, z_{n,m})$ (such that
each $(x_n, y_n)$ is duplicated $N_1$ times in the sample set),
then our
unnormalized weights are given by

\[
w_{n,m} = \frac{\psi(x_n, y_n, z_{n,m})}{q(x_n, y_n)} \cdot
\frac{\pi_i(y_n, z_{n,m})/q(z_{n,m}|y_n)}{\frac{1}{N_1}\sum_{\ell=1}^{N_1} \pi_i(y_n, z_{n,\ell})/q(z_{n,\ell}|y_n)} \tag{7}
\]

yielding the empirical measure

\[
\hat{p}(x, y, z) = \frac{\sum_{n=1}^{N_0}\sum_{m=1}^{N_1} w_{n,m}\, \delta_{(x_n, y_n, z_{n,m})}(\cdot)}{\sum_{n=1}^{N_0}\sum_{m=1}^{N_1} w_{n,m}} \tag{8}
\]

where $\delta_{(x_n, y_n, z_{n,m})}(\cdot)$ is a delta function centered on $(x_n, y_n, z_{n,m})$.
By definition, the convergence of this empirical measure to the target
requires that
expectation estimates calculated using it converge in
probability for any integrable g(x,y,z) (presuming our proposal
is valid). We thus see that
the convergence of the ratio of nested
expectations in (6) for any arbitrary $g(x,y,z)$ is equivalent
to the produced samples converging to the distribution defined by the program.
The NMC results then tell us this will happen in the limit $N_0, N_1 \to \infty$,
provided that $\int \pi_i(y,z)\, dz$ is strictly positive for all possible $y$ (as otherwise
the problem becomes ill-defined).

This problem equivalence will hold whether we elect to use
this importance sampling based approach or not, while we see that our finite
sample estimates must be biased for non-trivial g
by the convexity of f0 and Theorem 4 of Rainforth et al. (2017a).
Presuming we cannot produce exact samples from the inner query
or exploit one of the special cases (considered later in Section 4.7 and
Section 4.6 respectively), we thus
see that there is no “silver bullet” that can reduce the problem to a standard
estimation.

For example, if we were to use an MCMC sampler for the inner
query then we find that we need to start a new Markov chain for each call,
because each value of y defines a different local inference problem.
Thus we will inevitably produce biased samples for z, giving an asymptotic
bias unless each call of the inner query runs the MCMC sampler for an infinitely
long time; this is thus no better than using importance
sampling. We note that such an approach effectively equates to what
is referred to as multiple imputation by Plummer (2015).

Another possible approach would be to, instead of drawing samples to
estimate (6) directly by NMC, importance sample N1
times for each call of the inner query and then return a single sample from
these, drawn in proportion to the importance weights. However, if
we Rao Blackwellize (Casella and Robert, 1996) the selection of the
sample (which always improves the fidelity
of the estimator),
we find that this recovers (8).

4.3 Shortfalls of Existing Systems

Using the empirical measure (8) provides one
possible manner of producing a consistent estimate of our target
by taking $N_0, N_1 \to \infty$, and so we can use this as a gold-standard
reference approach (with a large value of $N_1$) to assess
whether Anglican returns samples for the correct target distribution.
To this end, we ran Anglican's importance sampling inference
engine on the simple model introduced earlier and compared its output
to the reference approach, using $N_0 = 5 \times 10^6$ and $N_1 = 10^3$.
As shown in Figure 1, the samples produced by Anglican are
substantially different to
the reference code, demonstrating that the outputs do not match
their semantically intended distribution.
For reference, we also considered the distribution induced by the “unnested” model, where conditional is replaced
by defm, and a naïve estimation scheme where a sample budget of $N_1 = 1$ is used for each call
to inner, effectively corresponding to ignoring the observe statement by directly returning the
first draw of $z$. We see that the unnested model defines a noticeably different distribution, while the behavior
of Anglican is similar, but distinct, to ignoring the observe statement in the inner query. Further
investigation shows that the default behavior of conditional in a query-nesting context is equivalent to an un-Rao-Blackwellized form of the
reference code but with $N_1 = 2$, inducing a substantial bias.
More generally, the Anglican source code shows that conditional defines a Markov chain
generated by equalizing the output of the weighted samples generated by running
inference on the query.
When used to nest
queries, this Markov chain is only ever run for a finite length of time, specifically one accept-reject step is carried out, and so does
not produce samples from the true conditional distribution.

Figure 1: Empirical densities produced by running the nested Anglican
queries given in the text, a reference NMC estimate,
the “unnested” model,
a naïve estimation scheme where $N_1 = 1$, and
the online NMC approach introduced in Section 4.4, for which we
set $\tau_1(n_0) = \min(500, \sqrt{n_0})$ and use the same total computational budget
of $T = 5 \times 10^9$ samples.

Plummer (2015) noticed that WinBugs and OpenBugs (Spiegelhalter et al., 1996) similarly
do not provide valid inference when using their cut function primitives, which effectively
allow the definition of nested inference problems. However,
that work does not note the equivalence to the NMC formulation, instead proposing
a heuristic for reducing the bias that itself has no theoretical guarantees.

4.4 Online Nested Monte Carlo

Now that we have established that it will, in general, be necessary to use a
nested estimation scheme for nested inference problems,
it is natural to ask: what behavior do we need for Anglican's conditional,
or query nesting through sampling more generally, to ensure consistency?

At a high level, the NMC results show us that we need the computational
budget of each call to the inner query to become arbitrarily large for convergence,
such that we use an infinite number of samples at each layer of the estimator.
However, this will typically be highly inconvenient to actually implement in
a PPS where one usually desires to provide online estimates in the form of
a lazy sequence of samples that converges to the target distribution.
Imagine that we have already calculated an NMC estimate, but now desire to refine it further.
In general, this will require us to increase all Nk, meaning that we need to revisit
previous samples of the outermost query and refine each of their estimates.
This significantly complicates practical implementation, necessitating additional communication between
queries, introducing overheads, and potentially substantially increasing the memory requirements.

To alleviate these issues, we propose to instead only increase the computational
budget of new calls to the inner query, such that earlier calls use fewer samples
than later calls. This removes the need for
communication between different calls and requires only the
storage of the number of times the outermost query has previously been sampled.
We refer to this approach as online NMC (ONMC),
which, to the best of our knowledge, has not been previously considered in the literature.
As we now show, it only leads to a surprisingly modest
reduction in the convergence rate of the resultant estimator compared to NMC.

Let $\tau_k(n_0) \in \mathbb{N}^+$, $k = 1, \dots, D$ be monotonically increasing
functions dictating the number
of samples used by ONMC at depth $k$ for the $n_0$-th iteration of
the outermost estimator. By contrast, the standard NMC formulation will
use $N_k = \tau_k(N_0)$ such samples. The ONMC estimator is now defined recursively
using

\[
J_0 = \frac{1}{N_0}\sum_{n_0=1}^{N_0} f_0\left(y_{n_0}^{(0)}, J_1\big(y_{n_0}^{(0)}, n_0\big)\right)
\]
\[
J_k(y^{(0:k-1)}, n_0) = \frac{1}{\tau_k(n_0)}\sum_{n_k=1}^{\tau_k(n_0)} f_k\left(y^{(0:k-1)}, y_{n_k}^{(k)}, J_{k+1}\big(y^{(0:k-1)}, y_{n_k}^{(k)}, n_0\big)\right)
\]
\[
J_D(y^{(0:D-1)}, n_0) = \frac{1}{\tau_D(n_0)}\sum_{n_D=1}^{\tau_D(n_0)} f_D\left(y^{(0:D-1)}, y_{n_D}^{(D)}\right)
\]

such that later samples of the outermost estimator use more inner samples.
We are now ready to show the consistency of the ONMC estimator.
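The single-nesting case of the definition above can be sketched as follows (our own hypothetical construction, reusing the earlier toy model where $\gamma_0 = \mathbb{E}[(y^{(0)})^2] = 1$; the schedule is an assumption):

```python
import math
import random

def onmc_single(f0, f1, sample_y0, sample_y1_given, n0_total, tau1):
    """Online NMC for a single nesting level: the n0-th outer sample uses
    tau1(n0) inner samples, so refinement never revisits earlier calls."""
    total = 0.0
    for n0 in range(1, n0_total + 1):
        y0 = sample_y0()
        n1 = tau1(n0)
        inner = sum(f1(y0, sample_y1_given(y0)) for _ in range(n1)) / n1
        total += f0(y0, inner)
    return total / n0_total

# Toy model: gamma_0 = E[y0^2] = 1; schedule tau1(n0) = max(25, floor(sqrt(n0))).
random.seed(4)
est = onmc_single(lambda y0, g: g * g,
                  lambda y0, y1: y1,
                  lambda: random.gauss(0.0, 1.0),
                  lambda y0: random.gauss(y0, 1.0),
                  n0_total=3000,
                  tau1=lambda n0: max(25, math.isqrt(n0)))
```

Unlike the earlier NMC sketch, extending this estimate to a larger `n0_total` only adds new (better-resolved) terms; no earlier inner estimate ever needs refining.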
We are now ready to show the consistency of the ONMC estimator.

Theorem 1.

If each $\tau_k(n_0) \ge A \log(n_0)$ for all $n_0 > B$, for some constants $A, B > 0$, then the mean
squared error $\mathbb{E}\left[(J_0 - \gamma_0)^2\right]$ converges to zero as $N_0 \to \infty$.

Proof.

Let $\hat{f}_{n_0} := f_0\big(y_{n_0}^{(0)}, J_1(y_{n_0}^{(0)}, n_0)\big)$
and let $I_0(n_0)$
be a NMC estimator that uses $\tau_k(n_0)$ samples at each layer. We have

\[
\mathbb{E}\left[(J_0 - \gamma_0)^2\right] \le \frac{\varsigma_0^2}{N_0} + O(\epsilon) + \frac{1}{N_0}\sum_{n_0=1}^{N_0}\left(\sum_{k=1}^{D} \frac{a_k}{\tau_k(n_0)}\right)^2 \tag{9}
\]

where, as before, $O(\epsilon)$ represents asymptotically dominated terms. Here the first two
terms clearly tend to zero as $N_0 \to \infty$. For the last term, we
invoke Kronecker's lemma, which tells us that $\lim_{N\to\infty} \frac{1}{N}\sum_{n=1}^{N} X_n = 0$
if $\sum_{n=1}^{\infty} X_n/n < \infty$. Except for the $\tau_k(n_0)$, all terms in the sum are constants
unrelated to $n_0$, such that we have $X_n = \big(\sum_{k=1}^{D} a_k/\tau_k(n)\big)^2$ for some constants $a_k$.
Thus, using our assumptions on $\tau_k(n_0)$, we have
that $X_n \le \beta/(\log(n))^2$ for all $n > B$ and some constant $\beta$. Therefore,

\[
\sum_{n=1}^{\infty} \frac{X_n}{n} \le \sum_{n \le B} \frac{X_n}{n} + \beta \sum_{n=2}^{\infty} \frac{1}{n (\log n)^2} < \infty \tag{10}
\]

because the first sum has finitely many finite terms (remembering $\tau_k(n_0) \in \mathbb{N}^+$) and the Cauchy condensation test
tells us that $\sum_{n=2}^{\infty} 1/(n(\log n)^2)$ is finite. We have thus shown that the last
term in (9) converges to zero as $N_0 \to \infty$, and thus that $\mathbb{E}[(J_0 - \gamma_0)^2]$
also does, giving the required result.
∎
∎

We see that the ONMC approach converges provided that the $\tau_k(n_0)$
increase at a logarithmic or faster rate. Though some sub-logarithmic schedules
still converge, not all do. For example, $\tau_k(n_0) = \sqrt{\log n_0}$ does not.

In the case where τk(n0) increases at a
polynomial rate, we can further quantify the rate of convergence.

Corollary 1.

If each $\tau_k(n_0) \ge c\, n_0^{\alpha}$ for some constants $c, \alpha > 0$, then

\[
\mathbb{E}\left[(J_0 - \gamma_0)^2\right] \le \frac{\varsigma_0^2}{N_0} + O\!\left(\frac{H_{2\alpha}[N_0]}{N_0}\right) + O(\epsilon) \tag{11}
\]

where $H_{2\alpha}[N_0] := \sum_{n_0=1}^{N_0} n_0^{-2\alpha}$ is the $N_0$-th generalized harmonic number of order $2\alpha$.

Proof.

The result follows straightforwardly from
combining our polynomial bound on the $\tau_k$ with (9), then
substituting $H_{2\alpha}[N_0]$ for the resulting
$\sum_{n_0=1}^{N_0} n_0^{-2\alpha}$ term.
∎

The function $H_{2\alpha}[N_0]$ is strictly monotonically increasing for all $\alpha$
and sublinear for all $\alpha > 0$. The latter property means that our bound
goes to zero, as required by Theorem 1, while the former
means that the convergence rate is slower than that of the corresponding NMC estimator.
This slower rate is partially mitigated by the fact
that the computational cost of ONMC is less than that of NMC. However, it turns out
that this speed-up is only a constant factor, such
that the total computational cost $T$ of ONMC is $O(N_0^{1+D\alpha})$ if each $\tau_k(n_0) = O(n_0^{\alpha})$,
as per NMC.
For NMC, this leads to a convergence rate of $O(T^{-1/(1+D\alpha)} + T^{-2\alpha/(1+D\alpha)})$,
which is minimized when $\alpha = 1/2$, giving a rate of $O(T^{-2/(2+D)})$. For ONMC,
we instead get the convergence rate

\[
O\!\left(T^{-\frac{1}{1+D\alpha}} + T^{-\frac{2\alpha}{1+D\alpha}}\, H_{2\alpha}\!\left[T^{\frac{1}{1+D\alpha}}\right]\right). \tag{12}
\]

For $\alpha = 1/2$, $H_{2\alpha}[T^{1/(1+D\alpha)}]$ grows as $\log(T^{1/(1+D\alpha)})$,
which is $O(\log T)$, giving an overall rate of $O(T^{-2/(2+D)} \log T)$. Now for any
$\alpha > 1/2$, the first term in (12) will ensure a worse rate than this.
Similarly, for any $\alpha < 1/2$, the $T^{-2\alpha/(1+D\alpha)}$ term will again
give a worse rate (remembering that $H_{2\alpha}[\cdot]$ is monotonically increasing).
We thus see that $\alpha = 1/2$ is also optimal for ONMC and that, with this setting,
its convergence rate is a factor of $\log T$ worse than that of NMC.

To test ONMC empirically, we consider the simple analytic model given in Section 6.1 of Rainforth et al. (2017a),
setting $\tau_1(n_0) = \max(25, \sqrt{n_0})$. The rationale for setting a minimum value of $N_1$ is that we
want to minimize the burn-in effect of ONMC – earlier samples will be more biased than later samples
and we can mitigate this by ensuring a minimum value for $N_1$. More generally, we recommend setting
(in the absence of other information)
$\tau_1(n_0) = \tau_2(n_0) = \dots = \tau_D(n_0) = \max(T_{\min}^{1/3}, \sqrt{n_0})$, where $T_{\min}$ is
a minimum overall budget we expect to spend. In Figure 2, we have chosen to set $T_{\min}$
deliberately low so as to emphasize the differences between NMC and ONMC. Given our value for $T_{\min}$,
the ONMC approach is identical to fixing $N_1 = 25$ for $T < 25^3 = 15625$, but unlike fixing $N_1$, it continues to
improve beyond this point because it is not limited by asymptotic bias. Instead, we see an inflection-point-like
behavior around $T_{\min}$, with the rate recovering to effectively match that of the NMC estimator. We
see that, for this example, the error of ONMC after using $T = 10^7$ samples is around a factor of two of
that of NMC, which we believe will typically be an acceptable performance drop given the substantial
implementation simplifications it provides.

Figure 2: Convergence of ONMC, NMC, and fixed N1. Results are averaged over 1000 runs, with
solid lines showing the mean and shading the 25-75% quantiles. The theoretical
rates for NMC are shown by the dashed lines.

4.5 Using Online Nested Monte Carlo in PPSs

Using ONMC-based estimation schemes to ensure consistent estimation for nested inference
in PPSs is very straightforward – we can simply keep track of the number of iterations
the outermost query has been run for, and use this to set the number of iterations used for
the inner queries. In fact, we do not even need this minimal level of communication – $n_0$
can be inferred from the number of times we have previously run inference on the current query, the current depth $k$,
and $\tau_1(\cdot), \dots, \tau_{k-1}(\cdot)$.
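A minimal sketch of this idea (our own construction; the schedule and class names are assumptions, not Anglican internals): an inner query simply counts its own invocations and derives its budget from that count, with no message passing from the outer query.

```python
import math

class InnerQuery:
    """Sketch of an inner query that infers its ONMC budget from how many
    times it has been invoked, avoiding explicit communication with the
    outer query. Schedule tau(n) = max(25, floor(sqrt(n))) is an assumption."""

    def __init__(self):
        self.calls = 0

    def budget(self):
        """Return the inner sample budget for this invocation."""
        self.calls += 1
        return max(25, math.isqrt(self.calls))

inner = InnerQuery()
budgets = [inner.budget() for _ in range(1000)]
```

For a single nesting level, the invocation count coincides with the outer iteration $n_0$, so the budgets grow exactly as $\tau_1(n_0)$ prescribes.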

When using properly weighted (Naesseth et al., 2015) methods such as importance sampling or SMC,
inner queries can either return a full set of weighted samples each time
they are called, or generate this full set of weighted samples but return only one, drawn in
proportion to the inner query weights $\pi_i(y_n, z_{n,\ell})/q(z_{n,\ell}|y_n)$.
In the former case, the weights in the outer query are unchanged from (7), except that $N_1$
is now a function of $n$. In the latter, these weights become
$w_n = \psi(x_n, y_n, z_n)/q(x_n, y_n)$, where $z_n$ is the returned sample. This latter
case is a strictly inferior estimator, but will often be necessary in practice because of the
semantic complications associated with, for example, returning multiple samples from
a single sample draw in Anglican.
Returning to Figure 1, we see that using ONMC with nested importance sampling
and only returning a single sample corrects the previous issues with how Anglican deals
with nested inference, producing samples indistinguishable from the reference code.

Though one can use the results of Fort et al. (2017) to show the correctness of instead using
an MCMC estimator for the outer query, the correctness of using MCMC methods for the
inner queries is not explicitly covered by existing results.
One would intuitively expect the ONMC results to carry over – as N0→∞ all the inner queries
will run their Markov chains for an infinitely long time, thereby in principle returning exact samples
– but we leave formal proof of this case to future work.

4.6 Discrete or Deterministic Input Variables

One special case where consistency can be maintained without requiring infinite computation for each nested call
is when the variables passed to the inner query can take on only a finite number, say C, of
possible values. Of particular note is the case when only deterministic variables are passed to the inner
query, corresponding to C=1, which, for example, forms the theoretical basis for the “programs as proposals” approach
of Cusumano-Towner and Mansinghka (2018).
As per Theorem 5 of Rainforth et al. (2017a), we can rearrange such problems into
C separate estimators such that the standard Monte Carlo error rate can be achieved.
This is perhaps easiest to see by noting that, for such problems, ∫πi(y,z)dz can
only take on C distinct values, so that enumeration reduces the problem to C separate, non-nested, inference problems.
For repeated nesting, the rearrangement can be applied recursively until one achieves
a complete set of non-nested estimators. To avoid the inferior NMC convergence rates, this special case requires
explicit rearrangement
or specialist consideration by the language back-end (as done by, e.g., Stuhlmüller and Goodman (2012, 2014); Cornish et al. (2017)). For example, one can
dynamically detect the inner query being called with the same inputs, e.g. using memoization, and then exploit the fact that all
such cases target the same inference problem. Care is required in these approaches to ensure the correct
combination with the outer query, e.g. returning properly weighted samples and ensuring the budget of the
inner queries remains fixed.
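A minimal sketch of the memoization idea, assuming a hypothetical `inner_estimator` that runs fixed-budget inner inference for a given input:

```python
def make_memoized_inner(inner_estimator):
    """Cache the inner estimate per distinct input: each of the C possible
    inputs triggers exactly one fixed-budget run of inner inference, so the
    nested problem collapses to C separate non-nested estimators."""
    cache = {}
    def estimate(inputs):
        if inputs not in cache:
            cache[inputs] = inner_estimator(inputs)
        return cache[inputs]
    return estimate

calls = []
def dummy_inner(inputs):
    """Stand-in for a fixed-budget inner marginal-likelihood estimate."""
    calls.append(inputs)
    return float(inputs)

estimate = make_memoized_inner(dummy_inner)
results = [estimate(c) for c in (1, 2, 1, 1, 2)]  # C = 2 distinct inputs
```

Note this sketch omits the care described above (proper weighting when combining with the outer query); it only illustrates that repeated calls with the same input target the same inference problem.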

4.7 Exact Sampling

It may, in fact, be possible to provide consistent estimates for many nested query problems without requiring infinite computation
for each nested call by using exact sampling methods such as rejection sampling or coupled Markov chains (Propp and Wilson, 1996).
Such an approach is taken by Church (Goodman et al., 2008), which behaves as a hierarchical rejection sampler:
no sample is ever returned until it passes its
local acceptance criterion. Church is able to do this because it only supports hard conditioning on events with finite probability, allowing it to use a guess-and-check process that produces an exact sample
in finite time, simply sampling from the generative model until the condition is satisfied.
Although the performance still clearly
degrades exponentially with nesting depth, this affects the constant factor of the
computation, not its scaling with the number of samples taken:
generating a single exact sample of the distribution takes finite expected time using rejection sampling, and this is thus
a constant factor in the convergence rate.
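The guess-and-check process can be illustrated with a minimal rejection sampler (a generic sketch, not Church's actual implementation):

```python
import random

def rejection_sample(generate, condition, max_tries=100000):
    """Repeatedly simulate from the generative model until the hard
    condition holds; returns an exact sample in finite expected time
    whenever the conditioning event has nonzero probability."""
    for _ in range(max_tries):
        x = generate()
        if condition(x):
            return x
    raise RuntimeError("condition never satisfied within max_tries")

# Example: sample a die roll conditioned on the (finite-probability)
# event that the roll is even
random.seed(0)
roll = rejection_sample(lambda: random.randint(1, 6), lambda x: x % 2 == 0)
```

The expected number of tries is the reciprocal of the probability of the conditioning event, which is why nesting such samplers multiplies the constant factor rather than degrading the convergence rate.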

Unfortunately, most problems require conditioning on measure zero events because they include continuous data
– they require a soft conditioning akin to the inclusion of a likelihood term – and so cannot be tackled using Church.
Constructing a practical generic exact sampler for soft conditioning in an automated way is likely to be insurmountably problematic
in practice. Nonetheless, it does open up the intriguing prospect
of a hypothetical system that provides a standard Monte Carlo convergence rate for nested inference.
This assertion is a somewhat startling result: it suggests that Monte Carlo
estimates made using nested exact sampling methods have a fundamentally different
convergence rate for nested inference problems (though not nested estimation problems in general) than, say,
nested self-normalized importance sampling (SNIS).

5 Nested Observation

An alternative way one might wish to nest queries is through nested observation, whereby we
use the partition function estimate of one program to factor the trace probability of another.
In its simplest form, the “observed” variables in the outer program are the inputs to the inner
program. We can carry this out in Anglican using the following
custom distribution object.

(defdist nested-observe
  [inner-query inf-alg M] []
  (sample [this] nil)
  (observe [this inputs]
    (log-marginal (take M
      (doquery inf-alg inner-query inputs)))))

We can then “observe” another query by calling

(observe (nested-observe inner-query inf-alg M) inputs)

which will generate a partition function estimate using the inner query
and then use this to weight the trace probability of the outer query.
This could, for example, be used to construct a PMMH sampler (Andrieu et al., 2010).

Unlike the nested inference case, this approach turns out to be valid even if our
budget is held fixed, provided that the partition function estimate is unbiased,
as is satisfied by, for example, importance sampling and SMC. In fact, it is important to hold the budget fixed to achieve
a Monte Carlo convergence rate. In this nested observation case, we can define our
target density as

po(x,y) ∝ πo(x,y) = ψ(x,y) pi(y),   (13)

where ψ(x,y) is as before (except that we no longer have returned
variables from the inner query) and pi(y) is the true partition function
of the inner query when given input y. In practice, we cannot evaluate
pi(y) exactly, but instead produce unbiased estimates ^pi(y).
Using a self-normalized importance sampling scheme analogous to the nested inference
case leads to the weights

wn = ψ(xn,yn) ^pi(yn) / q(xn,yn)   (14)

and corresponding empirical measure

^p(⋅) = 1/(∑N0n=1 wn) ∑N0n=1 wn δ(xn,yn)(⋅)   (15)

such that we are now conducting conventional Monte Carlo estimation, but our
weights are now themselves random variables for a given xn,yn due to
the ^pi(yn) term. However, the weights are unbiased estimates
of the “true weights” ψ(xn,yn)pi(yn)/q(xn,yn) such
that we have proper weighting (Naesseth et al., 2015) and thus
convergence at the standard Monte Carlo rate, provided the budget of the
inner query remains fixed. This result also follows directly from
Theorem 6 of Rainforth et al. (2017b), which further ensures no complications arise when
observing multiple queries if the corresponding partition function
estimates are generated independently.
These results further trivially extend to the repeated nesting case by recursion, while,
using the idea of pseudo-marginal methods (Andrieu and Roberts, 2009), they
also immediately extend to the case where the outermost query uses MCMC-based
inference.
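A toy sketch of this pseudo-marginal weighting scheme (the function names are illustrative; the noisy partition function estimate is made unbiased here by multiplying the true value by a mean-one random factor):

```python
import random

def snis_nested_observe(n_outer, sample_xy, psi, q, p_hat_i):
    """Self-normalized importance sampling where each weight uses an
    unbiased estimate p_hat_i(y) of the inner partition function in
    place of the intractable true value pi(y), with a fixed inner budget."""
    samples, weights = [], []
    for _ in range(n_outer):
        x, y = sample_xy()
        w = psi(x, y) * p_hat_i(y) / q(x, y)  # noisy but properly weighted
        samples.append((x, y))
        weights.append(w)
    total = sum(weights)
    return samples, [w / total for w in weights]

random.seed(1)
samples, norm_w = snis_nested_observe(
    100,
    sample_xy=lambda: (random.random(), random.random()),
    psi=lambda x, y: x + y,                        # toy unnormalized target
    q=lambda x, y: 1.0,                            # uniform proposal density
    p_hat_i=lambda y: y * random.uniform(0.5, 1.5) # unbiased estimate of y
)
```

Because each weight is an unbiased estimate of the true weight, the empirical measure converges at the standard Monte Carlo rate, mirroring the proper-weighting argument above.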

Though, as demonstrated, there are no statistical complications with nested observation (other
than the need for unbiased partition function estimates), there are potentially some
semantic complications if we wish to carry out more advanced tasks.
For example, we might desire to observe that the output of another query is a particular value.
Unfortunately, this is not possible in general because the change of variables caused by
deterministic mappings within
the inner program often means that it is not possible to directly evaluate the density of the
outputs – we rely on the law of the unconscious statistician instead.
One thing that is possible though is to condition on the output of a subset of the sampling
statements returning particular values. This process was first carried out by Rainforth et al. (2016b),
who use a code transformation to generate a new program that produces the required partition
function estimates, and then use this to construct marginal maximum a posteriori estimators
and PMMH samplers. The idea has since been used in the Turing PPS to construct “composable”
inference algorithms like PMMH (Ge et al., 2018).
Our results confirm the statistical validity of these approaches.

6 Estimates as Variables

Our final case is that one might wish to use estimates as first-class variables in another query. In our previous examples,
nonlinear usage of expectations only ever manifested through inversion of the normalization
constant. These methods are therefore clearly insufficient for expressing more general
nested estimation problems as would be required by, for example, experimental design.
Using estimates as first-class variables in a program allows, in principle, for the encoding of
any nested estimation problem – using these estimates in the outer query gives rise to
nonlinear mappings, and expectations taken with respect to the outer query become nested estimation
problems. Thus, inevitably, the validity of such approaches is equivalent
to that of NMC (or ONMC, depending on the approach taken) more generally, and so they must be set
up appropriately. In particular, we need to ensure that the budgets used for
the first-class variable estimates increase as we take more samples of the outermost query.
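A toy sketch of this requirement (the square-root schedule is one common choice, not the only valid one): whenever the inner estimate is used nonlinearly, the inner budget M must grow with the outer sample count N for consistency.

```python
import math

def nested_estimate(N, sample_x, sample_z_given_x, f, g):
    """Toy NMC estimator of E_x[ g( E_{z|x}[f(x,z)] ) ], where the inner
    expectation estimate is used as a first-class variable via the
    nonlinear map g. The inner budget M is tied to N so that the
    estimator is consistent as N grows."""
    M = math.ceil(math.sqrt(N))  # inner budget grows with outer budget
    total = 0.0
    for _ in range(N):
        x = sample_x()
        inner = sum(f(x, sample_z_given_x(x)) for _ in range(M)) / M
        total += g(inner)        # nonlinear use of the inner estimate
    return total / N

# Degenerate deterministic check: inner expectation is exactly 2, g squares it
est = nested_estimate(4, lambda: 1.0, lambda x: 2.0,
                      lambda x, z: x * z, lambda e: e ** 2)
```

Holding M fixed here would leave an O(1/M) bias from the nonlinearity of g, which is precisely the asymptotic bias discussed for naive nested estimation.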

One can implicitly use expectation estimates as first-class variables in Anglican
by either calling doquery inside a defdist declaration, or in a defn function
passed to a query using with-primitive-procedures, a macro providing
the appropriate wrappings to convert a Clojure function to an Anglican one. An example of the latter is shown
in Appendix B, which provides generic Anglican code for carrying out experimental
design problems.

7 Discussion

We have formalized the notion of nesting probabilistic program queries and investigated
the statistical validity of different categories of nesting. We have found that current
systems tend to use methods that lead to asymptotic bias for nested inference problems, but that
they are consistent for nested observation. We have shown the conditions required to remedy the former case
and developed a new online estimator that simplifies the process of constructing algorithms
that satisfy these conditions.

Acknowledgements

I would like to
thank Yee Whye Teh for feedback on an earlier draft of the work.
My research leading to these results has received funding from the
European Research Council under the European Union’s Seventh Framework
Programme (FP7/2007-2013) ERC grant agreement no. 617071. Some of the work was
undertaken while I was still at the Department of Engineering Science and was supported
by a BP industrial grant.

Appendix A Case Study: Simulating a Poker Player

As a more realistic demonstration of the utility for allowing nested inference
in probabilistic programs, we consider the example of simulating a poker player who reasons about
another player. Anglican code for this example is given in Figure 3.
In the interest of exposition, we consider the simple situation where both players are short-stacked, such that player 1’s (P1’s)
only options are to fold or go all-in. Presuming that P1 has bet, the second player (P2)
can then either fold or call. One could easily adapt
the model to consider multiple players, additional betting options,
and multiple rounds of betting (for which additional levels of nesting might be required).

(defdist factor [] []
  ;; Factors trace probability with provided value
  (sample* [this] nil)
  (observe* [this value] value))

(defdist hand-strength []
  ;; Samples the strength of a hand
  [dist (uniform-continuous 0 1)]
  (sample* [this] (sample* dist))
  (observe* [this value] (observe* dist value)))

(defdist play-prior [hand]
  ;; A prior on whether to play a hand without reasoning about the opponent
  [dist (categorical
          {:fold (+ 0.25 (* 0.75 (- 1. hand)))
           :bet  (+ 0.25 (* 0.75 hand))})]
  (sample* [this] (sample* dist))
  (observe* [this value] (observe* dist value)))

(with-primitive-procedures [hand-strength play-prior factor]
  (defm calc-payoff [player p1-hand p1-action p2-hand p2-action]
    ;; Calculate payoff given actions and hands. Payouts are not
    ;; symmetric due to the blinds
    (case p1-action
      :fold (if (= player 1) -1 1)            ;; Pick up small blinds
      :bet (case p2-action
             :fold (if (= player 1) 2 -2)     ;; Pick up big blinds
             :bet (case [player (> p2-hand p1-hand)]  ;; Showdown
                    [1 true] -6 [1 false] 6 [2 true] 6 [2 false] -6))))

  (defm payoff-metric [payoff]
    ;; Heuristic model converting a payoff to a log score
    (* 3 (log (/ (+ payoff 6) 12))))

  (defquery p2-sim [p2-hand p1-action]
    ;; Simulator for player 2, who knows player 1's action but not his
    ;; hand. Returns action
    (let [p2-action (sample (play-prior p2-hand))
          p1-hand (sample (hand-strength))]    ;; Simulate a hand for player 1
      (observe (play-prior p1-hand) p1-action) ;; Condition on known action
      (observe (factor)  ;; Factor trace probability with payoff metric
               (payoff-metric (calc-payoff 2 p1-hand p1-action p2-hand p2-action)))
      p2-action))  ;; Return action

  (defquery p1-sim [p1-hand M]
    ;; Simulator for player 1, who reasons about what player 2
    ;; might do to choose an action given a hand
    (let [p1-action (sample (play-prior p1-hand))  ;; Sample action
          p2-hand (sample (hand-strength))         ;; Sample hand for opponent
          p2-dist (conditional p2-sim :smc :number-of-particles M)
          p2-action (if (= p1-action :fold)
                      :bet  ;; No need to simulate player 2
                      (sample (p2-dist p2-hand p1-action)))]
      (observe (factor)  ;; Factor trace probability with payoff metric
               (payoff-metric (calc-payoff 1 p1-hand p1-action p2-hand p2-action)))
      [p1-action p2-action])))  ;; Return actions

(defn estimate-actions [hand M]
  ;; Estimates the relative probability of betting for a hand
  (->> (doquery :importance p1-sim [hand M])
       (take 10000) collect-results empirical-distribution))

Figure 3: Code simulating the behavior of a poker player who reasons about the behavior of another player.
Explanation provided in text.

The two key functions here are the queries p1-sim and p2-sim, representing simulators for the two
players. The simulations for each player exploit a utility function, which is a deterministic function of
the two hands and actions taken, to reflect the preference for actions which give a bigger return.
P1 acts before P2 and must thus not only reason about what hands P2 might
have, but also the actions P2 might take. To do this, P1 utilizes the simulation for P2,
which takes in as inputs an observed action from P1 and a true hand for P2, returning actions taken
by P2. P2 has incomplete information – he does not know P1’s hand – and so the simulation of P2’s action
requires inference, with a normalization constant that depends on P2’s hand and P1’s action, both of which
are random variables in the outer query p1-sim. This, therefore, corresponds to the nested inference class
of nested estimation problems.

Keen-eyed readers may have noticed that this use of conditional in p1-sim is distinct from elsewhere in
the paper, as we have explicitly used SMC inference with a provided number of particles M for conditional. This provides a
roundabout means of controlling the computational budget for calls to conditional, as we showed is required for convergence in
Section 4. In Figure 6 we see that changing M changes the probability
of each player betting under the model. Increasing M corresponds to P2’s calculation being carried out more
carefully and so it is perhaps not surprising that it becomes less beneficial for P1 to bet with a weak hand as M
increases. Nonetheless, even for a large M it remains beneficial for P1 to bet with less than average hands,
reflecting the fact that money is already invested in the pot through the blinds and the opportunity to win through
bluffing.
The results for M=50 and M=100 are very similar, suggesting that they are close to the true solution.

Figure 4: Probability of player 1 betting. Figure 5: Probability of player 2 betting.

Figure 6: Probabilities of each player betting output by p1-sim for different computational budgets of the inner
estimator. The ostensibly strange behavior that P2 is more likely to bet as the strength of P1's hand
increases is because p1-sim conditions on having a good return for P1, and when P1 has a good hand,
it becomes increasingly beneficial for that bet to be called. In other words, (b) is explicitly not the
marginal distribution for optimal betting of P2, which would require a separate calculation.

Appendix B Experimental Design Example

An example application of using estimates as first-class variables is provided by
Bayesian experimental design (Chaloner and Verdinelli, 1995).
Figure 7 shows generic Anglican code for carrying out Bayesian experimental
design, providing a consistent means of carrying out this class of nested estimation problems.
Figure 6 of Rainforth et al. (2017a) shows the convergence of code equivalent to that of
Figure 7 for a delay discounting model.
This shows the convergence (or, more specifically, lack thereof) in the case where M=N1 is held fixed,
and the superior convergence achieved when exploiting the finite number of possible outputs to
produce a reformulated, standard Monte Carlo, estimator.
It therefore highlights both the importance
of increasing the number of samples used by the inner query and exploiting our outlined special cases
when possible.

(defm prior [] (normal 0 1))

(defm lik [theta d] (normal theta d))

(defquery inner-q [y d]
  (let [theta (sample (prior))]
    (observe (lik theta d) y)))

(defn inner-E [y d M]
  (->> (doquery :importance
                inner-q [y d])
       (take M)
       log-marginal))

(with-primitive-procedures [inner-E]
  (defquery outer-q [d M]
    (let [theta (sample (prior))
          y (sample (lik theta d))
          log-lik (observe*
                    (lik theta d) y)
          log-marg (inner-E y d M)]
      (- log-lik log-marg))))

(defn outer-E [d M N]
  (->> (doquery :importance
                outer-q [d M])
       (take N)
       collect-results
       empirical-mean))

Figure 7: Anglican code for Bayesian experimental design.
By changing the definitions of prior
and lik, this code can be used as a
NMC estimator (consistent as N,M→∞) for any static
Bayesian experimental design problem.
Here observe* is a function for returning the log likelihood (it does not
affect the trace probability), log-marginal produces a marginal likelihood estimate
from a collection of weighted samples,
and ->> successively applies a series of function calls,
using the result of one as the last input to the next. When outer-E is invoked,
this runs importance sampling on outer-q, which, in addition to carrying out
its own computation, calls inner-E. This, in turn, invokes another inference over
inner-q, such that a Monte Carlo estimate using M samples is constructed for
each sample of outer-q. Thus log-marg is a Monte Carlo estimate itself.
The final return is the (weighted) empirical mean of
the outputs of outer-q.

Footnotes

Even though the nested calls
are made inside the conditioning
predicate for these examples, the probabilistic semantics of Church
means these behave as per nesting inference.

It is important to also take
the convention that the weight is zero whenever πi(yn,zn,1)=0 to avoid
0/0 complications.