Incremental Computation Model Refresher

I originally planned to describe the improved dependency tracking algorithm for
the Rust’s incremental compilation as the follow-up to the
last post.
However, while doing a draft of that, I kept getting confused about cached
results and their exact relationship to nodes in the dependency graph. I think
it’s useful to write down what my mental model for incremental computation in
the compiler is, especially since it’s slightly different than what’s described
in the incr. comp. alpha version blog post
and also in the incr. comp. RFC.

The Model

In order to get a clearer picture, I’ll try to sort out what the central
concepts here are and how they relate to each other. My mental model looks
about like this:

There are computations in the compiler, each of which has a result

There are cache entries which can be loaded from disk

There is the dependency graph consisting of nodes and edges between them

Let’s define those terms in more detail:

Computations

The model views the compilation process as a graph of computations, where
each computation is the application of some function f to a set of inputs,
which in turn are computations themselves. Each computation is a node in this
graph, and if one computation takes the result of another computation as input,
then there is a direct edge between those two nodes. The roots of the
graph are the input values of the compilation process (e.g. source files,
command-line args, etc), which can be viewed as trivial computations, yielding
their result without need for inputs.
Each computation has a result, although that result sometimes is just
success or no success (e.g. borrow checking). Here’s an illustration of how a
compilation session could be modeled this way:

This is of course a simplified example, but in theory the whole compilation
process can be modeled like that.

Cache Entries

The next term on our list is cache entry, i.e. the stuff that’s stored in the
incremental compilation cache directory. In theory this could be anything that
can be loaded from disk in order to help not re-compute something during
re-compilation, but here I’ll define it more strictly: a cache entry is the
serialized result value of one computation node from the computation graph. In
other words, each computation in the compilation can have its result cached and
we can use the ID of the computation to see if there is something cached for
that computation. Note, however, that we don’t require that there is a
cache entry for the result of every computation.

Given a computation graph and a cache, it’s easy to see how this enables
speeding up re-computing the graph: Whenever I reach a computation X, I
can take a look if I already have the result of X in the cache. But how do we
know whether the cached result of X is still valid. That’s where the next
term comes into play.

Dependency Graph

The purpose of the dependency graph is cache invalidation. We want to be able
to ask it “given that these inputs have changed, which cache entries need to
be purged?”
With this in mind,
it might be a bit surprising that there is a separate concept of a dependency
graph. Isn’t the computation graph already the dependency graph? The answer is
yes and no. Yes: it’s a perfect dependency graph, and no: it’s not the
dependency graph that is actually used by the compiler. The reason is tracking
granularity. The number of computations (each of which, conceptually, must have
the properties of a pure function application) would be too high to keep track
of efficiently and often there would not even be a theoretical benefit in
doing so.
Consequently, we try to construct a graph that still fulfills the goal of
proper cache invalidation while not being more fine-grained than it
needs to be. This is possible because it is safe to “conflate” multiple
computation nodes into one dependency node. For example, consider two
computations A and B, with their inputs I1, I2, and I3:

I1----+|vI2-->AI3-->B

Our dependency graph must ensure that A’s cache entry is invalidated if
either I1 or I2 changes, and that B’s cache entry is invalidated if I3
changes. Now let’s say, we replace the nodes A and B with one new node
A&B and associate this node with both A’s and B’s cache entries:

I1-----+|vI2-->A&B^|I3-----+

We still can give the same guarantees: If I1 or I2 changes, A’s
cache entry is purged (because both cache entries are) and if I3 changes
then B’s cache entry is purged (because, again, both are). So we have
effectively traded tracking overhead for getting more false positives during
cache invalidation. (Note that there are even cases where conflation will net
a pure win, where we get rid of unnecessary granularity).

So, we can finally define that the dependency graph is an approximation of the
computation graph such that:

each dependency graph node corresponds to a set of computation nodes, and

if computation node A is in the set of dependency node A', and computation
node B is in the set of dependency node B', and there is an edge between
A and B, then there is also an edge between A' and B'.

I’ll also add the restriction that these sets are disjoint, i.e. that every
computation node corresponds to exactly one dependency node. That’s what one
gets anyway, if one starts out with a computation graph and starts merging
nodes.

Summary

So we have the computation graph, each node of which corresponds to zero or one
cache entries. And we have the dependency graph, each node of which corresponds
to one or more computation nodes. Thus we can link each dependency node to all
corresponding cache entries (which we need to during cache invalidation)
and vice versa (which we need to do during dependency tracking).

dep-nodesD1D2D3/\|/|\computationsC1C2C3C4C5C6|||cacheentriesE1E2E4

Some Examples

Often there is a one to one to one correlation between dep-node, computation, and
cache entry:

But the RFC has procedures and write-edges and whatnot!

when a procedure reads some piece of data, a “read edge” is added from the
top-most procedure node to the “data node”, and

when the procedure writes some piece of data, a “write edge” is added from the
procedure node to the data node.

This sounds quite different, but it’s actually equivalent. There is no actual
difference between “procedure nodes” and “data nodes” (and implementation in
the compiler also never has made this kind of distinction).
It is just a convenient way of mapping the
compiler’s “passes that read and produce data” execution model to a
dependency graph. And there has always really only been one kind of edge in the
dependency graph: a “write edge” is just recorded as a reverse “read edge”.

In order to see that the two models are equivalent, we can provide a mapping
from one to the other:

A procedure node is replaced by a computation node that as result just
has a collection of all the values the procedure node has read edges to.

Each data node is replaced by a computation node that computes its value
from that data in the connected procedure nodes.

I prefer the computation graph model to the procedures-and-data graph model,
since it only has one kind of node and one kind of edge instead of two of both.

What about other Incremental Computation models like Adapton?

There’s quite a bit of existing research on incremental computation. One very
interesting model is Adapton. It looks like it maps well
to our needs and indeed, after reading the papers, I think Nominal Adapton’s
demanded computation graph is pretty much equivalent to the computation graph
as I described it above (where a ref in Adapation is what I called a trivial
computation and a thunk would be a regular computation). That would be pretty
cool because the Adapton people have put lots of effort into proving things
about the model.

Conclusion

With the relationship between dependency graph nodes and cache entries clarified
it should be easier to describe the new cache invalidation algorithm. It will be
interesting to compare it more closely to the Adapton model. I’m sure we can
learn something from them.