One-Hole Contexts Generalize Diff To Containers

by Phil Freeman on 2012/12/21

Text-based diff is of limited usefulness for analysing changes to a large code base over time. I've wondered for a while how one might generalize the diff and patch functions on lists to more general data structures such as abstract syntax trees.

It's pretty easy to write down a version of the diff function for, say, binary trees, but less simple to write a function which works generically across multiple data types. As usual, I spent a while thinking about this before realizing the problem had already been solved ([1], [2]). I'm still quite happy with the approach I came up with, and I think it's sufficiently interesting to write about here anyway.

Longest Common Subsequences and Largest Common Substructures

The diff function can be specified as follows: for lists xs and ys, find the shortest edit sequence taking xs to ys. We can reduce the problem to finding the longest common subsequence of xs and ys: removing a subset of the elements of xs yields the longest common subsequence, and inserting a subset of the elements of ys into that subsequence then yields ys.
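As a rough sketch of that reduction for plain lists, the following Haskell computes a longest common subsequence naively and then reads an edit script off it. The names Edit, lcs and diff are chosen here for illustration and are not taken from any library; the search is exponential, in keeping with ignoring optimization.

```haskell
-- One step of an edit script (illustrative names, not a library type).
data Edit a = Keep a | Delete a | Insert a
  deriving (Eq, Show)

-- Naive longest common subsequence: exponential time, no memoization.
lcs :: Eq a => [a] -> [a] -> [a]
lcs [] _ = []
lcs _ [] = []
lcs (x:xs) (y:ys)
  | x == y    = x : lcs xs ys
  | otherwise = longer (lcs (x:xs) ys) (lcs xs (y:ys))
  where
    longer as bs = if length as >= length bs then as else bs

-- Walk both inputs alongside their common subsequence: elements of xs
-- outside it are deleted, elements of ys outside it are inserted.
diff :: Eq a => [a] -> [a] -> [Edit a]
diff xs ys = go xs ys (lcs xs ys)
  where
    go as bs [] = map Delete as ++ map Insert bs
    go (a:as) bs (c:cs) | a /= c = Delete a : go as bs (c:cs)
    go as (b:bs) (c:cs) | b /= c = Insert b : go as bs (c:cs)
    go (_:as) (_:bs) (c:cs) = Keep c : go as bs cs
    go _ _ _ = []  -- not reached when the third argument is a common subsequence
```

For example, diff "abc" "ac" keeps 'a', deletes 'b' and keeps 'c'.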

Ignoring optimization for the time being, let's take this as our starting point. We'll generalize the longest common subsequence to the largest common substructure.

First, let's examine the inductive definition of all subsequences of a list:

In words: if the input list has no elements, return just the input; otherwise, for each subsequence of the tail, yield two lists, the first including the head of the input and the second excluding it.
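A minimal Haskell sketch of that description follows. The function is called subs here only to avoid clashing with Data.List.subsequences, and longestCommon is an illustrative helper showing how the naive longest common subsequence falls out of it.

```haskell
-- All subsequences of a list, following the inductive description:
-- the empty list has only itself; otherwise each subsequence of the
-- tail is yielded once with the head and once without it.
subs :: [a] -> [[a]]
subs []     = [[]]
subs (x:xs) = concatMap (\s -> [x : s, s]) (subs xs)

-- Brute-force longest common subsequence: the longest subsequence of xs
-- that is also a subsequence of ys. Exponential, for illustration only.
longestCommon :: Eq a => [a] -> [a] -> [a]
longestCommon xs ys = foldr pick [] [s | s <- subs xs, s `elem` subs ys]
  where
    pick s best = if length s >= length best then s else best
```

A list of length n has 2^n subsequences under this definition, which is why memoization matters for anything beyond toy inputs.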

Let's recast that definition in the more generic language of containers:

If the input has no subcontainers, return just the input; otherwise, for each substructure of each subcontainer, yield two substructures, the first including the structure of the input and the second including only the substructure.
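Before going generic, it may help to see that rule specialised to one concrete shape. The sketch below assumes an unlabelled binary tree type, Tree, and a function treeSubstructures, both invented here for illustration.

```haskell
-- An unlabelled binary tree (a hypothetical example type).
data Tree = Leaf | Node Tree Tree
  deriving (Eq, Show)

-- A leaf has no subcontainers, so it is its only substructure.
-- For a node, each substructure of each child is yielded twice:
-- once plugged back into the node in place of that child,
-- and once on its own, discarding the node.
treeSubstructures :: Tree -> [Tree]
treeSubstructures Leaf = [Leaf]
treeSubstructures (Node l r) =
  concat [ [Node l' r, l'] | l' <- treeSubstructures l ] ++
  concat [ [Node l r', r'] | r' <- treeSubstructures r ]
```

Note that the result can contain duplicates, and that only one child is replaced at a time while its sibling is kept whole.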

The type Rec f is the usual type of recursive data structures of shape f. The Container class contains a few methods which deserve explanation.

The associated data type Context f is the type of one-hole contexts of f [3]. The children function takes an input of type f a and returns an array of the contained a's, each paired with its one-hole context. The plugIn method takes an a and a context, and plugs the a into the hole defined by that context. Finally, the childAt function takes an input of type f a and a context, and returns the a found in the input at that context.
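The following is a sketch of how such a class might be written, not a reproduction of the original listing. The exact signatures are assumptions (in particular, childAt returning a Maybe), and the TermF shape, its instance and the render helper are invented here as an example. The substructures function follows the prose description given earlier.

```haskell
{-# LANGUAGE TypeFamilies #-}
import Data.Kind (Type)

-- Recursive structures of shape f.
newtype Rec f = Rec (f (Rec f))

-- Assumed signatures; only the method names come from the text.
class Container f where
  data Context f :: Type -> Type
  children :: f a -> [(a, Context f a)]
  plugIn   :: a -> Context f a -> f a
  childAt  :: f a -> Context f a -> Maybe a

-- Shape of untyped lambda terms (a hypothetical example instance).
data TermF a = Var String | Lam String a | App a a

instance Container TermF where
  -- A context remembers everything about the node except the hole.
  data Context TermF a = InLam String | InAppL a | InAppR a
  children (Var _)   = []
  children (Lam v b) = [(b, InLam v)]
  children (App f x) = [(f, InAppL x), (x, InAppR f)]
  plugIn b (InLam v)  = Lam v b
  plugIn f (InAppL x) = App f x
  plugIn x (InAppR f) = App f x
  -- Only the position of the hole is inspected, not the stored siblings.
  childAt (Lam _ b) (InLam _)  = Just b
  childAt (App f _) (InAppL _) = Just f
  childAt (App _ x) (InAppR _) = Just x
  childAt _ _ = Nothing

-- No subcontainers: just the input. Otherwise each substructure of each
-- child is yielded plugged back into its context, and then on its own.
substructures :: Container f => Rec f -> [Rec f]
substructures (Rec x) = case children x of
  [] -> [Rec x]
  cs -> concat [ [Rec (plugIn s ctx), s] | (c, ctx) <- cs, s <- substructures c ]

-- Plain-text rendering, so results can be compared without an Eq instance.
render :: Rec TermF -> String
render (Rec (Var v))   = v
render (Rec (Lam v b)) = "\\" ++ v ++ "." ++ render b
render (Rec (App f x)) = "(" ++ render f ++ " " ++ render x ++ ")"
```

With this instance, the substructures of the term \x.\y.\z.((x z) (y z)) include \x.\y.x, obtained by keeping the first two abstractions and discarding the third along with the applications beneath it.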

There are some fairly obvious laws which instances of the Container class should satisfy:
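The three properties below are one plausible reading of such laws, assumed here rather than taken from the source. They are stated over explicit functions instead of the class so that the block stands alone, and they are checked against a hypothetical Pair shape whose context stores the sibling of the hole.

```haskell
-- Plugging any child back into its own context rebuilds the input.
lawPlug :: Eq fa => (fa -> [(a, ctx)]) -> (a -> ctx -> fa) -> fa -> Bool
lawPlug children plugIn x =
  all (\(a, c) -> plugIn a c == x) (children x)

-- Looking up a child's own context finds that child.
lawLookup :: Eq a => (fa -> [(a, ctx)]) -> (fa -> ctx -> Maybe a) -> fa -> Bool
lawLookup children childAt x =
  all (\(a, c) -> childAt x c == Just a) (children x)

-- Reading back a freshly plugged hole returns what was plugged in.
lawRoundTrip :: Eq a => (a -> ctx -> fa) -> (fa -> ctx -> Maybe a) -> a -> ctx -> Bool
lawRoundTrip plugIn childAt a c = childAt (plugIn a c) c == Just a

-- A hypothetical shape with exactly two children.
data Pair a = Pair a a deriving (Eq, Show)

-- Its one-hole contexts: which side the hole is on, plus the sibling.
data PairCtx a = HoleL a | HoleR a deriving (Eq, Show)

pairChildren :: Pair a -> [(a, PairCtx a)]
pairChildren (Pair l r) = [(l, HoleL r), (r, HoleR l)]

pairPlugIn :: a -> PairCtx a -> Pair a
pairPlugIn l (HoleL r) = Pair l r
pairPlugIn r (HoleR l) = Pair l r

pairChildAt :: Pair a -> PairCtx a -> Maybe a
pairChildAt (Pair l _) (HoleL _) = Just l
pairChildAt (Pair _ r) (HoleR _) = Just r
```

Laws of this kind are what make it safe for a generic function to take a structure apart with children and put it back together with plugIn.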

The diff algorithm has identified k as the largest common substructure, and the path through s which picks out this substructure discards the third abstraction over the variable z and the corresponding applications.

Applying Patches

We can also define a function patch which applies the result of diff to a structure:
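The generic version, phrased in terms of contexts, is not reconstructed here. As a sketch of the idea in the list special case only, the following applies an edit script to a list and fails if the script does not match the input. The Edit type and this patch signature are assumptions made for illustration.

```haskell
-- One step of an edit script (illustrative names, not a library type).
data Edit a = Keep a | Delete a | Insert a
  deriving (Eq, Show)

-- Apply an edit script to a list. Keep and Delete must agree with the
-- element they consume; any mismatch or leftover input yields Nothing.
patch :: Eq a => [Edit a] -> [a] -> Maybe [a]
patch [] [] = Just []
patch [] _  = Nothing  -- leftover input the script never mentions
patch (Insert y : es) xs = (y :) <$> patch es xs
patch (Keep x : es) (x' : xs)
  | x == x' = (x :) <$> patch es xs
patch (Delete x : es) (x' : xs)
  | x == x' = patch es xs
patch _ _ = Nothing
```

Returning Maybe rather than failing silently means a patch computed against one version of a list is rejected when applied to a different one.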

Conclusion

One-hole contexts give a pleasant generalization of diff and patch to containers.

There are still some issues remaining with this implementation, such as the lack of memoization and the incorrect handling of containers with multiple non-recursive constructors, or structures with no common substructures.

It would also be interesting to explore the extension to mutually recursive types, such as the types of statements and expressions in an abstract syntax tree.

References

[1] Package Data.Generic.Diff on Hackage

[2] Generic type-safe diff and patch for families of datatypes by Eelco Lempsink, 2009

[3] The Derivative of a Regular Type is its Type of One-Hole Contexts by Conor McBride