Saturday, April 30, 2011

Firstly, I have to point out that at no point have I disagreed with S&P on any purely technical issue. We're all working with the same code and agree on what it does. The issue is a human one: is it possible to get wrong results by writing code that *looks* correct. I was sent this example:

d (\x -> (d (x*) 2)) 1

Read naively it produces a different result to what you might expect. We can see why it fails by looking at this:

d (\x -> (d (x*) 2)) (1::Integer)

It fails to typecheck! That "1" is actually of type D Integer. So the semantics of that code are entirely different to what you might expect if you read it like an ordinary mathematical expression and ignored the types of all the subexpressions.

So I agree that it is possible for a programmer to misread that code. I still don't consider the algorithm to have failed in this case. It is standard in Haskell that the appearance of an integer in code doesn't necessarily mean it's of Integer type, so that when we write Haskell code we always need to be aware of the types of all of our terms. When we write d (\x -> (d (x*) 2)) 1, we're asking for the wrong thing before the code has started executing. But I have been convinced that there are dangers here for people reading the code naively.

However I am now completely convinced the situation is more complex than this and I'll have to address it again.

Anyway, I suggest using AD regardless. It's an awesomely powerful technique. But don't capture variables in a lambda or function without wrapping them immediately them in a lift, and if you multiply nest, you need to multiply lift. Essentially lift is a signal to the compiler that you desire the variable to be held constant with respect to the derivative. It also makes the code more readable. It's analogous to using fmap the correct number of times in order to push a function down into nested lists.

So I'll leave the following in place because removing it would be bad form and some of it still stands.

Introduction
In a recent paper, Siskind and Pearlmutter warn how the use of differentiation operators in functional programming languages is "fraught with danger" and discuss a problem common to "all attempts to integrate a forward-mode AD operator into Haskell".

This is curious. I have had great success using automatic differentiation code both in Haskell and functional-style C++ and failed to notice the danger. Clearly I needed to take better notice of what I was doing.

So let's go ahead and implement AD and try to reproduce the problem they point out.

The Haskell type checker identifies precisely where the problem is. x and y don't have the same types so they can't be added. Rather than immediately analyse how to fix this, let's try a completely different approach to calculus: symbolic differentiation. We'll define an expression type and write code to differentiate it

the outer d' differentiates symbolically so internally it uses an expression type. This means that x is an expression, not a numerical value. But this means that the inner d' is being asked to evaluate its result at a value that is itself an expression, not a value. So internally it uses expressions of expressions. So x is of type E a and y is of type E (E a). It's no surprise we can't add them. Our bug is easily fixed. We define a lifting function:

Much the same discussion applies to the AD code. In that case, D a is the type of a's that have been infinitesimally perturbed. The type D (D a) is a value whose perturbation is itself perturbed. But Siskind and Pearlmutter say that we might be subject to perturbation confusion and could get 2. Curious. There is no way we can 'confuse' these distinct perturbations. Not no how. Not no way. They correspond to data of entirely different types. Far from being fraught with danger, the Haskell type system keeps us safe. Perturbation confusion in Haskell is about as likely as expression tree confusion.

Let's take a look at the analysis in section 2. They introduce a perturbation ε and show how we might have computed 2 instead of 1. But this derivation doesn't correspond to any computation we could possibly have made using the function d in Haskell. The only issue we have identified is that we have to write our code using correct types. This has nothing to do with automatic differentiation or perturbations. It applies equally well to the symbolic differentiation code. In fact, it has nothing to do with differentiation as it applies equally well to my code to compute the even and odd parts of a function.

We can press on with the paper. Section three tries to sell us a remedy - tagging. But we have no need for tagging. The Haskell compiler already deduced that the perturbation inside the inner d is of a different type to that in the outer d. The only explanation I can come up with for this section is that the authors have some experience with functional programming languages that are dynamically typed and are trying to apply that experience to Haskell.

Section 4 reiterates the point in the abstract that implementations of AD "fail to preserve referential transparency". I think we can safely ignore this claim. The AD code above isn't using any unsafe Haskell operations. It clearly is referentially transparent.

Confusion lifted
So now to section 6. They talk about the lift function and how correctly inserting it requires "sophisticated non-local analysis". Now everything becomes clear. Siskind and Pearlmutter don't consider this algorithm to be "automatic differentiation" unless they can leave out the lift operations. But it is not common practice to write Haskell code by deliberately writing a significant body of code that doesn't type check, and then expect to automatically insert the missing pieces. You simply write the code correctly to start with. In fact, when I first found myself inserting a lift in my own code I didn't even think of myself as having solved a problem - I just wrote the code that was analogous to code that Haskell programmers write every day. If I had forgotten the lift, the compiler would have set me straight anyway.

Conclusion
"Perturbation confusion" is a non-existent problem in Haskell AD. Siskind and Pearlmutter correctly point out that Haskell code using AD is not exactly analogous to mathematical notation for derivatives. This is unsurprising, differentiation is not computable (for deep reasons reasons that are similar to the reason that intermediate values are not computable). But functions like lift are an everyday part of typed functional programming. This is not a feature of AD, it applies equally as well to symbolic differentiation (and indeed many other parts of Haskell such as using monad transformers and nested functors). The algorithm is no less a form of AD because of the use of lift.

I have been advocating AD for many applications over recent years. It has been an uphill struggle. It is often incorrectly assumed that I am talking about numerical or symbolic differentiation. Programmers will often assert that what I am describing is completely impossible. But when they grasp what I am explaining the response often changes to "why didn't anyone tell me about this years ago?" It is disappointing to see two major contributors to computer science whose work I greatly respect scaring away potential users with FUD about a non-existent problem.

(And if you got here but skipped the update at the top you really need to go to the top as I now think Siskind and Pearlmutter have a good point! I'm leaving the above just for the sake of posterity.)

Saturday, April 09, 2011

Introduction
I've not said much about my work in visual effects in this blog. This is mainly because I try very carefully to avoid any kind of intellectual property conflict with my employer. But now I've left the world of visual effects to work at Google, I think I might throw in the occasional article. So here's something about my own work from many years ago. In particular, back in the year 2000 I and my colleagues George and Kim received an Academy Scientific and Technical Achievement Award for "the development of a system for image-based rendering allowing choreographed camera movements through computer graphic reconstructed sets." I rarely tell people what this work was for. But now I'll have a convenient web page to which I can direct anyone who asks. I'm hoping to aim this at people who know a little mathematics and geometry, but not necessarily anything about computer graphics.

Many years ago I worked at a company called Manex Visual Effects. One day the story of that company needs to be told. Here are some things that have been alleged about Manex that I neither confirm nor deny: its stock was traded in one of the largest ever pump-and-dump scams in the Far East, it extracted millions of dollars from Trenton, New Jersey for a fake scheme to create an East Coast "Hollywood", and it spent a couple of years making press releases about how it was doing work on the Matrix sequels despite the fact that there was nobody at the company who was actually doing work on the movies, including making press releases quoting me describing the ongoing work long after I had left. At one point I was apparently being trailed by a private detective who even came rowing a boat past the back of my house in order to get pictures of me. I haven't checked the fine print, but I think the contracts that prevented me from speaking about any of this expired long ago.

But back around 1998-2000, when Manex was doing real effects work, we developed a pipeline for a technique known as image based rendering. We became adept at taking large numbers of photographs of locations and then reproducing those locations as photorealistic renders. When I say photorealistic here I don't mean something that's merely supposed to look real, but actually looks fake. I mean renders that were indistinguishable from photography. That's commonplace now but back in 2000 it was a challenge, especially on a large scale.

The last ten seconds of this video clip from MI:2 should give some idea of what I'm talking about. The city around Tom Cruise as he jumps from the building is entirely CGI, but using real photography:

Texturing Cities
Although MI:2 isn't set in Sydney, that is the location that was used. A team returned with many hundreds, if not thousands of photographs of the central business district. We also managed to construct 3D models of these buildings by various means. We started by using Façade to construct the geometry, but ultimately we used a variety of methods including buying 3D models from, I think, the city itself. In later years I completely replaced the way we built geometry so we didn't need to use any other source.

The goal was to map the photography onto the models. Here's a picture illustrating three points being mapped onto a building. To render correctly we needed every visible point on every 3D object to be coloured (correct terminology: textured) using a pixel from one of the photographs.

The Matrix
We're ready for a theorem. Let P be a photograph of a scene S. Let (u, v) be ordinary rectilinear coordinates in P. Let (x, y, z) be 3D coordinates in the scene S. Define proj(u, v, w) = (u/w, v/w). Then there is a 3×4 matrix M such that for every point (x, y, z) visible in S, its colour is given by the colour of the point with coordinates proj(M(x,y,z,1)T) in P.

(This assumes a pinhole projection model for the camera. In fact, real cameras have lens distortion. I wrote software to measure and fix this.)

The important point is that for each camera we just needed one matrix. We generated it by a very simple scheme: we had artists mark correspondences between points on our 3D models and points in the photography. I implemented a least squares solver to find the matrix that best fit each view. Pity the artists. They came fresh from college with well developed 3d modelling skills and I was responsible for reducing their careers to this simple point and click operation. But don't feel too bad. Most of these artists have since gone on to very successful careers in visual effects. They were a really great team.

But just mapping a single photograph is no good. The theorem tells us what to do for each point visible in a photograph, but how do we select which photograph to use? When producing the final render for production we had the luxury of being able to allow hours to render a single frame. We could make the decision on a pixel by pixel basis by performing a test for visibility from each camera, and then using heuristics to make the selection. (My colleague George had already proved this worked for the backgrounds to most of the bullet-time shots from The Matrix.)

But that was no good for the artists. They needed to see their work as they were doing it. If we could get the render time down to seconds it would be good. In fact, we got it down to a fraction of a second.

The Splitter
The first thing to note is that if we render a single triangle then the mapping defined by the theorem above is provided by modern graphics hardware. Back in 2000 it wasn't universally available, but it was in the SGI hardware we used. So we only needed to solve the camera selection problem fast. The solution was simple, we'd preprocess the geometry of the scene by splitting it into pieces (reducible to triangles), each piece corresponding a a part visible from one camera. Debevec et al. had already given demos of a technique for approximately partitioning a scene between cameras like this, but it didn't have the quality we wanted.

(By the way, my career in graphics started when I figured out an algorithm for eliminating the division from the proj() function and my new boss-to-be noticed I'd posted it online.)

There were three things needed:

1. For each camera, we needed to compute the subscene that was the intersection of the scene S with the volume viewed by the camera, its frustum.

2. For each subscene we needed to remove any parts occluded from view by any other part of the scene.

3. Ensuring that no two subscenes overlapped - ie. ensuring that no part of any subscene is visible in more than one camera.

It seemed like a messy computational geometry problem. But after a little thought it dawned on me that most of the geometry work, at least as measured by computation time, could be factored into one tiny geometry routine and the rest was logic. Here's a picture of the algorithm:

It takes a convex polygon and slices it using a plane (indicated in grey). Now everything else follows. For example, step 1 above can be achieved like this: the frustum associated with a camera is a 6-sided convex polyhedron (or a 4- or 5-sided polyhedron if you want it to extend all the way to infinity and/or all the way to the point of projection in the camera.) We can decide which points are in the frustum by slicing using the 6 planes and keeping the pieces that fall inside.

Step 2 works a little like this:

There is a 5-sided polygon lying in front of a box casting a shadow. The shadow volume is itself a 6-sided frustum (with a seventh side "at infinity", so to speak). So to remove the parts shadowed by a polygon we use the same slicing algorithm and collect up all the pieces that fall outside of the shadow volume. To remove all of the scene that is occluded from view by this camera we simply remove shadow volume corresponding to every single polygon in the scene. One of the virtues of image based rendering is that the detail in the photography makes up for the simplicity of the geometry, keeping the polygon count low. So the total number of operations might not seem too bad. Unfortunately, every time you slice through the scene you risk doubling the total number of polygons. It could have taken worse than exponential time in the number of polygons. But I took the risk, and surprisingly the time to preprocess was measured in minutes for typical scenes. The individual pieces of geometry were very intricate due to the fact that occlusion from any object could carve into a building. But like a jigsaw, the union of all the pieces gave something that looked just like a real city.

The last stage was ensuring that each part is visible from one camera. This was straightforward. As every polygon was processed with respect to every camera then I could associate to every polygon a list of cameras in which it was visible. At the end I could just sweep through and pick the best camera based on a heuristic. Typically we wanted to use the camera with the most undistorted detail.

Along the way I had to apply a few more heuristics to make things manageable. Many thin slivers would appear in the slicing. If they were small enough I threw them away. I'd also sweep through from time to time and fuse together neighbouring polygons that had been sliced but ended up still being visible in the same cameras, and whose union was still convex. That would reduce the total polygon count and speed up further processing.

It worked. The artists would mark their correspondences, run the optimiser to extract the per-camera matrices, kick off the 'splitter', have a coffee, and then return to a fully interactive view of Sydney, or wherever. They could now examine it from all angles and quickly ascertain what further work was needed, eg. if there were holes in one of the buildings. That allowed us to develop an efficient workflow and carry out the work on the scale needed to complete movie shots. We could also interactively visualise how other objects interacted with our cities. We could also use this tool to plan the photography and ensure we had all the coverage we needed.

In retrospect it all seems trivial. But at the time I don't think any other company could churn out entire city blocks the way we could. Today, Google do similar work on a much larger scale, and much more cleverly, with Street View.

And that was just one part of the technique that we marketed as "virtual cinematography" and which won George, Kim and I our award. But it's really important to remember that movie making is a big team effort. It took the work of many dozens of people to create our photorealistic cities, and the fact that I've chosen to write about my contribution doesn't mean that it was the only part that was important. It wouldn't have happened if George hadn't taught us the method originally, and if Kim hadn't believed in us and formed part of the company around our pipeline. And of course nothing wouldn't have happened if the artists didn't politely put up with our crudely engineered tools.

Ackowledgement
The example city image above was derived from a picture by Calvin Teo on wikipedia. I don't have the original photography we used.

Saturday, April 02, 2011

Interpolating Between Propoitions
Suppose a, b and c are propositions. Then a∧b→b∨c. It seems that the a is irrelevant to what's happening on the right hand side. We could remove it and still have a true proposition: b→b∨c. It seems that the c on the right hand side is also irrelevant so that this is also true: a∧b→b. In fact, we can "refactor" the original proposition as a∧b→b→b∨c. (By Curry Howard it *is* a kind of refactoring.)

Let's try using the methods I described in Part I to demonstrate the validity of a∧b→b∨c:

I've done two things slightly differently. I brought the negation inside the implication at the start. This means we start with an implication with a clearly defined left- and right-hand side. It makes no difference to the outcome. But I've also coloured terms coming from the left-hand side in bLue and those on the right in oRange. Eventually we see a b and a ¬b meet each other resulting in the big red X telling us that the negation of a∧b→b∨c is invalid. But notice how the b and ¬b came from opposite sides of the implication. This tells us that the implication is valid because of the b's and that the appearance of a and c plays no role in establishing validity.

Now suppose a blue a met a blue ¬a. They would have both originated from the left hand side, telling us that the left hand side was a contradiction, regardless of what's on the right. So if the original implication we wished to establish were written as L→R then we'd have established that we could factor it as L→⊥→R. Similarly, if an orange a met an orange ¬a we'd have established the invalidity of ¬R meaning that we get L→⊤→R.

In all three cases we've managed to find an "interpolating" formula F such that L→F→R with the property that F only refers to letters that occur *both* in L and R. It may seem intuitively obvious that we can do this. Irrelevant hypotheses shouldn't play a role in establishing an implication. In fact, this is generally true of propositional calculus and is known as the Craig Interpolation Lemma. It also holds for Provability Logic. This is certainly not an obvious fact. The interpolation property fails for many logics.

I'm only going to roughly sketch how the proof of the interpolation lemma looks. Essentially it will be a proof by construction with the construction being the following code. We'll implement a bunch of tableau rules that construct the interpolation lemma. We've also seen what some of these rules look like in the discussion above.

Constructing Interpolations

I coloured propositions orange and blue above. I can't use colour in Haskell code so I'll instead label propositions L and R using the following type:

> data SignedProp a = L a | R a deriving (Eq, Ord, Show)

Think of the colour or sidedness of a proposition as its 'sign'. Sometimes we'll make a selection based on the sign:

Now you may see why I had to use Haskell's ViewPattern extension. I want to do the same pattern matching as before on these propositions even though they are a different type. As all the rules view the patterns through propType we can achieve this by making SignedProp a an instance of PropTypeable:

With this framework in place the code to interpolate is quite small. Instead of returning a Bool or a diagram, these rules return Just the interpolating proposition when possible, otherwise a Nothing. But the code is also going to do something slightly more general. If we close a potential world containing propositions l1,l2,...,r1,r2,... we have shown that l1∧l2∧...∧r1∧r2∧... isn't valid. Ie. that l1∧l2∧...→¬r1∨¬r2∨... *is* valid. If the li are L-propositions, and the ri are R-propositions, then the rules I define below will find an interpolating formula for l1∧l2∧...→¬r1∨¬r2∨.... For the original case of p→q we prime it with p on the left and ¬q on the right:

The algorithm needs to know when a world has closed. The original validity rules returned a boolean. These rules return a Maybe, and we know a world closed if the algorithm succeeded in returning an interpolating proposition:

Now suppose we want to find an interpolating formula for l1∨l2→r. If we set up our tableau we get:

(Note I'm using l and r as metavariables here, so l1, l2 and r represent propositions made up of (possibly many) ordinary single letter variables.) If we complete the two sides of the divide using our interpolation algorithm recursively we'll find propositions fi such that li→fi→r. Hence we find l1∨l2→f1∨f2→r. Clearly the middle proposition only contains letters that appear on both sides of the original implication. A similar analysis allows us to find an interpolating proposition for l→r1∧r2. That gives us this rule:

> disjRule = \p _ _ tl tr -> select p (:\/) (:/\) <$> tl <*> tr,

Those are the rules for classical propositional calculus.

Now comes the tricky bit. If we draw a big red X in a subworld it allows us to back out and deduce an interpolating proposition in the parent world. The rule is simple but I'll leave the proof sketch as an appendix:

> processWorld = \p t -> select p Dia Box <$> t,

If any subworld of a world is closed, so is the parent world, so we can ignore all but the first closed subworld:

In my previous post I talked about how in provability logic we can define propositions implicitly. Now we're in a position to do the construction. Firstly we need to say precisely what we mean by a definition inside the language of provability logic, and then we need to say what ensures that definitions make sense.

An implicit definition of a proposition in provability logic is a function Prop -> Prop that doesn't analyze its argument. Informally, it defines a proposition p if f p is valid. Some candidates might be:

> def1 p = T --> p
> def2 p = p --> T

The first, def1, seems fine. It seems to uniquely single out p == T because def1 T is valid. But def1 (neg F) is also valid. So we can't uniquely pin down propositions, but only up to some sort of equivalence. In fact, we can use <-> as our equivalence relation.

But the second attempted definition is useless. Any proposition satisfies it. So we only consider a definition d to be valid if any two propositions satisfying it are equivalent. So we can say that d defines a proposition h if d p --> (p <-> h) is valid.

So here's the beginning of a function that attempts to find a proposition satisfying an implicit definition:

You can see how I've encoded our precondition for a definition to be good. I also used __p and __q to make sure we didn't clash with any letters in the definition. It's called beth because the theorem that ensures we can write the else clause is known as the Beth Definability Theorem. The last line is astonishingly short:

> else interp (d p /\ p --> (d q --> q))

Suppose we have proved that d(p)∧d(q)→(p↔q). It immediately follows that d(p)∧p→(d(q)→q). (It's essentially a bit of currying.) This is a candidate for Craig interpolation as the left hand side has no q and the right hand side has no p. So we can make a sentence h so that d(p)∧p→h and h→(d(q)→q)). The lettes p and q are just letters. If a proposition is true for q, it's also true for p. So with a little rearrangement we get that d(p)→(p↔h). Or in English, if p satisfies our definition, h is equivalent to it. So we've constructed an h that does what we want.

Fixed Points
Now it's a small step to get a fixed point. We just make a definition of fixed point and apply beth. This looks like it might work:

> isFixedPoint' f p = p <-> f p

Unfortunately it's possible for p and p' to be inequivalent and yet both satisfy isFixedPoint' f. We have to use a "stronger" definition. I'll leave the proof to Boolos's book, but what we'll do is assert not just that p is a fixed point, but also that it is provably so. So we use:

> box' a = a /\ Box a
> isFixedPoint f p = box' (p <-> f p)

In order for the uniqueness to work, every occurence of the argument of f must be inside a Box or Dia.

And that's it. We can now churn out fixed points to our heart's content.

Gödel's Second Incompleteness Theorem again
Let's find a proposition that asserts its own unprovability:

> godel = fixedpoint $ \p -> neg (Box p)

We get Dia T, which is the claim that arithmetic is consistent. So godel shows that if arithmetic is consistent, then it can't prove its consistency.

Here's a large number that I use for regression testing that I lifted from Boolos's book:

I am disconcerted that one of the easy looking examples in Boolos fails my tests. Given that the difficult cases agree with my code I think it is likely an error by Boolos, though that's a scary claim to make against one of the best known logicians in the world.

The company Theory Mine has been selling theorems. But they don't seem too interesting. Far better to generate theorems that generalise Gödel's work and tell you about the very nature of provability. Just send me $19.99 and I'll send you a certificate.

> main = do
> regress1
> regress2

Notes
Here is some Prolog code by Melvyn Fitting to do much the same thing. Note that code relies crucially on backtracking whereas my code explicitly searches through subworlds.

The function runTableau is a kind of generalised fold. Such things have associated induction principles. In this case it's a generalisation of the "Unifying Principle" defined by Smullyan in First-Order Logic. I guess it is an ordinary generalised fold over a tree structure representing fully expanded tableaux, but we never explicitly build such a structure.

Oh...and apologies (if anyone has come this far) for how long this took. Last year I threw together code to find fixed points in a couple of hours thinking "that was easy, I can blog about it". But I hadn't realised how many hours of work it would take to explain what I had done. And even if nobody came this far, the rubber ducking improved my own understanding greatly.

Appendix
Suppose we successfully showed that this closed:

Then the generalised Craig Interpolation Lemma says that for some suitable f, both of the following close:

The way we'll prove this is to follow what we must have done to get the first potential world to close, assume inductively that the interpolation lemma works for all of the potential subworlds that we entered.

So let's start by supposing that we used ◊l2 to open up a subworld, giving us:

If the subworld closed then recursively using interpolation we know that for some appropriate choice of g, these closed too:

Now consider trying to close these two worlds:

Using Worlds 1 and 2 above we find they close immediately:

And that now tells us that when we recurse back up, we can use f=◊g.

I started with world containing just 6 propositions so I seem only to have proved this for that case. But in fact, any combination of propositions just ends up with an argument that is substantially the same. The only thing that might change is that if we use ◊r2 to try to open up a subworld we find that f=◻g for some other g. And this is all summarised compactly by the rule processWorld above.