Automatic Differentiation: The most criminally underused tool in the potential machine learning toolbox?

Update: (November 2015) In the almost seven years since writing this, there has been an explosion of great tools for automatic differentiation and a corresponding upsurge in its use. Thus, happily, this post is more or less obsolete.

I recently got back reviews of a paper in which I used automatic differentiation. Therein, a reviewer clearly thought I was using finite difference, or “numerical” differentiation. This has led me to wondering: Why don’t machine learning people use automatic differentiation more? Why don’t they use it…constantly? Before recklessly speculating on the answer, let me briefly review what automatic differentiation (henceforth “autodiff”) is. Specifically, I will be talking about reverse-mode autodiff.

(Here, I will use “subroutine” to mean a function in a computer programming language, and “function” to mean a mathematical function.)

It works like this:

You write a subroutine to compute a function . (e.g. in C++ or Fortran). You know to be differentiable, but don’t feel like writing a subroutine to compute .

You point some autodiff software at your subroutine. It produces a subroutine to compute the gradient.

That new subroutine has the same complexity as the original function!

It does not depend on the dimensionality of .

It also does not suffer from round-off errors!

To take a specific example, suppose that you have written a subroutine to evaluate a neural network. If you perform autodiff on that subroutine, it will produce code that is equivalent to the backpropagation algorithm. (Incidentally, autodiff significantly predates the invention of backprop).

Why should autodiff be possible? It is embarasingly obvious in retrospect: Complex subroutines consist of many elementary operations. Each of those is differentiable. Apply the calc 101 chain rule to the expression graph of all these operations.

A lot of papers in machine learning (including many papers I like) go like this:

Invent a new type of model or a new loss function

Manually crank out the derivatives. (The main technical difficulty of the paper.)

Learn by plugging the derivatives into an optimization procedure. (Usually L-BFGS or stochastic gradient.)

Experimental results.

It is bizarre that the main technical contribution of so many papers seems to be something that computers can do for us automatically. We would be better off just considering autodiff part of the optimization procedure, and directly plugging in the objective function from step 1. In my opinion, this is actually harmful to the field. Before discussing why, let’s consider why autodiff is so little used:

People don’t know about it.

People know about it, but don’t use it because they want to pad their papers with technically formidable derivations of gradients.

People know about it, but don’t use it for some valid reasons I’m not aware of.

I can’t comment on (3) by definition. I think the answer is (1), though I’m not sure. Part of the problem is that “automatic differentiation” sounds like something you know, even if you actually have no idea what it is. I sometimes get funny looks from people when I claim that both of the following are true.

.

Further evidence for (1) is that many papers, after deriving their gradient, proclaim that “this algorithm for computing the gradient is the same order of complexity as evaluating the objective function,” seeming to imply it is fortunate that a fast gradient algorithm exists.

Now, why bother complaining about this? Are those manual derivatives actually hurting anything? A minor reason is that it is distracting. Important ideas get obscured by the details, and valuable paper space (often a lot of space) is taken up for the gradient derivation rather than more productive uses. Similarly, valuable researcher time is wasted.

In my view a more significant downside is that the habit of manually deriving gradients restricts the community to only using computational structures we are capable of manually deriving gradients for. This is a huge restriction on the kinds of computational machinery that could be used, and the kinds of problems that could be addressed.

As a simple example, consider learning parameters for some type of image filtering algorithm. Imaging problems suffer from boundary problems. It is not possible to compute a filter response at the edge of an image. A common trick is to “flip” the image over the boundary to provide the nonexistent measurements. This poses no difficulty at all to autodiff, but borders on impossible to account for analytically. Hence, I suspect, people simply refrain from researching these types of algorithms. That is a real cost.

Often, derivatives are needed for analysis, not just to plug into an optimization routine. Of course, these derivatives should always remain.

When derivatives are reasonably simple, it can be informative to see the equations. Oftentimes, however, an algorithm is derived for computing the gradient. In this cases, it would almost always be easier for someone trying to implement the method to install an autodiff tool than try to implement the gradient algorithm.

Update: Answers to some questions:

Q) What’s the difference between autodiff and symbolic diff?

A) They are totally different. The biggest difference is that autodiff can differentiate algorithms, not just expressions. Consider the following code:

function f(x)
y = x;
for i=1...100
y = sin(x+y);
return y

Automatic differentiation can differentiate that, easily, in the same time as the original code. Symbolic differentiation would lead to a huge expression that would take much more time to compute.

Q) What about non-differentiable functions?

A) No problem, as long as the function is differentiable at the place you try to compute the gradient.

I think the main reason is your first one – people don’t know about it – followed by a reluctance by researchers to use “black box” software at the heart of their algorithms. If everyone else uses numerical or analytic differentiation and I don’t understand how AD works (and don’t have the time to learn about it) I’m not going to use it.

I can’t speak for the machine learning community, but in the physically based animation community most derivatives are also hand-calculated. A big reason for this is not a lack of knowledge about autodiff, and several papers that published interesting results do appear which uses autodiff. But a major point against autodiff is the removal of the programmer from the code. In performance-critical applications, which most simulations systems are, having hard-coded and optimizable code for your derivatives are crucial. Just by modifying the calculation of derivatives and parallelizing parts of this sequence I’ve managed to produce 200% speed increases in simulation systems. Autodiff would not allow me to manually mess with the code calculating derivatives.

That is interesting that it is used in physical animation. Hand-coding derivatives when performance is paramount seems very reasonable to me. I do this myself for situations where derivatives are easy to calculate, like multi-layer perceptrons. However…

1) I’ve never seen a paper that explained “the reason we are calculating the gradient by hand here is to achieve the best possible performance. (Though, of course, this might just be understood to go with out saying).

2) Even if you do calculate the gradient yourself, what you are going to do is (usually) the same thing as what autodiff does. One could, at least theoretically, use a source-code transformation tool to get a first computation of the gradient, and then start the optimization process from there.

Are there any good references, in published papers, on autodiff? All I’ve seen so far are people’s blog posts. I’d hope there were some textbooks or survey papers, or something. Any pointers to any such thing?

I don’t know of a really great introduction. It is important to keep in mind that there are two major types of autodiff: forward-mode, and reverse-mode. Forward-mode is efficient for functions taking one input and producing many outputs. Reverse-mode is efficient for functions taking many inputs and producing a single output. (That output would be a “loss function” in machine learning)

Thanks for pointing that out. I really like that talk— in particular I find it interesting that he suggests the community is being held back by the complexity of optimizing big powerful models. As he says himself, though, the method he is using of defining “forward-prop” and “back-prop” methods for each module is a special case of autodiff, intended for neural-network-ish models. Traditional autodiff doesn’t require you to specify these methods, and applies more generally. (At least in theory– available packages for autodiff are all somewhat imperfect.)

Justin, thanks for the link to the slides. They are a good introduction, and the picture on one of the last few slides saying when forward- and reverse-autodiff are appropriate explain the difference between the two works best for me.

Check out admb-project.org. ADMB is a nifty open source wrapper and 4th generation language build on C++ that makes it easy to build automatic differentiation into applications. I know the folks promoting ADMB are eager to enlarge the areas of application.

It would really make your clarification clearer if you included, along with your example function, what the code produced by automatic differentiation for it would be. 🙂 (I can guess, and I’m fairly sure, but it would better for you to include it.)

As you’ve discovered, the grandaddy machine-learning algorithm of them all, back-propagation, is nothing but steepest descent with reverse mode automatic differentiation. This means that if you wrote a neural network that simply evaluated, but didn’t know how to learn, and passed it into a routine for optimisation using steepest descent by reverse mode AD, then it would use back-propagation without anyone having to even know that such a thing as back-propagation existed. Internally it would actually perform exactly the same operations as back propagation.

As a consequence, you also get for free how to optimise all kinds of variations on the neural network theme. (Eg. neural networks with feedback loops etc.) There are all kinds of papers out there that do no more than solve a problem that a compiler+optimisation library can solve on their own without human intervention. 🙂

A simple neural network pattern recognition program is one of the examples in the ADMB documentations. See chapter 15 of the AUTODIF manual http://admb-project.org/documentation. If memory serves, it was developed in about 1990.
Cheers,
John

Thanks for all the pointers. I’ve been struggling with this literature for a couple of days now and while I’ve managed to get a couple of the systmes to work for toy problems, I’m still pretty lost.

I get the difference between forward and backward modes and between source-code transformation and operator overloading, but what I can’t figure out is which piece of software, if any, will allow me to compute what we need.

We (in a project led by Andrew Gelman at Columbia) want to sample from the posterior of a multilevel generalized linear models with lots of interactions (producing thousands of predictors organized into dozens of levels, where each level has its own multivariate normal prior, which itself gets a scaled inverse Wishart prior; see, for example, Gelman and Hill’s regression book for simpler examples along the same lines).

Hamiltonian Monte Carlo seems like a good approach because it overcomes the slow mixing of Gibbs sampling when there are lots of correlated parameters. It may also be much more efficient to compute a single gradient several times (for each step in HMC) than to compute conditionals one at a time.

Thus, I need the gradient of the log of a big joint probability function with 2K or so variables, the implementation of which makes calls to matrix libs and stat libs.

We’d like to write everything in C(++) if possible. And it has to have a license that allows redistribution — we want to build a general tool like BUGS or JAGS and distribute it, ideally with a license at least as free as the GPL, but ideally with BSD/Apache licensing.

Awesome blog entry and great links. I think I finally understand automatic differentiation and its scope.

One question: How well does it cope (in theory and in practice) with recursive functions? I parse function expressions from input files and store them in binary tree structures. To evaluate the function I call the evaluate() function recursively and then propagate the values from the branches to the root.

Could automatic differentiation software create a version of the recursive evaluate() to give derivatives? (I guess I could implement the ideas manually, I did something similar to create symbolic derivatives.)

Bob- The best way that I know how to do this would be to use one of the tools that does taping using operator overloading. For example, sacado will give you an “adouble” type. This can then be used in a matrix library that can take user-defined types (I like Eigen for c++). You definitely won’t be able to do autodiff to arbitrary matrix and stats libs, though, unless they can use Eigen and/or adoubles.

I can’t recommend Eigen enough! It would appear to be insanity to re-implement all the standard matrix operations from scratch, but that’s what they did, it improves the user interface hugely, and they have reasonably strong performance data. You don’t even really have to “install” it. You just point to the directory while compiling. I’ve successfully integrated RAD variables (http://www.cs.sandia.gov/~dmgay/) instead of ADOL-C.

If you find that someone has put together the autodiff package you describe, I’d love to know about it. Personally, I think it is the lack of existence of any such package that makes autodiff so underutilized now…

RAD is now part of the Sacado subproject of Trilinos. The bigger project also contains forward modes and some caching modes. But I love how simple the original five-file RAD distro is: one header file, one source file, a makefile, a demo and a README!

Sacado (including RAD) has the cleanest C++ design of everything I’ve seen that’s redistributable. Specifically, you don’t really need to muck about with types or special operations in the code you want to differentiate as long as it sticks to the built-in math functions (see rad.h). Sacado was a pain to install because it’s part of the enormous Trilinos package and has a complex configuration step.

Eigen still looks like a work-in-progress for operations above the BLAS level, but then so do most of the other templated C++ matrix libraries like Boost. They all seem to require you do your own higher-level linear algebra on top of relatively simple libraries for operations like LU factorization.

Boost is also super easy to install and also contains lots of the templated stats functions I need in the Boost Math Toolkit. Unfortunately, it completely falls down in not implementing any of the multivariate distros, not even multinomial or multivariate normal, much less the conjugate prior for multivariate covariance matrices, the Wishart distribution. None of this stuff’s very hard to implement, but it’s error prone if you do it quickly and time consuming if you do it right.

I think I can answer the question as to why more people don’t use auto dif: the tools don’t actually work in practice with enough generality.

I’ve managed to get HMC going with both the original RAD and the newer Trilinos::Sacado::Rad. It works just fine for simple functions I control, like a hand-defined bivariate normal density function.

Unfortunately, neither RAD+Eigen or Sacado+Eigen works to compute gradients of matrix inverses.

First off, neither RAD nor Sacado redefine all the cmath functions. RAD’s missing simple things like abs() and constructors for ints, whereas Sacado’s added these, but is still missing things like floor() [very common in definitions of log gamma functions, as in Boost’s templated lgamma()].

After adding constructors for ADvar(int) and overloading abs() for auto-dif, it (a) computes the wrong answer using ADvar instances, and then (b) seg faults if I try to auto-dif.

Sacado changed the type signatures of RAD so that it’s now impossible to compile an operation like (cond ? x : x*x) when x is an ADvar because x is of type ADvar, but x*x is of type ADvari (RAD defined x*x to be of type ADvar, which is wasteful in memory in some cases). Thus I can’t even compile Eigen matrix inverses with Sacado.

Eric Phipps, one of the developers of Sacado (David Gay’s no longer at Sandia), suggested I wrap a constructor around x*x, writing (cond ? x : ADvar(x*x)), but I can’t (or at least don’t want to) insert these wrappers into Eigen (it’d require template specializations and a whole lot of cut and paste).

We’re considering scrapping auto-dif altogher because we just don’t know how much time it’ll take to get all these libs working together or even if it’s possible.

Wow, brutal! I think I remember patching some of that stuff onto RAD before. If I recall, it was relatively easy to do, just following the templates of the functions that are defined? I could never be bothered to get the whole Sacado package working. It would be something of a pain, but wouldn’t it be possible for you to patch all the stuff you need on top of RAD?

Incidentally, if you could post somewhere your experiences getting this stuff to work together with Eigen it would be really appreciated! There is a way-too-short tutorial on doing ADOL-C with Eigen on the Eigen page, but not much else.

I think I have to agree with you– AD tools just aren’t there as much as they need to be. My impression is that it is just too damn hard to write these tools for C++. Really, what we are trying to do with autodiff is manipulate the abstract syntax tree of the programming language. To do that efficiently, it should be done at compile time, but C++ metaprogramming was never intended for such advanced uses. Maybe if we were doing all our numerical programming in LISP…

But I can’t find any library that can smoothly handle this simple example without major rewrite (obviously this code is small enough to rewrite, but I hope I can solve it without it so I’ll be able to apply it to my real problem).

Does anyone know of an AD library that will be able to handle structs, recursive calls etc?

The short version is that everything works with our own auto-dif implementation but trying to multiply a double Matrix by an auto-dif matrix.

This was prompted by a discussion on their mailing list. Eventually, I’ll need to override the simple implementation to vectorize the gradient calcs for the matrix ops. Some of them have fairly easy to compute gradients.

Another issue is with the way Eigen uses template expressions for lazy evaluation. Even though I overrode matrix multiplication for double times auto-dif, I couldn’t override their intermediate classes, like the transpose expression class. That’s why I finally got in touch with them — they’re working on a fix.

RAD also had problems with seg faults. And it’s not implemented very efficiently. And there’s no license info and the author’s not responding to e-mails (though Sacado has an active mailing list and Eric Phipps responded right away).

Matt Hoffman and I found it easier to rewrite our own AD from scratch. We made it more OO, do the operator new stuff a bit more cleanly, expose proper namespaces, etc. We pretty much followed RAD’s basic design, but we used a stack for the derivative propagation with a virtual base class to apply the chain rule.

We’ll be releasing our auto-dif under a BSD license along with the rest of our Hamiltonian Monte Carlo implementation (I’m working on a compiler for BUGS-like languages — more fun with templates via Boost’s Spirit Qi parsing framework).

Sacado changed the output types in some cases to not match the input types, so that it won’t work with many operations, like the ternary operator. C++ gets confused on how to do the conversions. The only way to get around that is to change the code that’s being auto-diffed, which isn’t exactly an option for me for Eigen!

Our version’s several times faster than RAD and Sacado, and even a bit faster than CppAD’s reverse mode in single-threaded mode (they’re neck and neck in thread safe mode), though remains fully OO and user extensible. I also implemented all of the cmath and C99 libs for reverse-mode auto-dif.

There are certainly projects working on functional programming versions of auto-dif. For instance, Noah Goodman at Stanford is working on this in the context of the Church system.

It’s definitely more efficient to code generate than to do everything with templates. But that’s rather daunting for a language as complex as C++!

There’s no problem if you template out the scalars in your structs. Then you can get what you already have by instantiating with double and you can get an auto-dif version by instantiating with an auto-dif variable.

It’s in Python. It allows you to define a graph of computational operations based on tensor (mostly matrix, vector) inputs. As most ops have a C (or even CUDA) implementation, it can weave that symbolic graph into C code and compile it, allowing fast execution on CPU (and most of the time GPU too). It’s used extensively for machine learning at the lab I studied in, which is where it originates from (LISA, UMontreal).

Notably, concerning automatic differentiation: it can automatically compute the gradient of the function (which is another graph, also compilable/executable). It does this by chain rule, and because each op specifies how to compute the gradient of its outputs relative to its inputs (just like you mention).

But concerning not being able to “differentiate algorithms”: there’s the Scan operation which allows you to symbolically specify loops, and differentiate the result. I haven’t used that Op myself, though, but I’m pretty sure it covers most basic cases.

It has a BSD license and we’re hoping people feel free to use all or parts of it in their own work.

It’s written in C++ with taped reverse-mode auto-dif implemented with templates, and fully integrated with all the C++ special functions and with Eigen and with our own probability function library. Our auto-dif’s both more extensible and faster than CppAD or Sacado. I still need to write doc on the auto-dif itself and then put a link up on autodiff.org and on the Wikipedia page; but if you know auto-dif, it’s very straightforward, and our unit tests provide extensive simple examples of its use.

It also has a programming language like WinBUGS that lets users express models more naturally, including expressing constraints for which Jacobian adjustments are automatically calculated (and auto-diffed).

Oh, and there’s a complete R interface (Python, MATLAB and Julia are on the to-do list, too). There’s a 250 page manual with many statistical modeling examples and a full definition of the language and its functions.

The next release (out in a week or two) will add comparison operators, logical operators, and Newton-Raphson optimization. We’ll probably put in the quasi-Newton L-BFGS sooner or later. We’re also planning to do forward-mode templated auto-dif in the usual way so that we can calculate Hessians for both optimization and Riemann Manifold HMC. We got an NSF grant to continue work on it, so we’ll be able to continue to add to its functionality and support it.

P.S. Even for pretty hairy functions, our compile times are a lot better than you reported in the follow-on post. But they’re still pretty bad. clang++ is much better than g++ in terms of compile times.

First off, thank you very much for this site. I gives ignorant people like myself a nice way in and loads of good pointers. Coming from an engineering background (AD is not known in my field at all) where gradient based solution procedures are quite common, and considering massive systems of up to millions and more variables, my main concern is speed at run-time. So here is my (maybe naive) current assumptions

A) am i wrong to assume that AD, given a well written initial function f(x), will create code which will be very similar to what i would have done on paper anyhow and then typed in similarly? Though i am aware that further optimisation choices can be done of course in the case of manual code.

B) Considering the following paper which compares AD to symbolic derivation (SD), http://link.springer.com/article/10.1023%2FA%3A1015523018029#page-1
it is apparent that AD over SD wins in everything from ease of usage, memory requirements, flexibility of application, compile and link time. However the crucial part is the “execution-time” or run-time that i takes for calculating a certain number of numerical values of the derivatives. In an iterative solution scheme with many steps and many iterations, the run-time performance is paramount.

Any advise on current (the paper is 10 years old) numerical performance at run-time of AD as compared to maybe SD obtained code.

I have used Adept and found it to work quite well. A few caveats: (1) obscure functions such as cbrt() are sometimes missing (so I’m not sure how well it would interoperate with say template matrix libraries), and (2) because AD memory is proportional to the number of operations, in practice it is necessary to clear the tape periodically to avoid accidentally using too much memory.

Does anyone have a reference for the precise time complexity of reverse mode AD? It seems like if you have a function of n variables that uses m arithmetic operations (prior to the use of AD), then it should take something like O(n + m) time. However I’m unable to find a reference for this.

If the expression graph has N nodes and M edges, it’s O(N+M), assuming the cost of calculating the result of each operation is constant and the cost of calculating each partial is constant. The constant factor depends on the implementation, of course.

Autodiff is not always a silver bullet when computing derivatives of a cost function. First of all, it will take a lot of memory and time if the problem size gets large because the computation graph built by AD. Manually deriving analytic derivatives, when possible, should be favored. Second, AD requires continuous functions and no conditional branches. Even though sometimes you can still apply AD regardless of the property of the function, you will probably end up with a local minimum or some erroneous solution.
The manually derived derivatives in those papers may only serve the purpose of program efficiency. A paper should be judged on its novelty of problem formulation, however, instead of on the derivation effort made, unless the derivation is on a famous unsolved problem.

Absolutely. Usually you wind up with something like 8 to 16 bytes per expression if you’re doing reverse-mode autodiff for effcient gradients. Forward mode is less efficient for gradients, but doesn’t introduce much memory overhead at all.

Manually deriving gradients is painful and error prone for large models, but when you can do it and then implement the reasonable dynamic programming to propagate the chain rule through a complex expression (the adjoint calculation in autodiff), it should be faster (maybe by a factor of 2 to 5 or even more if there’s more structure to exploit).

There’s no restrictions against branching in most templated C++ autodiff systems. Certainly not in Stan, Trilinos/Sacado, CppAD, or Adept. The problem with branching is that it usually introduces discontinuity and/or non-differentiability, which usually violates whatever assumptions are being made by the calling function asking for derivatives. For instance, if we step past a discontinuity in our Hamiltonian dynamics simulator in Stan, we’re violating the energy conservation properties of the entire algorithm.

Stan’s autodiff can be used standalone, but we don’t have separate doc on how to build it or use it. We’re actually working on that now, because lots of people are asking for it because of the extensive library we implement of linear algebra and probability functions.

Stan’s faster than everything we’ve benchmarked (Sacado, CppAD, Adept), and we’ll put comparisons in the paper. I don’t have experience with Ceres. Is there a reference to their autodiff somewhere?