Introduction

In some cases, we need to use mathematical tools and equations to solve our real-world problems. Unfortunately, there are many cases when we cannot use the simple pen-and-paper techniques we learned in high school (yes, they lied to us in high school!) to solve our problem and we need to rely on numeric methods. For example, the following simple problem cannot be solved using pen and paper:

e^(-x) + x - 2 = 0

Many numeric methods rely on the ability of their user to compute gradients (partial derivatives) of functions. This is a tedious task that can usually be done automatically by a computer, and we will demonstrate the AutoDiff library achieves this objective. It allows the developer to concentrate on her business logic - modelling the problem she is trying to solve, and saves the time wasted on manually developing the gradients.

Background

Automatic Differentiation (AD) is a very old technique and may refer to quite different things. But all have one thing in common - all methods are given some representation of a function and allow computing the gradient at any given point. More information on various AD techniques can be found here.

Unfortunately many people are not aware of it and worked very hard to efficiently compute derivatives in order to develop gradient-based algorithms. One very well known example is the back-propagation algorithm for neural networks. When using AD, there is absolutely no need to develop a learning algorithm. Any AD library can efficiently compute the gradient for you, and you just need to use its results to update the weights.

The AutoDiff library described in this article takes the operator overloading approach. Expressions (or terms) are represents using objects. Operator overloading allows creating new terms from existing ones. Once a term is defined, the library allows to efficiently compute its value or gradient for a given assignment for the variables.

Using the Code

A Quick Introduction

After you have downloaded the library, you can create a C# project, add a reference and start playing! First, let's give a quick "hello world" introduction. Here is a small program that defines a function and evaluates its value and gradient.

More generally, we build our functions using objects of the base class Term. Variables are terms too.

We create new terms from existing ones using operator overloading and the TermBuilder class. See the definition of func in the snippet above.

The extension methods Evaluate and Differentiate are using to compute the value and the gradient, respectively, of a function defined by a given term.

AutoDiff on Steroids

The functions Evaluate and Differentiate are similar to running interpreted code. They work, but they are slow. We can speed up the computation by "compiling" our function and use the compiled version to efficiently evaluate or differentiate functions. Compiling is beneficial when we need to compute gradients of the same function multiple times. Fortunately, this is exactly what happens in most numeric algorithms. Here is another program that does exactly the same, but compiles the function first.

There is one small difference between normal an compiled terms. Their Differentiate method returns a tuple that contains both the value and the gradient. There are two good reasons:

Computing the gradient of the compiled form also evaluates the function too. So why not return the result?

Most numeric algorithms require both the value and the gradient at a given point. So it just saves an additional call to Evaluate.

Equation Solving Sample

Let us create a program that solves the equation given above:

e^(-x) + x - 2 = 0

We will use the well-known Newton-Raphson method of equation solving at our disposal. We will refer to it as the NR method. First, let's create a small method that given a function f, solves the equation f(x) = 0. Our function will be given as a compiled term. The NR method requires an initial value x0, that is close enough to the real solution and also a stopping criterion. In the following example, we will use the number of iterations as the stopping criterion.

And here is the output. You can use your favorite calculator to verify the correctness of the solutions:

X1 = -1.14619322062058, X2 = 1.84140566043696

More Information

The library was originally written to be used in numeric optimization algorithms. Most such algorithms rely on the user's ability to provide the function's value and its gradient at any given point. The variety of algorithms is huge (BFGS, L-BFGS, Conjugate Gradient, Gradient Descent) but they all need gradients. It was successfully used in my thesis research project and provided excellent results.

It was created to allow developers to concentrate on modelling the target functions and constraints that solve their real world problems, freeing them from the need to perform tedious mathematical tasks. Using AD in conjunction with an optimization library developers can experiment with many different functions quickly and easily. This library brings this ability to the .NET platform.

You can download the library and play with it for your research projects or real applications and are welcome to leave feedback. We need the feedback!

Some Algorithm Details

The library works using the Reverse-Accumulation algorithm. This algorithm allows computing function gradients in linear time in the size of our function. To be more precise, we can define our function with a DAG (= directed acyclic graph) that describes how it is computed. It's similar to compiler syntax tree. The terminal nodes are variables and constants, and mathematical operators are the internal nodes. It is very similar to .NET expression trees used for LINQ. The algorithm computes the gradient in linear time with respect to the size of this graph (= number of nodes + number of edges).

Surprisingly, linear time gradient computation is actually faster than approximating the gradient numerically. When we approximate, we need to evaluate our function once for each variable (by shifting its value a bit, and then evaluating the function). This means that approximation of the gradient runs in quadratic time - linear time for each evaluations, and we have n such evaluations. You get free lunch - both the accuracy of actually computing the gradient and the speed of the Reverse-Accumulation algorithm.

Unfortunately, Reverse Accumulation is poorly suited for single-variable functions. The NR sample you saw before is not a good usage for this library and a Forward-Mode AD library is a better alternative. However the NR sample is simple enough for this article and therefore was chosen. The real performance and ease of use benefit for this library is for multivariate optimization algorithms.

Thanks.Did you visit the homepage at Codeplex?Any feedback on the library itself? The tutorials?

Also - I am planning to write a follow-up article with a real world application and I have two things in mind - image de-blurring and feed-forward neural networks learning. Which one would you like to see?