Experiences with technical computing on the Microsoft .NET platform.

Main menu

Post navigation

Within the next couple of months we’ll be releasing an update to our Numerical Libraries for .NET that includes support for GPU-accelerated calculations using NVIDIA’s CUDA libraries.

We wanted to make it very easy to offload calculations to the CPU. This means the number of code changes should be minimized. At the same time, we wanted to make optimal use of the GPU. One of the most time-consuming parts of GPU computing is transferring data between CPU and GPU memory. In order to take full advantage of the processing power, data should be kept on the GPU as much as possible.

To support this functionality, we’ve introduced three new types: DistributedProvider, DistributedVector and DistributedMatrix. DistributedProvider is an abstract class that defines core functionality of the distributed computing platform such as memory management and computational kernels. At the moment, it has one concrete implementation: CudaProvider. We expect to add OpenCLProvider and possibly others (Xeon Phi, C++/AMP, MPI…) in the future.

DistributedVector and DistributedMatrix represent distributed (i.e. GPU-based) arrays that may or may not have a local copy in CPU memory. Data is copied to and from GPU memory only as needed. The result of, say, multiplying two matrices in GPU memory is a third matrix which also resides on the GPU. If further calculations are done using this matrix, its data is kept on the GPU. It is never copied to CPU memory, unless individual elements are accessed, or an operation is performed that is not supported on the GPU. In the latter case, the data is copied to local memory and the CPU-based implementation is used. We also fall back on the CPU-based code automatically if the GPU runs out of memory, or if there is no CUDA GPU at all.

So what does the code look like? Well, here is a small example that runs 100 iterations of the power method for computing the largest eigenvalue of a matrix using various configurations. The core computation is a matrix-vector product. We use the same code for the CPU-based and the GPU-based calculation.

As you can see, the only difference between the CPU and GPU versions is that we called MakeDistributed on the input arguments.

In our benchmark we added a third variation that touches the matrix on the CPU during each iteration. This forces the matrix to be copied to the GPU in each iteration, which is similar to what happens with naive offloading. Here are the results:

This is on a GPGPU-poor GTX 680, with 1/24 double-precision capacity, compared to 1/2 for Titan/Kepler cards, and using Intel’s Math Kernel Library version 11. It clearly shows that using our implementation, the GPU version is competitive for moderate sized matrices (n=500) and really charges ahead for larger problems, while the simple offloading technique never quite catches up.

One of our core objectives in creating our Numerical Libraries for .NET was to try to close the gap between the exploratory phase of development and the implementation phase, when the results are incorporated into applications. We wanted to address the needs of one of the fastest growing categories of users: data scientists.

Several tools are available already, but few come close to fulfilling our objectives. R is the most widely used platform for data analysis. Unfortunately the interaction with .NET is painful at best. For Python users, pandas is an excellent choice. It has been around for a few years and in that time has really matured in terms of API and performance. But of course, it’s Python only.

Blue Mountain Capital recently released Deedle, an open source library for exploratory data analysis written in F# by the brilliant Tomas Petricek and others. The library is fairly complete and has a nice API. especially for F# users. Using F#’s interactive REPL with Deedle, it is truly possible to do interactive exploratory data analysis with .NET.

Although a good effort has been made to make the library easy to use from other languages like C# and VB.NET, it is still more awkward than it could be. However, as we’ll see, the biggest problem with Deedle is performance.

Our own library for exploratory data analysis, which is well on its way to completion, is still convenient but maintains a high level of performance comparable to, and sometimes exceeding, that of pandas. I wanted to give you a preview of what’s coming.

Data Frame basics

The basic object in data analysis is a data frame. A data frame looks a lot like a database table. It is a two-dimensional table where the values in each column must be the same, but values in different columns can have different types. It also has a row index and a column index. The column index is used to select specific columns, usually with strings as keys.

Similarly, the row index can be used to select rows, but it also enables automatic data alignment. Imagine two time series, where some dates that are present in one series may not be present in another. If you want to compute the difference between the values in the two series, you want to make sure that you are comparing values for the same date. Here’s a quick example:

We first create two vectors indexed using some strings. You’ll notice that the second vector has an extra key (e), and another key (b) is missing. If we add the two vectors, here is what we get:

a: 11
b: -
c: 33
d: 44
e: -

The result has a value for all five keys. It has added values with corresponding keys correctly. Where a key was present in only one vector, a missing value was returned.

All vectors in the Extreme Optimization Numerical Libraries for .NET may have an index, and all matrices may have a row and a column index. Any operations involving indexed arrays will automatically align the data on the indexes.

Indexes on vectors and matrices are weakly typed in that the type of the keys is not part of the type definition. The keys are stored as strongly typed collections, of course.

A data frame is different from an indexed matrix in two ways. First, not all elements must be of the same type. Different columns can have different element types. Second, the indexes on a data frame are strongly typed, while the columns are weakly typed. You can easily access rows and columns by key, but if you want a strongly typed column, you need to do a tiny bit of extra work:

Notice that the row index on the data frame once again contains all keys in the two indexes, and missing values have been inserted in a vector where the key was not present.

Titanic survivor analysis in 20 lines of C#

Let;s do something more complicated. The Deedle home page starts with an example that computes the percentages of people who survived the Titanic disaster grouped by gender. The code shown is currently 20 lines. Using our library, the C# code is also 20 lines, and would look like this:

After reading the data into a DataFrame from a CSV file, we can compute a pivot table that shows the count for each combination of Sex and Survived. We then add a column that computes the total number of people for each gender. Finally, we construct a new data frame by dividing the first two columns by the total column.

F# has somewhat more succinct syntax, thanks mainly to its better type inference and support for custom operators. We can do the same thing in just 14 lines:

Comparing Deedle and Extreme Optimization Performance

So, how does this code perform? Well, we added some timing code, making sure to ‘warm up’ the code so we’re only testing the actual running time and not time spent JIT’ing the code. Here is what we found on an older i7 930:

That’s right: our implementation is around 500 times faster than Deedle. We were as surprised as anyone by these results. You just don’t get these kinds of performance differences very often. So we looked a bit further. We summarized our findings in a chart:

The chart compares running times (in ms) of 9 benchmarks using the Extreme Optimization library and Deedle running as both x86 and x64. Note that we had to use a logarithmic scale for the running time!

The conclusion: the Extreme Optimization library is 1 to 3 orders of magnitude faster than Deedle.

A couple of points on why these results aren’t entirely fair. First, the two versions don’t always do the same thing. For example: in the time series benchmark, we take advantage of the fact that the indexes are sorted to speed up the alignment. Verifying that the index is sorted is only done once. There are no doubt other cases where our code incurs a cost only once. Second, we use some specific optimizations that could be implemented in Deedle with relative ease. For example: we use an incremental method to compute the moving average, whereas Deedle re-computes the average for each window from scratch.

Still, a difference of 1-3 orders of magnitude is huge, and it begs for an explanation. I believe it’s rather simple, actually:

Performance was not a priority in the development of Deedle, especially in this initial release. (We used the first official release version, 0.9.12.) We expect performance to improve as the library matures. Still, 3 orders of magnitude will be hard to make up. Which brings me to my other point:

Sometimes, the F# language makes it a little too easy to write compact, elegant code that is very inefficient. For example: Types like tuples and discriminated unions are very convenient, but they are reference types, and so they have consequences for memory locality, cache efficiency, garbage collections…

Summary

Our upcoming data frame library will provide data scientists with a convenient way of interactively working with data sets with millions of rows. Compared to the only other library of its kind, Deedle, performance is currently 1 to 3 orders of magnitude better. A beta version of the library will be published soon. Contact me if you want to participate.

With the release of version 5.0 of our Numerical Libraries for .NET, we’ve also considerably improved our support for the F# language. In this article I want to walk you through setting up F# Interactive to use the Extreme Optimization Numerical Libraries for .NET, and show off some of the new features. This tutorial will focus on F# 3.0 in Visual Studio 2012. Most of the functionality also works with F# 2.0 in Visual Studio 2010.

Setting up F# Interactive

The first step is to bring the Extreme Optimization Numerical Libraries for .NET assemblies into the F# Interactive environment:

We do this by first adding the path to the Extreme Optimization binaries to the DLL search path using the #I command. (Note: the path depends on the installation folder and may be different on your machine.) We the main assembly Extreme.Numerics.Net40, and the F# specific assembly Extreme.Numerics.FSharp30.Net40. Finally, we tell fsi to use a pretty printer for objects that implement the ISummarizable interface. This will print out vectors and matrices in a user-friendly form.

Working with vectors and matrices

So let’s get our hands dirty right away. When we referenced the Extreme.Numerics.FSharp30.Net40 dll, this automatically opened several modules that make it easier to do linear algebra in F#. We can call Vector and Matrix factory methods, or we can use some convenient F# functions:

Notice the !! operator, which creates a vector from a list or sequence. We can access elements of vectors and matrices using the familiar F# syntax. Slices work too, as do integer sequences and predicates:

The last command solved the system of equations represented by the matrix A and solved it for the right-hand side b. We can also work with matrix decompositions. There are actually two ways. Let’s use the SVD as an example. First, we can get the factors of the decomposition.

Notice how we used partial application of the Bessel function to get a function that returns the Bessel function of order 0. We can do something similar in many other places, for example with binomial coefficients:

Now to the functional stuff. Two of the most basic operations on functions are differentiation and integration. Differentiation is done using the d function. Integration is done using integrate1. Partial application is supported, so you can create a function that computes the numerical derivative at any value you give it.

Finding the minimum or maximum of a function can be difficult because most software requires that you supply the gradient of the function. Although computing derivatives is high school calculus, it’s still very error prone. Automatic differentiation comes to the rescue here. We’ll use the SymbolicMath class, which contains all kinds of functions that take advantage of symbolic computations to obtain a solution.

One of the most famous optimization problems is the Rosenbrock function. We can find the minimum of this function in just one line of code:

Python is used more and more for numerical computing. It has always been possible to call into it from IronPython. However, IDE support was minimal and some of the more convenient features of Python, like slicing arrays, were not available.

In January, Microsoft announced the availability of Python Tools for Visual Studio. This is a big step forward in IDE’s for Python development on Windows.

Now, with our new IronPython interface library you can take advantage of the following integration features:

Sample programs

We’ve converted over 60 of our QuickStart samples to Python scripts. The samples folder contains a solution that contains all the sample files. To run an individual sample, find it in Solution Explorer and choose “Set As Startup File” from the context menu. You can then run it in the interactive window by pressing Shift+Alt+F5.

This week’s update of the Extreme Optimization Numerical Libraries for .NET includes the ability to solve quadratic programming (QP) problems. These are optimization problems where the objective function is a quadratic function and the solution is subject to linear constraints.

Our QP solver uses an active set method, and can be used to solve programs with thousands of variables. Quadratic programming has many applications in finance, economics, agriculture, and other areas. It also serves as the core of some methods for nonlinear programming such as Sequential Quadratic Programming (SQP).

We created some QuickStart samples that demonstrates how to use the new functionality:

In this post, I’d like to illustrate an application in finance: portfolio optimization.

Say we have a set of assets. We know the historical average return and variances of each asset. We also know the correlations between the assets. With this information, we can look for the combination of assets that best meets our objectives. In this case, our goal will be to minimize the risk while achieving a minimal return. Now let’s see how we can model this.

The unknowns in our model will be the amount to be invested in each asset, which we’ll denote by xi. All values together form the vector x. We will assume that all xi ≥ 0.

We can use the variance of the value of the portfolio as a measure for the risk. A larger variance means that the probability of a large loss also increases, which is what we want to avoid. If we denote the variance-covariance matrix of the assets by R, then the variance of the asset value is xTRx. This is what we want to minimize.

Now let’s look at the constraints. We obviously have a budget, so the sum of all amounts must be less than or equal to our budget. Say our budget is $10,000, then we must have Σ xi ≤ 10,000. We also want a minimum expected return on our investment, say 10%. If ri is the return associated with asset i, then the total expected return for our investment is Σ ri xi, and we want this value to be at least 10% of our total budget, or $1,000.

So, in summary, our model looks like this:

Minimize

xTRx

Subject to

Σ xi ≤ 10000 Σ ri xi ≥ 1000 xi ≥ 0

The last thing we need to do is formulate this quadratic program in terms of the optimization classes in the Numerical Libraries for .NET. The QuickStart samples show three ways of doing this: you can define the model directly in terms of vectors and matrices; you can build the model from scratch, or you can read it from a model definition file in MPS format. Since the problem above is already pretty much expressed in terms of vectors and matrices, we’ll use that approach here.

Quadratic programming problems are represented by the QuadraticProgram class. One of the constructors lets you create a model in so-called standard form. A quadratic program in standard form looks like this:

Minimize

c0+ cTx + ½xTHx

Subject to

AEx = bEAIx ≤ bIl ≤ x ≤ u

where c0 is a scalar, H, AE, and AI are matrices, and all other values are vectors. The subscript E denotes quality constraints, while the subscript I denotes inequality constraints. In our case, we don’t have equality constraints.

As a concrete example, we will work with four assets with returns 5%, -20%, 15% and 30%, respectively. Our budget is $10,000, and the minimum return is 10%. The implementation is quite straightforward. Noteworthy is that, since in the standard form the right hand side is an upper bound, we have to change the sign of the minimum return constraint which had a lower bound right-hand side.

After we’ve run this code, the vector x will contain the optimal amounts for each asset. It gives us: $3,453, $0, $1,069, $2,223. Not surprisingly, the second asset, which has had a negative return on average, should not be part of our portfolio. However, it turns out that the asset with the lowest positive return should have the largest share.

This week we introduced two new features in the Extreme Optimization Numerical Libraries for .NET.

Trigonometric functions with large arguments

The .NET Framework implementation of the trigonometric functions, sine, cosine, and tangent, relies on the corresponding processor instruction. This gives extremely fast performance, but may not give fully accurate results. This is the case for even fairly small arguments.

For example, Math.Sin(1000000.0) returns -0.34999350217129177 while the correct value is -0.34999350217129295. So we’ve already lost at least 2 digits. And things just go downhill from there. At 1012, we only get 8 good digits.

For even larger arguments, something quite unexpected happens. The result of Math.Sin(1e20) is… 1e20! The argument is returned unchanged! Not only is this a completely meaningless return value. It also can cause a calculation to fail if it relies on the fact that sine and cosine are bounded by -1 and +1.

To understand what is going on, we need to go back to the implementation.

Math.Sin, Math.Cos and Math.Tan rely on processor instructions like fsin and fcos. These instructions work by first reducing the angle to a much smaller value, and then using a table-based approach to get the final result. Moreover, according to the documentation, the argument must be strictly between –263 and 263. The return value is not defined for arguments outside this interval. This explains the complete breakdown for large arguments.

The explanation for the gradual loss of accuracy is more subtle. As I said, the computation is done by first reducing the argument to a smaller interval, typically (–π, π]. The argument x is rewritten in the form

x = n π + f

where n is an even integer, and f is a real number with magnitude less than π.

The first step is to divide the number by π. The quotient is rounded to the nearest even integer to give us n. We get the reduced value f by subtracting this number times π from x. Because of the periodicity of the trigonometric functions, the value of the function at x is the same as the value at f, which can be calculated accurately and efficiently.

Now, if you start with a value like 10300, you’ll need to divide this number by π with at least 300 digits of precision, round the quotient to the nearest even number, and subtract π times this number from 10300. To do this calculation accurately over the whole range, we would need to work with a value of π that is accurate to 1144 bits.

Even today, this is just a little bit beyond the capabilities of current hardware. Moreover, it could be considered wasteful to spend so much silicon on a calculation that is not overly common. Intel processors use just 66 bits. This leaves us with just 14 extra bits. The effects will begin to show with arguments that are larger than 216.

Accurate argument reduction for trigonometric functions

So, there are two problems with the standard trigonometric functions in the .NET Framework:

The accuracy decreases with increasing size of the argument, losing several digits for even modest sized arguments.

When the argument is very large, the functions break down completely.

To address these two problems, we’ve added Sin, Cos and Tan methods to the Elementary class. These methods perform fully accurate argument reduction so the results are accurate even for huge values of the argument.

For example, the 256 digit floating-point number 6381956970095103 2797 is very close to a multiple of π/2. In fact, it is so close that the first 61 binary digits after the period of the reduced value are 0. As expected, Math.Cos gives the completely meaningless result 5.31937264832654141671e+255. Elementary.Cos returns -4.6871659242546267E-19, which is almost exactly the same as what Wolfram|Alpha gives: 4.6871659242546276E-19. The relative error is about the size of the machine precision.

What about performance? Thanks to some tricks with modular multiplication, it is possible to do the reduction in nearly constant time. Moreover, since the reduction only needs to be done for larger arguments, the overhead in most situations is minimal. Our benchmarks show a 4% overall slowdown when no reduction is needed. For smaller values, the operation can take close to twice as long, while in the worst case, for large arguments that require a full reduction, we see a 10x slowdown.

Transcendental functions for System.Decimal

Also new in this week’s update is the DecimalMath class, which contains implementations of all functions from System.Math for the decimal type. All trigonometric, hyperbolic, exponential and logarithmic functions are supported, as well as the constants e and π.

The calculations are accurate to the full 96 bit precision of the decimal type. This is useful for situations where double precision is not enough, but going with an arbitrary precision BigFloat would be overkill.

We just released an update to our Extreme Optimization Numerical Libraries for .NET that adds some new classes. We’ve added two non-parametric tests: the Mann-Whitney and Kruskal-Wallis tests. These are used to test the hypothesis that two or more samples were drawn from the same distribution. The Mann-Whitney test is used for two samples. The Kruskal-Wallis test is used when there are two or more samples.

For both tests, the test statistic only depends on the ranks of the observations in the combined sample, and no assumption about the distribution of the populations is made. This is the meaning of the term non-parametric in this context.

The Mann-Whitney test, sometimes also called the Wilcoxon-Mann-Whitney test or the Wilcoxon Rank Sum test, is often interpreted to test whether the median of the distributions are the same. Although a difference in median is the dominant differentiator if it is present, other factors such as the shape or the spread of the distributions may also be significant.

For relatively small sample sizes, and if no ties are present, we return an exact result for the Mann-Whitney test. For larger samples or when some observations have the same value, the common normal approximation is used.

The Kruskal-Wallis test is an extension of the Mann-Whitney test to more than two samples. We always use an approximation for the distribution. The most common approximation is through a Chi-square distribution. We chose to go with an approximation in terms of the beta distribution that is generally more reliable, especially for smaller sample sizes. For comparison with other statistical software, the chi-square p-value is also available.

We created some QuickStart samples that illustrate how to use the new functionality:

This is now the fifth year I’ve been writing numerical software for the .NET platform. Over these years, I’ve discovered quite a few, let’s call them ‘unfortunate’, design decisions that make writing solid and fast numerical code on .NET more difficult than it needs to be.

What I’d like to do in the coming weeks is list some of the improvements that would make life easier for people in our specialized field of technical computing. The items mentioned in this post aren’t new: I’ve written about them before. But it’s nice to have them all collected in one spot.

Fix the “Inlining Problem“

First on the list: the “inlining“ problem. Method calls with parameters that are value types do not get inlined by the JIT. This is very unfortunate, as it eliminates most of the benefit of defining specialized value types. For example: it’s easy enough to define a complex number structure with overloaded operators and enough bells and whistles to make you deaf. Unfortunately, none of those operator calls are inlined. You end up with code that is an order of magnitude slower than it needs to be.

Even though it has been the top performance issue for several years, there is no indication yet that it will be fixed any time soon. You can add your vote to the already sizeable number on Microsoft’s product feedback site.

Support the IEEE-754 Standard

Over 20 years ago, the IEEE published a standard for floating-point arithmetic that has since been adopted by all major manufacturers of CPU’s. So, platform independence can’t be an issue. Why then is it that the .NET Framework has only the most minimal support for the standard? Surely the fact that people took the time to come up with a solid standard, and the fact that it has enjoyed such wide support from hardware vendors should be an indication that this is something useful and would greatly benefit an important segment of the developer community.

This is another topic I’ve written about before. In a nutshell: C# and VB.NET don’t support custom overloaded assignment operators at all. C++/CLI supports them, and purposely violates the CLI spec in the process – which is a good thing! One point I would like to add: performance isn’t the only reason. Sometimes there is a semantic difference. Take a look at this code:

which could be part of the code for computing the LU Decomposition of a matrix. The GetRow method returns the row of the matrix without making a copy of the data. The code inside the loop subtracts a multiple of the pivot row from the current row. With the current semantics where x -= y is equivalent to x = x – y, this code does not perform as expected.

What I would like to see is have the CLI spec changed to match what C++/CLI does. Compound assignment operators should be instance members.

Still to come: a proposal for some (relatively) minor modifications to .NET generics to efficiently implement generic arithmetic, better support for arrays, and more.

Microsoft today announced their latest addition to the .NET family: the Dynamic Language Runtime (DLR). As Jim Hugunin points out, it is based on the IronPython 1.0 codebase, but has been generalized so it can support other dynamic languages, including Visual Basic.

Now, the word ‘dynamic’ here is often misunderstood. Technically, the word dynamic refers to the type system. The .NET CLR is statically typed: every object has a well-defined type at compile time, and all method calls and property references are resolved at compile time. Virtual methods are somewhat of an in-between case, because which code is called depends on the runtime type, a type which may not even existed when the original code was compiled. But still, every type that overrides a method must inherit from a specific parent class.

In dynamic languages, the type of variables, their methods and properties may be defined at runtime. You can create new types and add properties and methods to existing types. When a method is called in a dynamic language, the runtime looks at the object, looks for a method that matches, and calls it. If there is no matching method, a run-time error is raised.

Writing code in dynamic languages can be very quick, because there is rarely a need to specify type information. It’s also very common to use dynamic languages interactively. You can execute IronPython scripts, but there’s also a Console that hosts interactive IronPython sessions.

And this is where it gets confusing. Because leaving out type information and interactive environments come naturally to dynamic languages, these features are often thought of as properties of dynamic languages. They are not.

Ever heard of F#? It is a statically typed, compiled language created by Don Syme and others at Microsoft Research. It can be used to build libraries and end-user applications, much like C# and VB. But it also has an interactive console and eliminates the need for most type specifications through a smart use of type inference.

F# is not a dynamic language in the technical sense: it is statically typed. But because it has an interactive console and you rarely have to specify types, it is a dynamic language in the eyes of a lot of people. In fact, at the Lang.NET symposium hosted by Microsoft last August, people were asked what their favorite dynamic language is. Many answered with F#. And these were programming language designers and compiler writers!

Anyway, the point I wanted to make with this post is that the new Dynamic Language Runtime has great support for both the technically dynamic languages (dynamic types) and the perceived as dynamic features like interactive environments. I hope the distinction between these two aspects will be clarified in the future.

Last week, the latest edition of the list of the 500 fastest supercomputers was released. Two recent developments make this list interesting.

The arrival of multicore processors. Even though their presence is still modest on the current list, expect their share to rise. Intel is targeting 32 cores on a chip by 2010.

Microsoft made its entry on the scene with Windows Compute Cluster Server 2003, an enhanced Windows 2003 Enterprise Server version tweaked for High Performance Computing. The first (and so far the only) entry on the Top500 list is at the National Center for SuperComputing at the University of Illinois. It will be interesting to see how this number grows in the coming years. At the very least, it will give some indication of the headway Microsoft is making in the HPC market.

Some trivia:

The #1 spot is still held by IBM’s BlueGene/L supercomputer at the Lawrence Livermore National Laboratory. At over 280TFlops, this monster is faster than an IBM PC with 8087 co-processor by a factor of roughly one billion. That’s right: it’s as fast as 1,000,000,000 original IBM PC’s!

The first Top500 list was published in June 1993. It’s interesting to note that one dual processor machine based on Intel’s latest dual-core processors would, at 34.9GFlops, take the #2 spot on that original list. Today’s average desktop would make it into the top 100…