A common question we're asked about CULA is whether it's possible to simply swap out an existing LAPACK library for CULA. The short answer to this question is YES! In CULA 1.1 we introduced our "Bridge" interface, which provides a source-level compatible interface so that you can swap out whatever LAPACK version you are using for CULA. By using this interface, you can get CULA's GPU acceleration without making any modifications to your code.

Our Bridge interface provides a layer that matches the interfaces of three linear algebra packages: CLAPACK, MKL, and ACML. This layer is implemented as a thin source code wrapper that matches the function names and signatures of these packages. All you need to do to use this capability is swap in a CULA header and add CULA to your program's link settings.
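To make the idea of a thin, signature-matching wrapper concrete, here is a minimal sketch in C. It is not CULA's actual code: `legacy_saxpy_` and `accel_saxpy` are hypothetical names invented for illustration, and a plain CPU loop stands in for the accelerated path so the sketch runs anywhere. The point is only that the wrapper keeps the calling convention existing code already uses and does nothing but forward.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accelerated backend with its own value-based calling
   convention. A CPU loop stands in for the GPU path in this sketch. */
static int accel_saxpy(int n, float alpha, const float *x, float *y)
{
    if (n < 0 || x == NULL || y == NULL)
        return -1;                      /* report bad arguments */
    for (int i = 0; i < n; ++i)
        y[i] += alpha * x[i];           /* y = alpha * x + y */
    return 0;
}

/* Bridge-style wrapper (hypothetical name). It keeps the Fortran-flavored
   signature that existing call sites use, with every scalar passed by
   pointer, and simply forwards to the backend. */
int legacy_saxpy_(const int *n, const float *alpha,
                  const float *x, float *y, int *info)
{
    assert(n != NULL && alpha != NULL && info != NULL);
    *info = accel_saxpy(*n, *alpha, x, y);
    return *info;
}
```

An existing call such as `legacy_saxpy_(&n, &alpha, x, y, &info)` compiles unchanged against this wrapper, which is the property a source-compatible layer needs.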

Another great feature of the Bridge interface is that it provides a fallback to CPU execution when a user does not have a GPU or when the problem size is too small to benefit from GPU execution. In this way you can maintain a single codebase yet get GPU acceleration whenever a user has a supported GPU. You can even tune when to switch between CPU and GPU for your exact problems.
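The fallback decision described above can be sketched as a small dispatch function. This is an illustration under assumptions, not the Bridge interface's real logic or API: `dispatch_config`, `min_gpu_dim`, and `choose_backend` are hypothetical names, and a single dimension threshold is the simplest possible tuning knob.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { BACKEND_CPU, BACKEND_GPU } backend_t;

/* Hypothetical settings; the names are illustrative only. */
typedef struct {
    int    gpu_available;  /* nonzero if a supported GPU was detected */
    size_t min_gpu_dim;    /* smallest matrix dimension worth sending to the GPU */
} dispatch_config;

/* Pick where a problem of dimension n should run. */
backend_t choose_backend(const dispatch_config *cfg, size_t n)
{
    assert(cfg != NULL);
    if (!cfg->gpu_available)
        return BACKEND_CPU;   /* no GPU: always fall back to the CPU */
    if (n < cfg->min_gpu_dim)
        return BACKEND_CPU;   /* too small to amortize GPU overheads */
    return BACKEND_GPU;
}
```

Raising or lowering `min_gpu_dim` is the kind of tuning the paragraph refers to: the right crossover depends on the routine and on the particular CPU and GPU involved.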

We put a lot of effort into making CULA as easy to use as possible, and our Bridge interface is one of the ways we do that. Try out the bridge example in our SDK and see how easy it is to get GPU-accelerated linear algebra with almost no effort at all.

We are often asked exactly how we produce our publicly displayed numbers. This is a fair question and one that is always good to see asked. Benchmarking is hard to do fairly, and even more so when the benchmarks are going to be used for promotion or compared against other similar benchmarks. As developers of GPU libraries, we're constantly on the lookout for benchmarks related to programming GPUs, and we're surprised by the errors that many programmers make when benchmarking their code. Today I'm going to talk about the benchmarking policies we've put into place for CULA and discuss how we avoid the pitfalls in benchmarking a GPU program.

When we designed our benchmarking policies for CULA we wanted our numbers to be unimpeachable, so we decided to show our best performance, but only in practical circumstances and only compared to high-quality competitors. Although we compare CULA against many different packages, we choose to publish benchmarks only against the most highly tuned ones. Believe me, it feels great to see 100x performance out of CULA against packages like a single-threaded LAPACK implementation, but it would be unfair to lean on these when better implementations exist. In many other fields this would be a perfectly adequate comparison. For example, our customers sometimes bring us custom, untuned code, and we get performance numbers like 100x on a regular basis. But it's a different story in the linear algebra field, where the algorithms are well known and highly modular, and where there are incredibly well-tuned multi-core implementations out there.

After choosing the packages you want to compare against, there is a second decision to make, which is how to set up the problem. Too often GPU code is benchmarked in a loop of a hundred iterations, which may not be how a kernel is called in the real world and hides the true cost of using the code. Or worse, a user omits a cudaThreadSynchronize before marking the end of the test. It will probably look like they're getting great performance, but they're not counting the execution fairly, because kernel launches are asynchronous and may not have completed yet. And did they check for errors? That takes time too! It also seems that in almost every case we've encountered, memory allocation time is not counted, even though allocation is necessary for the code to function (the rationale, of course, being that the hundreds of loop iterations amortize this cost to zero, but it's hard to be certain without benchmarking). As a library vendor, we want our users to have the best experience possible when using our code, so we always synchronize and check for errors at the library boundary. That lets us do as much with these errors as possible, rather than pushing the checking and error handling onto our users. Because we do this (and we feel that any good code should), we choose to count all of these "overheads" in the times we report in our benchmarks.
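As a rough illustration of counting these overheads, the sketch below times a single call two ways: end to end, with the clock started before allocation, and compute only. It is not our benchmark harness. `time_one_call`, `scale_vector`, and `timing_t` are hypothetical names, the workload is a plain CPU loop so the sketch runs without a GPU, and it assumes a POSIX system with `clock_gettime`. In real GPU code, the synchronization and error check would sit where the comment indicates, before the final clock read.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

typedef struct {
    double total_s;    /* allocation + setup + compute + result check */
    double compute_s;  /* compute and result check only */
    int    status;     /* 0 on success, negative on failure */
} timing_t;

/* Wall-clock time in seconds from a monotonic clock. */
static double now_s(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Stand-in workload (hypothetical): scale a vector on the CPU. */
static int scale_vector(float *v, size_t n, float a)
{
    if (v == NULL)
        return -1;
    for (size_t i = 0; i < n; ++i)
        v[i] *= a;
    return 0;
}

/* Time one call the way a user would experience it. */
timing_t time_one_call(size_t n)
{
    timing_t t = {0.0, 0.0, -1};
    assert(n > 0);

    double t0 = now_s();                /* clock starts BEFORE allocation */
    float *v = malloc(n * sizeof *v);
    if (v == NULL)
        return t;
    for (size_t i = 0; i < n; ++i)
        v[i] = 1.0f;

    double t1 = now_s();                /* setup finished, compute begins */
    t.status = scale_vector(v, n, 2.0f);
    /* A GPU version would synchronize and check the error status here,
       inside the timed region, before reading the clock again. */
    if (t.status == 0 && v[n - 1] != 2.0f)
        t.status = -2;                  /* result check is timed too */
    double t2 = now_s();

    free(v);
    t.total_s = t2 - t0;
    t.compute_s = t2 - t1;
    return t;
}
```

Reporting `total_s` rather than `compute_s` is the conservative choice: it can only make the measured time longer, never shorter.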

Beyond actually timing your code, there is also the concern of fairness in selecting which results to present. It can be easy to cherry-pick your best results, giving a false impression of how well your code really runs. To be as fair as possible, our method is to select the tests we want to present before running any tests. For each routine, we take the parameters and job combinations we feel best represent a common, "real world" usage pattern. Only after this choice do we run our tests and present our findings. So you can trust that there is no funny business involved, such as "I'm getting 1.2x on average, but my code runs great for matrices sized precisely 1040x1040 so I'll use that."

What's really great is that while we tend to present conservative numbers for our own code, our users are under no such constraints. We're always happy when we see someone with a top-of-the-line GPU and an aging CPU benchmark our code and get a speedup of 10-15x! For obvious reasons we can't market ourselves with that kind of testing, but it's always good to get a reminder that in real-world usage everyone has a different setup and a different set of needs. Although we release benchmarks on a top-of-the-line machine, as a cross-OS, cross-GPU-generation, cross-CPU, cross-language library vendor, we try to get the best performance for all of our users, and you can be sure that CULA will perform great no matter what you're using.

Since yesterday's post went up, we've been asked what our buildfarm looks like. We currently have 7 dedicated machines, one each for Windows XP 32-bit and 64-bit, Ubuntu 32-bit and 64-bit, RHEL 32-bit and 64-bit, and Mac OS X 10.5. We also use a variety of GPUs in our farm so that we can test CULA's performance across a wide range of devices, including GeForce, Quadro, and Tesla cards from different hardware generations.

When you multiply over 50,000 tests by 7 machines, you get more than 350,000 test runs, which is just staggering.