CULA - GPU Accelerated Linear Algebra
http://www.culatools.com

Kepler has Arrived
Mon, 02 Apr 2012 | John | http://www.culatools.com/blog/2012/04/02/kepler-has-arrived/

The big news from NVIDIA last week was the release of the first Kepler card, the GeForce GTX 680. This card features a radical expansion of cores, from 512 to 1536 (3x!), although each core is clocked slower than in previous generations. Since only a GeForce part is available so far, this isn't a compute-oriented release, but CUDA still runs nicely on gaming parts in single precision. We have our GTX 680 in-house and have started working with it, and we hope to post some performance results in the near future.

The natural question from our users is "so when will CULA support Kepler?" As per our normal release cadence, we will release a Kepler-enabled CULA as soon as possible after the supporting version of CUDA goes final. Note that this doesn't include any RC versions of CUDA which traditionally become available prior to the final release.

The only downside we see to the new chip is that the double precision performance is quite low (as is traditional for gaming chips), but the single precision numbers are exciting, and many of our users do their work primarily in single precision. It's been some time since we got a new chip, so we're diving in, tuning up our solvers, and seeing what kind of results we can get! We look forward to future blog posts where we detail the performance of this new generation.

Recently, while testing the latest nightly CULA build on our buildfarm, we noticed we were running into a "CUDA out of GPU memory" error after a large number of tests on certain GPUs. This was a new occurrence, and it was happening at test sizes that should have easily fit within the memory of the GPU. Our first thought was that we had accidentally introduced a memory leak into the system. However, calling cudaMemGetInfo() showed that well over 98% of our GPU memory was still available. Clearly, more investigation was needed...

After some more testing, we quickly discovered that NVIDIA's CUBLAS helper routine cublasDestroy_v2() was leaking a few KB of memory with every call. We promptly submitted a bug report to NVIDIA, which was confirmed as an issue and is slated for a fix in the next CUDA release - phew. However, that didn't explain why we were getting an "out of memory" error, since it would take literally millions of calls to the leaky NVIDIA routine to exhaust the memory of a Tesla C2070 with 6 GB.

After some more debugging, we determined we were running into the somewhat rare problem known as memory fragmentation, where the GPU cannot find a contiguous region for a requested allocation despite having a large amount of free memory overall.

Due to the way one of our tests was structured, we'd create a context, allocate a large chunk of memory, create another context, and then allocate another large chunk of memory. Upon freeing these resources, the NVIDIA leak left the contexts floating around, one of which was now located in the middle of our memory space. After repeating this process a number of times, we ended up with a handful of small zombie contexts scattered throughout the memory space, and it eventually became impossible to allocate a large contiguous chunk of memory.

As a workaround, we alleviated the problem by allocating the contexts when there was nothing else in the memory space. The leaked contexts then sat at the edge of memory rather than scattered through the middle of it.

This just goes to show that memory leaks of any size can be incredibly dangerous! I'd also like to note that CULA users should be unaffected by this NVIDIA bug if they are using the interface properly. In the rare event that you are running for multiple days and initializing the library millions of times, a call to cudaDeviceReset() will free the memory leaked by the bug.

[Permalink: http://www.culatools.com/blog/2012/03/12/3099/]

CUDA and Fortran
Fri, 02 Mar 2012 | John | http://www.culatools.com/blog/2012/03/02/cuda-and-fortran/

Let's start by saying that CUDA and Fortran aren't the best of friends out of the box. CUDA is a C library without true Fortran support, and Fortran isn't naturally attuned to C's value semantics. Since our users want to use our CULA Device interface routines to avoid transfers between the host and the GPU, they also need to be able to allocate device memory. In our experience, the best and easiest way is the Portland Group's Fortran compiler with its CUDA Fortran language extensions. These make CUDA a first-class citizen, so calling CULA's Device interface is quite simple. Keep an eye on upcoming issues of the Portland Group's newsletter, where we will be revising our old article about CULA + PGI integration.

Now, for those without the PGI compiler, the answer is the ISO_C_BINDING method of module writing, which allows Fortran to call into C code using C types for pointers and with value semantics. Most newer Fortran compilers support this, and as of CULA R15 a cula_lapack_device module that takes advantage of this mode will be available. That said, CUDA does not publish an official module for ISO_C_BINDING integration, so you will need to write your own. Here are some sample definitions that can easily be adapted to produce definitions for the CUDA routines you need.
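For illustration, a minimal module along those lines might look like the following. This is a sketch covering only cudaMalloc, cudaFree, and cudaMemcpy; check the argument kinds and the cudaMemcpyKind values against your CUDA headers before relying on it:

```fortran
! Sample ISO_C_BINDING interfaces for a few CUDA runtime routines.
! Note: cudaMalloc takes void**, so the pointer is passed by reference
! (no VALUE attribute); the byte counts are passed by value.
module cuda_runtime_minimal
  use iso_c_binding
  implicit none

  interface
     integer(c_int) function cudaMalloc(ptr, bytes) bind(c, name="cudaMalloc")
       use iso_c_binding
       type(c_ptr) :: ptr                      ! receives the device pointer
       integer(c_size_t), value :: bytes
     end function cudaMalloc

     integer(c_int) function cudaFree(ptr) bind(c, name="cudaFree")
       use iso_c_binding
       type(c_ptr), value :: ptr
     end function cudaFree

     integer(c_int) function cudaMemcpy(dst, src, bytes, kind) &
          bind(c, name="cudaMemcpy")
       use iso_c_binding
       type(c_ptr), value :: dst
       type(c_ptr), value :: src
       integer(c_size_t), value :: bytes
       integer(c_int), value :: kind           ! 1 = host-to-device, 2 = device-to-host
     end function cudaMemcpy
  end interface
end module cuda_runtime_minimal
```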

With these examples, you can integrate your CUDA and Fortran codes much more easily. PGI is still our preferred method, but this approach works well with the Intel and GNU Fortran compilers. The upcoming CULA R15 release will include the modules that let you use the CULA Device interface with this programming style.

The pCULA routines found within the CULA Dense library utilize multiple GPUs and CPUs within a single system to increase both performance and maximum problem size compared to the standard CULA Dense routines. This is achieved by using algorithms that distribute linear algebra problems across the GPUs and CPUs of a system.

IMPORTANT! Please note that the pCULA routines are in an alpha state and should not be used in any production code. They are merely a preview demonstrating a subset of the multi-GPU routines to be included in a future release of CULA Dense. It is expected that performance, accuracy, hardware requirements, and interfaces will change between the alpha and final releases.

While pCULA is still in an alpha state, the basic functionality will not change much between now and the final release. We aim to provide an interface that is simple to use, yet customizable for users who need fine-grained control over multiple devices. For example, the following code snippet shows all that is needed to call a pCULA function; the only added step is creating and initializing the control structure.
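The calling pattern looks like this (mirroring the example in the R14 release announcement later in this feed; the header name and the m/n/A/lda arguments are stand-ins for your own matrix setup):

```c
#include <cula.h>   /* CULA Dense header; the name may differ by version */

culaStatus multi_gpu_cholesky(int m, int n, culaDouble* A, int lda)
{
    culaInitialize();

    pculaConfig config;
    pculaConfigInit(&config);  /* defaults: use every CUDA device and let
                                  the routine choose its own parameters */

    /* The control structure is the only extra argument versus the
       single-GPU routine. */
    culaStatus status = pculaPotrf(&config, m, n, A, lda);

    culaShutdown();
    return status;
}
```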

The performance of pCULA is designed to scale well on multi-GPU systems. The following chart shows the performance of a double precision Cholesky factorization (pPOTRF) when using an additional GPU.

As the pCULA routines move toward a final release, you can expect more functions, better performance, and additional features! If you have any questions, comments, or concerns about the pCULA routines, please visit our forums.

[Permalink: http://www.culatools.com/blog/2012/02/10/pcula-multi-gpu-support/]

CULA Dense R14 and Sparse S2 - Now Supporting CUDA 4.1
Tue, 31 Jan 2012 | John | http://www.culatools.com/blog/2012/01/31/cula-dense-r14-and-sparse-s2-now-supporting-cuda-4-1/

We're pleased to announce the release of our latest CULA Dense and Sparse versions, with full compatibility for CUDA 4.1. A major highlight of R14 is a preview of multi-GPU LAPACK routines, called the pCULA branch of CULA Dense. Again, this is a preview designed to show potential performance, with an interface that will likely continue to evolve over time. The new multi-GPU routines are:
pculaGetrf (LU decomposition)
pculaGetrs (LU solve)
pculaGesv (general system solve via LU)
pculaPotrf (Cholesky decomposition)
pculaPotrs (Cholesky solve)
pculaPosv (Hermitian/symmetric positive-definite system solve)
pculaTrsm (BLAS triangular system solve)
pculaGemm (BLAS general matrix multiply)

An upcoming blog post will contain more on the usage and expectations of these routines, but a simple example is quite easy to create:
culaInitialize();

pculaConfig config;
pculaConfigInit(&config);
// some users may wish to tweak the default options here;
// the default is to use all CUDA devices and to let the routine
// select the parameters it feels are best

culaStatus status = pculaPotrf(&config, m, n, A, lda);

As always, alongside the new features come bug fixes and speed and stability improvements. The full release notes for R14 and S2 are available at the dense downloads page and the sparse downloads page, respectively.

Debugging with CULA Sparse
Fri, 13 Jan 2012 | Dan | http://www.culatools.com/blog/2012/01/13/debugging-with-cula-sparse/
CULA Sparse offers a unique debugging feature. When enabled, it performs extra checks on your matrix. Our recommended use case is to enable debugging mode when getting started with the library or when you run into a problem. Once you have fixed any issues you encounter (if you encounter none, good for you!), you can switch debugging mode off to make sure you are running at full performance.

Currently, one of the most important things that debugging mode enables is a check to ensure that your matrix is well-formed. In a previous post, I discussed sparse matrix formats. CULA Sparse, being flexible, provides an indexing parameter for you to specify whether your data is one- or zero-based. It is a very common error, however, for users not to specify their index base or matrix data correctly when they use the library. Debugging mode helps here because it can identify a mismatch between the actual matrix data and the specified indexing.

In future revisions of CULA Sparse, there is an opportunity to introduce even more options, such as a check that helps steer you toward a good solver. For example, BiCG is intended only for symmetric matrices; if you use it with a non-symmetric matrix, you are likely to get poor performance. In a future release, we may check for this case and report when you are using a solver incorrectly.

We think that developer-oriented and ease-of-use features are just as important as performance, although of course we provide that in spades. If you haven't tried CULA Sparse yet, try out the demo and see how our combination of performance and ease-of-use works for you!

Not enough HPC programmers. How to fill the gap?
Tue, 10 Jan 2012 | Liana | http://www.culatools.com/blog/2012/01/10/not-enough-hpc-programmers-how-to-fill-the-gap/

Engineers with top-notch parallel programming experience are in high demand in the U.S. This fact was recently pointed out in stories published by the mainstream Daily Beast as well as HPC Wire. A quote from Stan Ahalt in the Daily Beast story caught my attention: "It's not enough to keep building powerful supercomputers unless we have the brains. Think of a supercomputer as a very fast racing engine. We need more drivers to use those engines." Stan is the director of a supercomputing center at the University of North Carolina at Chapel Hill.

Programming supercomputers is hard work. Those who program large HPC systems go through in-depth training and spend months (sometimes years) fine-tuning their algorithms until they fully leverage the massive computing power these machines offer. There is a growing number of tools and libraries for HPC programmers, but they are not necessarily suitable for all levels of computer engineers. For non-experts, programming even small- to mid-scale systems can be a challenging and time-consuming task, something we hear quite often from our customers and partners.

Where EM Photonics Can Make a Difference

Companies with recently installed small- to mid-scale supercomputing systems often need help porting their applications to their new machines. This is where we bring tremendous value: we are easy to engage with and offer an in-depth understanding of parallel architectures. On top of parallel programming expertise, we bring knowledge and experience in physics-based modeling and simulation, image processing, life sciences, finance, and military and defense applications. (Typically, the bigger the problem, the greater the fun!)

We encourage you to take a peek at our EM Photonics site to learn more about our consulting services, as well as our current research projects and published papers. We have a team of talented engineers looking forward to tackling new challenges. Just let us know how we can help!

Reordering to increase parallelism
Mon, 19 Dec 2011 | Kyle | http://www.culatools.com/blog/2011/12/19/2957/

Solving a dense triangular system (such as one produced by an LU factorization) is a serial process: each row must be solved before moving on to the next. When solving a sparse triangular system, however, it is possible to group the unknowns into levels, where an entire level can be solved in parallel before moving on to the next. In the worst case, there are n levels and the entire process must be serialized. In the best case, there is a single level and the entire process can be parallelized (this happens only for a diagonal matrix).

As one might expect, decreasing the number of levels in a sparse matrix increases parallelism and (potentially) decreases the time it takes to solve the sparse triangular system. In many cases this can be achieved by reordering the sparse matrix. A reordering simply swaps rows and/or columns to alter the structure of the matrix while preserving its values. Reordering is often performed to increase accuracy or to reduce fill-in in direct methods; in CULA Sparse, however, we reorder the matrix to increase parallelism. A number of different reordering schemes exist (such as minimum degree and approximate minimum degree), but they are beyond the scope of this blog post.

This image illustrates how matrix reordering is applied to a sparse matrix. In this case, a large circuit simulation problem (AMD/G3_circuit) is reordered using the symmetric minimum degree reordering method.

One of the options for CULA Sparse's ILU0 preconditioner is reordering. This will (potentially) reduce the number of levels in the lower and upper triangular factors produced by the factorization, increasing parallelism. Since applying the ILU0 preconditioner through two triangular solves is typically a massive bottleneck, any speedup here directly decreases the total time needed to converge on a solution.

When applying matrix reordering to a real-world matrix such as the circuit simulation problem introduced above, we can decrease the number of levels from 2594 to 15. This cuts the time to solve the triangular systems from 24.2 ms to 8.5 ms, a 2.8x speedup! When using the reordered ILU0 preconditioner with the conjugate gradient solver, the total time per iteration drops from 53.4 ms to 22.1 ms, a 2.4x speedup!

In conclusion, CULA Sparse's ILU0 reordering option (more info) can drastically reduce the time it takes to apply the triangular factors produced by the factorization. However, one must also consider that the reordering step has a steep up-front computational cost. Additionally, since the structure of the matrix changes, some of the iterative methods may take a different number of iterations to converge on a solution.

Batched Operations
Fri, 09 Dec 2011 | John | http://www.culatools.com/blog/2011/12/09/batched-operations/

Readers of our forum will know that we often receive questions about the batched case for dense linear algebra operations. These are cases where a user has many very small matrices that can be solved simultaneously. Some areas where we have seen this are image processing (an inverse or SVD, either per-pixel or in a stencil region) and fluid dynamics (solving a 5x5 matrix at each node).

It's worth starting with a statement regarding CPU performance for these problems. Considering that each problem is O(N³), a 5x5 inverse requires only a hundred or so FLOPs, which puts the solution time for a single matrix in the microsecond range. Even initiating work on the GPU costs an order of magnitude or two more than this, which means there must be a significant number of small operations before the GPU can make a noticeable difference in overall performance.

In a similar vein, the GPU wants tens of thousands of simultaneous threads to reach peak performance. For a 5x5 matrix, the theoretical maximum number of usable threads is one per matrix element, or 25 in this case. That alone would require over 400 simultaneous matrices for good performance. In practice, the number of threads used per matrix is more like 1 to 5, meaning many thousands of simultaneous matrices are needed.

At this point, we can illustrate why CUDA streams aren't the proper solution to this particular problem. For background, CUDA streams allow the programmer to specify that two launched kernels are independent (in terms of the data they require), so the card is free to execute them simultaneously. One common use is to eliminate the so-called "tail effect": as the first kernel finishes, some of the hardware sits idle waiting for the rest to complete, and streaming allows the next kernel to begin using that idle hardware before the first kernel has fully finished. A second use is to allow two or more kernels to occupy the card simultaneously, which helps when neither is using all of the hardware. The batching case we are describing certainly falls into the latter category.

One could, in theory, use one CUDA stream per problem and launch one problem at a time. This would perform poorly for two reasons. First, the number of threads per block would be far too low; we typically want no fewer than 32, and we have already shown why that is impractical for a single small matrix. Second, the overhead of launching thousands of operations in this manner would be unacceptable, because the launch cost alone is as expensive (if not more so) than just solving the matrix on the CPU. The realistic approach would be to collect problems and group them into thread blocks, but again, the time to form batches from the streams would exceed the cost of just performing the operation.

In response to this problem, NVIDIA's CUBLAS has introduced a feature called Batch Mode in the newest developer versions; at present it is available for the matrix multiply routine. Batch Mode addresses the problem by allowing the user to request the solution of a group of identical problems at once, albeit on different data. This is a SIMD operation, just at a slightly higher level than we normally associate with that term. Our experience is that this interface is competitive with the CPU for a specific set of circumstances (indeed, the circumstances for which it was intended): very small problems (N < 32) and very large batch sizes (batch counts over 1000).
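A sketch of the batched interface is below, using cublasSgemmBatched from the CUDA 4.1 developer releases. The key idea is that the arguments are arrays of device pointers, one entry per matrix in the batch; error checking is omitted for brevity:

```c
#include <cublas_v2.h>

/* Multiply `count` independent n-by-n matrix pairs in one call.
 * dAarray/dBarray/dCarray are arrays of device pointers, one per
 * matrix; the pointer arrays themselves must live in device memory. */
void batched_gemm(cublasHandle_t handle, int n, int count,
                  const float** dAarray, const float** dBarray,
                  float** dCarray)
{
    const float alpha = 1.0f, beta = 0.0f;
    /* One launch covers the whole batch, so the GPU stays occupied
     * even though each individual problem is tiny. */
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n, n, n,            /* m, n, k: square matrices */
                       &alpha, dAarray, n,
                       dBarray, n,
                       &beta, dCarray, n,
                       count);
}
```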

As for CULA, we have considered this approach but found that batch sizes on the order of thousands are required to gain competitive performance versus the CPU, which is why such solvers are not generally available in the package. We hope some day to find a solution that performs well on the GPU across a wide range of matrix sizes and batch sizes, but for now we pursue this work only on a case-by-case basis, tuned to a user's exact needs. For more information, please see our contact page.

Introducing the CULA Sparse Demo
Tue, 29 Nov 2011 | Dan | http://www.culatools.com/blog/2011/11/29/introducing-the-cula-sparse-demo/

We are very pleased to announce that we have recently released a free demo for CULA Sparse. The demo is a standalone, command-line program with which you can choose your options and see the performance of a particular routine. All solvers and most features provided by CULA Sparse are supported.

For example, to run the demo with the CG solver and a Jacobi preconditioner, you can use the command below. The demo accepts matrices in the Matrix Market format (.mtx); for information on this format, see the resources provided by this NIST site.
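An invocation would look something like the following. This is a hypothetical command line: the demo's actual executable name, flag spellings, and file name are stand-ins, so check the demo's own help output for the real ones:

```shell
# Hypothetical: solve a Matrix Market system with CG + Jacobi
./culasparse_demo --solver cg --preconditioner jacobi matrix.mtx
```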

The CULA Sparse demo is powerful because it allows you to easily try out several different solvers, preconditioners, and other features without writing or building any software. And once you've found the combination of inputs that is ideal for you, you can easily carry that knowledge into your own CULA Sparse implementation.