Today we announced a reseller partnership with Massachusetts-based Microway, a major hardware vendor in the HPC market.

Designed from the ground up for ultimate customer satisfaction, the WhisperStation™-Tesla Personal Supercomputer integrates NVIDIA Tesla Fermi C2075 GPUs.

A true value-added vendor, Microway will help us make CULA even more accessible to developers, scientists, and researchers.

To get our partnership off to a great start, we are offering a copy of CULA R12 free of charge with the purchase of Microway's WhisperStation GPU desktop machine.

Entirely customized, Microway's WhisperStation machines are available with up to four NVIDIA Tesla GPUs, one or two Intel Xeon CPUs, up to 96 GB of memory, and up to four 3 TB disk drives. They can be ordered directly from Microway, but keep in mind that this offer ends on December 31st and is valid only in the U.S. and Canada.

This is what Stephen Fried, CTO of Microway, had to say about it:

“GPU-accelerated WhisperStations and clusters integrated with CULA keep Microway’s customers on the leading edge of performance. Customers can achieve significant speedups simply by replacing their existing linear algebra functionality with CULA, whether it is currently provided by a library or not. We are very pleased to offer our customers CULAtools because it has such broad applicability in fields as diverse as finite element analysis, computational fluid dynamics, life sciences, and financial analysis.”

We certainly hope users will take advantage of this opportunity and join our CULA user community for the long haul.

With the release of the CULA Sparse beta, we thought it would be useful to present an introduction to sparse matrix formats. Traditionally, a matrix is considered sparse when the number of non-zero elements is significantly less than the number of zero elements. When represented in a program, sparse matrices, unlike the dense matrices used in CULA’s LAPACK functions, are a logical, compressed representation of a matrix. Whereas a dense matrix represents all elements of a matrix, including zeroes, a sparse matrix will represent only non-zero elements. An immediate benefit of this approach is that algorithmic speed improvements can be made by disregarding the zero elements for many operations. An arguably more important benefit (and a focus of this article) is that a representation that stores only non-zero elements allows the total memory used by a sparse matrix to be significantly less than it would be if it were stored densely.

Consider an 8x8 matrix, shown to the right. In this matrix, only 12 of the 64 entries (about 19%) are populated. If we were to adopt a compressed sparse storage format for this matrix, we could reduce the storage by more than 60%, from 512 bytes down to 192.
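The byte counts above can be reproduced with a little arithmetic. The sketch below assumes 8-byte double-precision values and 4-byte integer indices, with one row index and one column index stored per non-zero. These sizes are assumptions chosen because they match the 512 and 192 byte figures, and the function names are purely illustrative.

```python
# Assumed element sizes (not stated in the article): doubles and 32-bit ints.
VALUE_BYTES = 8
INDEX_BYTES = 4

def dense_bytes(rows, cols):
    """Bytes needed to store every element, zeros included."""
    return rows * cols * VALUE_BYTES

def coordinate_bytes(nnz):
    """Bytes for one value plus a row index and a column index per non-zero."""
    return nnz * (VALUE_BYTES + 2 * INDEX_BYTES)

dense = dense_bytes(8, 8)        # 64 elements * 8 bytes
sparse = coordinate_bytes(12)    # 12 non-zeros * 16 bytes
saving = 1 - sparse / dense

print(dense, sparse, f"{saving:.1%}")  # prints: 512 192 62.5%
```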

The simplest compressed format, coordinate (COO), represents a matrix by its non-zero values and the row and column index at which each non-zero is located. For instance, in the example matrix, the value 3.0 is located at (2,2) using 1-based indexing. These indices do add to the storage cost for the matrix, but because the number of non-zeros is small, there is a net gain when compared with a dense representation. The full representation of this matrix in COO is the following:

There are several other popular sparse matrix formats in addition to coordinate format. Although coordinate format is the easiest to understand and implement, it is not always the preferred format, because other formats, such as compressed sparse row (CSR), can increase the compression at the cost of a little more work. In fact, CSR is the preferred format for CULA Sparse, because of its size advantage and amenability to GPU acceleration.
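To make the two layouts concrete, here is a minimal sketch in plain Python that builds COO arrays from a dense matrix and then converts them to CSR. The 4x4 matrix is hypothetical (it is not the 8x8 matrix discussed above), the function names are illustrative, and nothing here reflects the CULA Sparse API. Indices are 1-based to match the convention used in the text.

```python
def dense_to_coo(dense, base=1):
    """Return (values, row_idx, col_idx) for the non-zeros of a dense matrix."""
    values, rows, cols = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                rows.append(i + base)
                cols.append(j + base)
    return values, rows, cols

def coo_to_csr(values, rows, cols, n_rows, base=1):
    """Convert COO triplets to CSR arrays (values, col_idx, row_ptr)."""
    # Sort entries by (row, column) so each row's entries are contiguous.
    order = sorted(range(len(values)), key=lambda k: (rows[k], cols[k]))
    csr_values = [values[k] for k in order]
    csr_cols = [cols[k] for k in order]
    # Count the entries in each row, then prefix-sum the counts into pointers.
    row_ptr = [0] * (n_rows + 1)
    for k in order:
        row_ptr[rows[k] - base + 1] += 1
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]
    return csr_values, csr_cols, [p + base for p in row_ptr]

# Hypothetical example matrix with 6 non-zeros.
A = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
    [5.0, 0.0, 0.0, 6.0],
]

values, rows, cols = dense_to_coo(A)
csr_values, csr_cols, row_ptr = coo_to_csr(values, rows, cols, n_rows=len(A))

print(values)   # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(rows)     # [1, 1, 2, 3, 4, 4]
print(cols)     # [1, 4, 2, 3, 1, 4]
print(row_ptr)  # [1, 3, 4, 5, 7]
```

CSR keeps the values and column indices but replaces the per-entry row index array with a row pointer array of n_rows + 1 entries. Row i occupies positions row_ptr[i] through row_ptr[i+1] - 1, so the saving grows as the number of non-zeros per row grows.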

In the above example, fewer than one in five entries were non-zero, but in some problem domains it is common for the fraction of non-zeros to be much smaller. Fewer non-zeros mean lower storage requirements, which means that we can fit larger and larger problems within the memory that is available to us. For example, whereas dense matrices that can be solved with CULA on a typical workstation max out at about 16K by 16K, matrices in CULA Sparse can be as large as 100M by 100M, depending on their sparsity.
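A rough back-of-the-envelope calculation shows why the size limits differ so much. The sketch below assumes 8-byte values, 4-byte indices, CSR storage, and a hypothetical average of 3 non-zeros per row (as in a tridiagonal-like system). All of these are illustrative assumptions rather than figures from CULA Sparse.

```python
VALUE_BYTES = 8   # assumed double precision
INDEX_BYTES = 4   # assumed 32-bit indices

def dense_matrix_bytes(n):
    """Storage for a dense n-by-n matrix."""
    return n * n * VALUE_BYTES

def csr_matrix_bytes(n, nnz):
    """Storage for CSR: values, column indices, and n + 1 row pointers."""
    return nnz * (VALUE_BYTES + INDEX_BYTES) + (n + 1) * INDEX_BYTES

n_dense = 16 * 1024                  # "16K by 16K"
n_sparse = 100_000_000               # "100M by 100M"
nnz = 3 * n_sparse                   # hypothetical: about 3 non-zeros per row

print(dense_matrix_bytes(n_dense) / 2**30)      # 2.0 (GiB)
print(csr_matrix_bytes(n_sparse, nnz) / 1e9)    # about 4.0 (GB)
print(dense_matrix_bytes(n_sparse) / 1e15)      # 80.0 (PB) if stored densely
```

Under these assumptions, the 100M by 100M sparse matrix needs roughly twice the memory of the 16K by 16K dense one, while storing it densely would be out of the question.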

Today we are putting a spotlight on Mark van Heeswijk, a postgraduate researcher at Aalto University in Finland.

Van Heeswijk’s main research interests are high-performance computing and machine learning, in particular how techniques and hardware from high-performance computing can be applied to the challenges of machine learning. His current work consists of training multiple neural networks, each on its own GPU. The models used in this work are a type of feedforward neural network called the Extreme Learning Machine (ELM).

How CULA has helped

“Using the CULA library, the training and model structure selection of the models can be accelerated. The training can be expressed in terms of CULA operations (with some trick to avoid needing the matrix inverse which is not part of the CULA basic package). The parallelization over multiple GPUs is achieved by combining mex-wrapped CULA with the MATLAB Parallel Computing Toolbox, and binding each of the MATLAB workers to its own GPU. Specifically, the (culaGesv) and (culaGels) functions were used, and wrappers around these functions were written, such that they can be used from MATLAB in the training and model structure selection of the ELM. This way all CULA operations within that worker will operate on that GPU.”
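For readers unfamiliar with ELMs, the following is a minimal CPU-only sketch of the general idea, not the researcher's code. An ELM fixes random hidden-layer weights and obtains the output weights by solving a linear least-squares problem. Here that problem is posed as regularized normal equations and handed to a small Gaussian elimination routine, which stands in for a gesv-style linear solver, so no explicit matrix inverse is formed. Whether this matches the "trick" mentioned in the quote is an assumption. All function names, the tanh activation, the ridge term, and the weight ranges are illustrative choices.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def hidden(x, W, bias):
    """Hidden-layer outputs for a scalar input, plus a constant term."""
    return [math.tanh(w * x + b) for w, b in zip(W, bias)] + [1.0]

def train_elm(xs, ts, n_hidden=10, ridge=1e-6, seed=0):
    rng = random.Random(seed)
    W = [rng.uniform(-2, 2) for _ in range(n_hidden)]     # random, never trained
    bias = [rng.uniform(-2, 2) for _ in range(n_hidden)]
    H = [hidden(x, W, bias) for x in xs]
    m = n_hidden + 1
    # Normal equations: (H^T H + ridge * I) beta = H^T t
    HtH = [[sum(h[i] * h[j] for h in H) + (ridge if i == j else 0.0)
            for j in range(m)] for i in range(m)]
    Htt = [sum(h[i] * t for h, t in zip(H, ts)) for i in range(m)]
    beta = solve(HtH, Htt)                                # output weights
    return W, bias, beta

def predict(x, model):
    W, bias, beta = model
    return sum(b * h for b, h in zip(beta, hidden(x, W, bias)))

# Toy regression problem: fit sin(x) on [0, 6).
xs = [i * 0.1 for i in range(60)]
ts = [math.sin(x) for x in xs]
model = train_elm(xs, ts)
mse = sum((predict(x, model) - t) ** 2 for x, t in zip(xs, ts)) / len(xs)
print("training MSE:", mse)
```

Because only the final linear solve depends on the targets, most of the training cost sits in dense linear algebra, which is the kind of work a GPU library can take over.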

The paper illustrates the effect of both types of parallelization on the total running time of the algorithm. You will find the abstract and link for the paper in our Research Papers section.