cuBLAS-XT – Accelerate BLAS calls with multiple GPUs!

cuBLAS-XT is a set of routines that accelerate Level 3 BLAS (Basic Linear Algebra Subprograms) calls by spreading work across more than one GPU. Using a streaming design, cuBLAS-XT automatically manages transfers across the PCI-Express bus, so input and output data can stay in the host’s system memory. This provides out-of-core operation: the size of operand data is limited only by system memory size, not by GPU on-board memory size.
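The out-of-core idea can be illustrated with a small CPU-only sketch of a tiled matrix multiply. This is not cuBLAS-XT's implementation, and the names `tiled_matmul` and `tile_bytes` are hypothetical; it only shows, under those assumptions, how processing one block at a time keeps the working set bounded by the tile size rather than by the full matrix size.

```python
def tiled_matmul(a, b, tile):
    """Compute a @ b for non-empty lists of lists, one tile at a time.

    Each (i0, j0, k0) step touches at most a tile x tile block of A, B and C,
    which is the piece a streaming scheme would need resident on a device.
    """
    if tile <= 0:
        raise ValueError("tile must be positive")
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")

    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):              # tile rows of C
        for j0 in range(0, n, tile):          # tile columns of C
            for k0 in range(0, k, tile):      # accumulate over the inner dimension
                for i in range(i0, min(i0 + tile, m)):
                    for p in range(k0, min(k0 + tile, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i][j] += a_ip * b[p][j]
    return c


def tile_bytes(tile, itemsize=8):
    """Bytes needed to hold one tile each of A, B and C (8-byte doubles by default)."""
    return 3 * tile * tile * itemsize
```

For example, `tile_bytes(4096)` is 402,653,184 bytes (384 MiB) no matter how large the full matrices are, which is why the operands themselves can live in host memory.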

Starting with CUDA 6.0, a free version of cuBLAS-XT is included in the CUDA toolkit as part of the cuBLAS library. The free version supports operation on single GPUs and on dual-GPU cards such as the Tesla K10 or GeForce GTX 690.

The premier version of cuBLAS-XT supports scaling across multiple GPUs connected to the same motherboard, with near-perfect scaling as more GPUs are added. A single system with four Tesla K40 GPUs can achieve over 4.5 TFLOPS of double-precision performance!

NVBLAS

NVBLAS is a CPU BLAS implementation that automatically accelerates eligible BLAS calls via cuBLAS-XT, and is included with the CUDA toolkit. All versions of cuBLAS-XT work with NVBLAS.
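NVBLAS reads its settings from a configuration file, normally named nvblas.conf. The fragment below is a minimal sketch of such a file; the library path and log file name are assumptions chosen for illustration and need to match the local system.

```
# nvblas.conf (illustrative values)

# CPU BLAS library used for calls NVBLAS does not offload (required).
# The path is an assumption; point it at the BLAS installed on your system.
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so

# GPUs that NVBLAS may use: a list of device IDs, or ALL.
NVBLAS_GPU_LIST ALL

# Where NVBLAS writes its log (file name is an assumption).
NVBLAS_LOGFILE nvblas.log
```

On Linux, an existing application that is dynamically linked against a CPU BLAS can typically pick up NVBLAS without recompiling by preloading the library, for example `LD_PRELOAD=libnvblas.so ./myapp`, where `./myapp` stands in for your own program.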