Bio-Sequence Database Scanning on a GPU

Weiguo Liu, Bertil Schmidt, Gerrit Voss, Andre Schroder, and Wolfgang Muller-Wittig
Nanyang Technological University, School of Computer Engineering, Centre for Advanced Media Technology, Singapore

Abstract

Protein sequences with unknown functionality are often compared to a set of known sequences to detect functional similarities. Efficient dynamic programming algorithms exist for this problem; however, current solutions still require significant scan times. These scan time requirements are likely to become even more severe due to the rapid growth in the size of these databases. In this paper, we present a new approach to bio-sequence database scanning that uses computer graphics hardware to gain high performance at low cost. To derive an efficient mapping onto this type of architecture, we have reformulated the Smith-Waterman dynamic programming algorithm in terms of computer graphics primitives. Our OpenGL implementation achieves a speedup of approximately sixteen on a high-end graphics card over available straightforward and optimized CPU Smith-Waterman implementations.

1. Introduction

Scanning protein sequence databases is a common and often repeated task in molecular biology. The need for speeding up these searches comes from the rapid growth of the bio-sequence banks: every year their size grows by a substantial factor. The scan operation consists of finding similarities between a particular query sequence and all sequences of a bank. This allows biologists to identify sequences sharing common subsequences, which from a biological viewpoint have similar functionality. Comparison algorithms whose complexities are quadratic with respect to the length of the sequences detect similarities between the query sequence and a subject sequence. One frequently used approach to speed up this time-consuming operation is to introduce heuristics into the search algorithm. The drawback is that the more efficient the heuristics, the worse the quality of the results.
Another approach to get high quality results in a short time is to use high performance computing. In this paper, we investigate how commodity computer graphics hardware can be used as a computational platform to accelerate database scanning. The fast increasing power of the GPU (Graphics Processing Unit) and its streaming architecture opens up a range of new possibilities for a variety of applications. With the enhanced programmability of commodity GPUs, these chips are now capable of performing more than the specific graphics computations they were originally designed for. Recent work on GPGPU (general-purpose computation on GPUs) shows the design and implementation of algorithms for non-graphics applications. Examples include scientific computing [], computational geometry [], image processing [], and bioinformatics [, ]. The evolution of GPUs is driven by the computer game market. This leads to a relatively small price per unit and to very rapid development of next generations. Currently, the peak performance of high-end GPUs such as the GeForce GTX is approximately ten times that of comparable CPUs. Further, GPU performance has been increasing by a factor of two to two-and-a-half per year (see Figure 1). This growth rate is faster than Moore's law as it applies to CPUs, which corresponds to about one-and-a-half times a year []. Consequently, GPUs will become an even more attractive alternative for high performance computing in the near future. However, there are still a number of challenges to be solved in order to enable scientists other than computer graphics specialists to make efficient use of these
resources within their research area. The biggest challenge in solving a specific problem on GPUs is reformulating the proposed algorithms and data structures in terms of computer graphics primitives (e.g. triangles, textures, vertices, fragments). Furthermore, restrictions of the underlying streaming architecture have to be taken into account: random access writes to memory are not supported, and no cross-fragment data or persistent state is possible (e.g. all internal registers are flushed before a new fragment is processed). In this paper we show how bio-sequence database scanning based on the Smith-Waterman dynamic programming algorithm can benefit from this type of computing power.

Figure 1. Peak performance comparison (measured on the multiply-add instruction) of NVIDIA and ATI GPUs and Intel Pentium CPUs (single-core except where marked) in recent years. Figure taken from Owens et al. [].

The rest of this paper is organized as follows. In Section 2, we introduce the basic sequence alignment algorithm. Section 3 highlights previous work on parallelization of this algorithm on different parallel architectures. Important features of GPU architectures are described in Section 4. Section 5 presents our mapping of the algorithm onto the GPU architecture. A performance evaluation is given in Section 6. Finally, Section 7 concludes the paper with an outlook on further research topics.

2. Smith-Waterman Algorithm

Surprising relationships have been discovered between protein sequences that have little overall similarity but in which similar subsequences can be found. In that sense, the identification of similar subsequences is probably the most useful and practical method for comparing two sequences. The Smith-Waterman algorithm [] finds the most similar subsequences of two sequences (the local alignment) by dynamic programming (DP).
The algorithm compares two sequences by computing a distance that represents the minimal cost of transforming one segment into another. Two elementary operations are used: substitution and insertion/deletion (also called a gap operation). Through a series of such elementary operations, any segment can be transformed into any other segment. The smallest number of operations required to change one segment into another can be taken as the measure of the distance between the segments.

Consider two strings S1 and S2 of lengths l1 and l2. To identify common subsequences, the Smith-Waterman algorithm computes the similarity H(i, j) of two sequences ending at positions i and j of the sequences S1 and S2. The computation of H(i, j), for 1 ≤ i ≤ l1, 1 ≤ j ≤ l2, is given by the following recurrences:

H(i, j) = max{0, E(i, j), F(i, j), H(i−1, j−1) + sbt(S1[i], S2[j])}
E(i, j) = max{H(i, j−1) − α, E(i, j−1) − β}
F(i, j) = max{H(i−1, j) − α, F(i−1, j) − β}

where sbt is a character substitution cost table. Initialization of these values is given by H(i, 0) = E(i, 0) = H(0, j) = F(0, j) = 0 for 0 ≤ i ≤ l1, 0 ≤ j ≤ l2. Multiple gap costs are taken into account as follows: α is the cost of the first gap and β is the cost of each following gap. This type of gap cost is known as an affine gap penalty. Some applications also use a linear gap penalty, i.e. α = β. For linear gap penalties the above recurrence relations can be simplified to:

H(i, j) = max{0, H(i, j−1) − α, H(i−1, j) − α, H(i−1, j−1) + sbt(S1[i], S2[j])}

Each position of the matrix H is a similarity value. The two segments of S1 and S2 producing this value can be determined by a trace-back procedure. Figure 2 illustrates an example.

3. Related Work

A number of parallel architectures have been developed for bio-sequence database scanning with the Smith-Waterman algorithm. In addition to architectures specifically designed for sequence analysis, existing programmable sequential and parallel architectures
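As a concrete illustration, the affine-gap recurrences above translate directly into code. The following Python sketch is illustrative only (the paper's implementation runs as fragment programs on the GPU); the function and parameter names, the default gap costs, and the simple match/mismatch substitution function are our own, not taken from the paper.

```python
# Illustrative CPU version of the affine-gap recurrences (alpha = cost
# of the first gap, beta = cost of each following gap). H, E, F are
# initialised to 0 on the borders, as in the text.

def smith_waterman_score(s1, s2, sbt, alpha=2, beta=1):
    """Return the best local-alignment score of s1 and s2."""
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    E = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    F = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            E[i][j] = max(H[i][j - 1] - alpha, E[i][j - 1] - beta)
            F[i][j] = max(H[i - 1][j] - alpha, F[i - 1][j] - beta)
            H[i][j] = max(0, E[i][j], F[i][j],
                          H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]))
            best = max(best, H[i][j])   # track the highest score in H
    return best

# Simple substitution function: reward identical characters, penalise
# mismatches (values chosen purely for illustration).
match2 = lambda a, b: 2 if a == b else -1
```

Because only the maximum of H is returned, this corresponds to the score-only computation later used for database scanning.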
Figure 2. Example of the Smith-Waterman algorithm computing the local alignment between two DNA sequences. The matrix H(i, j) is shown for a linear gap cost α, with a positive substitution cost for identical characters and a negative cost otherwise. From the highest score in the matrix, a trace-back procedure delivers the corresponding alignment, i.e. the two subsequences.

have been used for solving sequence alignment problems. Special-purpose architectures can provide the fastest means of running a particular algorithm with very high processing element (PE) density. Each PE is specifically designed for the computation of one cell of the DP matrix (see Figure 2). However, such architectures are limited to one single algorithm, and thus cannot supply the flexibility necessary to run the variety of algorithms required for analyzing DNA, RNA, and proteins. P-NAC was the first such machine and computed edit distance over a four-character alphabet []. More recent examples, better tuned to the needs of computational biology, include BioScan [], BISP [], and SAMBA []. An approach presented in [] is based on instruction systolic arrays (ISAs). ISAs combine the speed and simplicity of systolic arrays with flexible programmability. Several other approaches are based on the SIMD concept, e.g. MGAP [], Kestrel [], and Fuzion []. SIMD and ISA architectures are programmable and can be used for a wider range of applications, such as image processing and scientific computing. Since these architectures contain more general-purpose parallel processors, their PE density is lower than that of special-purpose ASICs. Nevertheless, SIMD solutions can still achieve significant runtime savings. However, the costs involved in designing and producing SIMD architectures are quite high. As a consequence, none of the above solutions has a successor generation, making upgrading impossible. Reconfigurable systems are based on programmable logic such as field-programmable gate arrays (FPGAs).
They are generally slower and have lower PE densities than special-purpose architectures, e.g. [, ]. They are flexible, but the configuration must be changed for each algorithm, which is generally more complicated than writing new code for a programmable architecture. Solutions based on FPGAs have the additional advantage that they can be regularly upgraded to state-of-the-art technology. All these approaches can be seen as accelerators, an approach satisfying the demand for a low-cost solution to compute-intensive problems. The main advantage of GPUs compared to the architectures mentioned above is that they are commodity components. In particular, most users already have access to PCs with modern graphics cards. For these users this direction provides a zero-cost solution. Even if a graphics card has to be bought, the installation of such a card is trivial (plug and play). Writing the software for such a card still requires specialist knowledge, but high-level languages such as Cg [] offer a simplified programming environment.

4. GPU Architecture

Figure 3. Illustration of the GPU graphics pipeline: 3D vertices are transformed by the vertex processor into 2D vertices, the rasterizer converts them into fragments, and the fragment processor computes final or temporary pixels using textures read from texture memory, writing to the frame buffer (with render-to-texture feedback).

Computation on a GPU follows a fixed order of processing stages, called the graphics pipeline (see Figure 3). The pipeline consists of three stages: vertex processing, rasterization, and fragment processing. The
vertex processing stage transforms three-dimensional vertex world coordinates into two-dimensional vertex screen coordinates. The rasterizer then converts the geometric vertex representation into an image fragment representation. Finally, the fragment processor computes a color for each pixel by reading texels (texture pixels) from the texture memory. Modern GPUs support programmability of the vertex and fragment processors. Fragment programs, for instance, can be used to implement any mathematical operation on one or more input vectors (textures or fragments) to compute the color of a pixel.

In order to meet the ever increasing performance requirements set by the gaming industry, modern GPUs use two types of parallelism. Firstly, multiple processors work on the vertex and fragment processing stages, i.e. they operate on different vertices and fragments in parallel. For example, a typical mid-range graphics card such as the Nvidia GeForce contains several vertex processors and fragment processors. Secondly, operations on 4-dimensional vectors (the four channels Red/Green/Blue/Alpha (RGBA)) are natively supported without performance loss.

Figure 4. Streaming model that applies a chain of kernels (filters) to an input stream and writes to an output stream.

Several authors have described modern GPUs as streaming processors, e.g. []. Streaming processors read an input stream, apply kernels (filters) to the stream, and write the results into an output stream. In the case of several kernels, the output stream of the leading kernel is the input stream of the following kernel (see Figure 4). The vast majority of general-purpose GPU applications use only fragment programs for their computation. In this case textures are considered as input streams and texture buffers as output streams. Because fragment processors are SIMD architectures, only one program can be loaded at a time. Applying several kernels thus means performing several passes. A typical GPGPU program is structured as follows [, ]:
1. Data-parallel sections of the application are identified by the programmer. Each such section can be considered a kernel and is implemented as a fragment program. The input and output of each kernel is one or more data arrays, which are stored in textures in GPU memory.

2. To invoke a kernel, the range of the computation (or the size of the output stream) must be specified. The programmer does this by passing vertices to the GPU. A typical GPGPU invocation is a quadrilateral (quad) oriented parallel to the image plane, sized to cover a rectangular region of pixels matching the desired size of the output array.

3. The rasterizer generates a fragment for every pixel location in the quad, producing thousands to millions of fragments.

4. Each of the generated fragments is then processed by the active kernel fragment program. The fragment program can read from arbitrary texture memory locations but can only write to memory locations corresponding to the location of the fragment in the frame buffer.

5. The output of the fragment program is a value (or vector of values) per fragment. This output may be the final result of the application, or it may be stored as a texture and then used in subsequent passes. This feedback loop is realized by using the output buffer of a completed pass as input texture for the following one (known as render-to-texture (RTT)).

5. Mapping Onto the GPU Architecture

In this section we describe how the Smith-Waterman algorithm can be efficiently mapped onto a GPU. We take advantage of the fact that all elements in the same anti-diagonal of the DP matrix can be computed independently of each other in parallel (see Figure 5). Thus, the basic idea is to compute the DP matrix in anti-diagonal order. The anti-diagonals are stored as textures in the texture memory. Fragment programs are then used to implement the arithmetic operations specified by the recurrence relation. Assume we are aligning two sequences of lengths M and K with affine gap penalties on a GPU.
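The five-step multi-pass structure above can be mimicked on the CPU with a toy model, which is useful for checking one's understanding of the data flow. Here a "texture" is a 2D array and a "kernel" is a per-fragment function that may read any input location but writes only to its own output location; the names and the blur kernel are hypothetical, not part of the paper's implementation.

```python
# Toy CPU model of one GPGPU rendering pass: the rasterizer visits every
# pixel of a quad covering the output array, and the kernel may read any
# location of the input texture but writes only its own fragment.

def render_pass(kernel, width, height, in_tex):
    """Rasterize a width x height quad, running the kernel per fragment."""
    return [[kernel(x, y, in_tex) for x in range(width)]
            for y in range(height)]

def blur(x, y, tex):
    """Hypothetical kernel: average a fragment with its left neighbour."""
    left = tex[y][x - 1] if x > 0 else tex[y][x]
    return (tex[y][x] + left) / 2.0

# Render-to-texture feedback: the output of pass 1 is the input of pass 2.
tex = [[0.0, 4.0, 8.0]]
tex = render_pass(blur, 3, 1, tex)   # pass 1
tex = render_pass(blur, 3, 1, tex)   # pass 2
```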
As a preprocessing step, both sequences and the substitution matrix are loaded into the texture memory. We then compute the DP matrix in M + K − 1 rendering passes. In rendering pass k, 2 ≤ k ≤ M + K, the values H(i, j), E(i, j), and F(i, j) for all i, j with 1 ≤ i ≤ M, 1 ≤ j ≤ K and k = i + j are computed by the fragment processors. The new anti-diagonal is
stored in the texture memory as a texture. The subsequent rendering pass then reads the two previous anti-diagonals from this memory.

Figure 5. Data dependency relationship in the Smith-Waterman DP matrix: each cell (i, j) depends on its left neighbor (i, j−1), upper neighbor (i−1, j) and upper-left neighbor (i−1, j−1). Therefore all cells along anti-diagonal k can be computed in parallel from the anti-diagonals k−1 and k−2.

Since diagonal k depends on the diagonals k−1 and k−2, we store these three diagonals as separate buffers. We use a cyclic method to change the buffer functions as follows: diagonals k−1 and k−2 serve as texture inputs and diagonal k is the framebuffer. In the subsequent iteration, k−1 becomes k−2, k becomes k−1, and k−2 becomes k. This is further illustrated in Figure 6. An arrow pointing towards the fragment program means that the buffer is used as a texture. An arrow pointing from the fragment program to a buffer means that the buffer is used as a framebuffer.

Figure 6. Cyclic change of the functions of the three buffers for the computation of anti-diagonals in the DP matrix, shown for iterations k, k+1, and k+2.

Another concern is the way each anti-diagonal of the DP matrix is mapped into a quad. Drawing a diagonal quad that covers the diagonal in the buffer does not yield the desired result, because all pixels touched by the quad are rendered (unwanted cells) instead of only the cells on the diagonal. In our implementation, the cells of each diagonal are therefore brought into columns by shifting each row of the matrix to the right by its row number i (see Figure 7).

Figure 7. Matrix with shifted rows (row i is offset by i).

With this method, a 1 × L(k) quad can be rendered in each iteration, where L(k) denotes the length of diagonal k.
Furthermore, it is possible to perform N pairwise comparisons at the same time by using two-dimensional buffers instead of one-dimensional buffers. This is shown in Figure 8, in which the buffers are filled from the bottom up. Each buffer contains N diagonals, where L(k) denotes the length of diagonal k. The computation is invoked by drawing an N × L(k) quad in each iteration.
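The anti-diagonal scheme with cyclically swapped buffers can likewise be prototyped on the CPU. The sketch below uses linear gap penalties to stay short (the affine version adds E and F buffers handled in the same pattern); following the shifted-rows layout, cell (i, j) of diagonal k = i + j is stored at buffer index i. All names and cost values are illustrative, not the paper's.

```python
# CPU prototype of the anti-diagonal computation with three cyclically
# swapped buffers (linear gap penalty alpha for brevity). One iteration
# of the outer loop corresponds to one rendering pass.

def sw_antidiagonal(s1, s2, sbt, alpha=1):
    l1, l2 = len(s1), len(s2)
    d_km2 = [0] * (l1 + 1)   # diagonal k-2 (border cells stay 0)
    d_km1 = [0] * (l1 + 1)   # diagonal k-1
    best = 0
    for k in range(2, l1 + l2 + 1):              # one "pass" per diagonal
        d_k = [0] * (l1 + 1)
        for i in range(max(1, k - l2), min(l1, k - 1) + 1):
            j = k - i
            d_k[i] = max(0,
                         d_km1[i] - alpha,       # H(i, j-1), left neighbour
                         d_km1[i - 1] - alpha,   # H(i-1, j), upper neighbour
                         d_km2[i - 1] + sbt(s1[i - 1], s2[j - 1]))  # H(i-1, j-1)
            best = max(best, d_k[i])
        d_km2, d_km1 = d_km1, d_k                # cyclic change of buffer roles
    return best

match2 = lambda a, b: 2 if a == b else -1        # illustrative substitution costs
```

Note that every cell of diagonal k is computed from the two buffered diagonals alone, which is exactly the property that lets the GPU evaluate a whole diagonal in one pass.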
Figure 8. The three buffers at iteration step k after rendering N diagonals in parallel (in different fragment processors).

Figure 9. Our GPU implementation of the Smith-Waterman algorithm with multiple passes:

Initialization
  API and window check
  hardware shaders
  textures
  buffers
  render targets
Smith-Waterman Alignment
  activate fragment program
  Passes Loop (for each diagonal)
    determine quad coordinates
    determine texture mappings
    bind buffers D(k-1) and D(k-2) as textures
    set render target to buffer D(k)
    set uniform variables
    draw vertices and texture coordinates
    release texture buffers
  End Passes Loop
  deactivate fragment program
  read results from framebuffer
End Alignment

Figure 9 gives an overview of our GPU implementation. Before entering the loop that handles the passes to be rendered, the fragment program is activated. The following loop is executed for each diagonal. First, the quad and its texture coordinates are determined and the two buffers representing the diagonals D(k−1) and D(k−2) are bound as textures. The render target is set to the buffer that represents diagonal D(k) in this pass. Subsequently, all necessary uniform variables are set, the quad is drawn, and its texture mappings are set. When the rendering has finished, the buffers are released from their texture bindings and the loop restarts. This is done until all diagonals are completed. After exiting the loop, the fragment program is deactivated and the result is read from the framebuffer. Furthermore, the results are stored and statistical information is generated.

Note that the purpose of our Smith-Waterman GPU implementation is the acceleration of sequence database scanning. This application requires the alignment of a query sequence to all subject sequences of a given database. All subject sequences are ranked by the maximum score in the DP matrix.
Reconstruction of the actual alignment (the trace-back) is performed only for the highest-ranked sequences. Therefore, only the highest score of each pairwise alignment is computed on the GPU. Ranking the compared sequences and reconstructing the alignments are carried out by the front-end PC. Because this last operation is only performed for very few subject sequences, its computation time is negligible.

6. Performance Evaluation

We have implemented the proposed algorithm using the high-level GPU programming language GLSL (OpenGL Shading Language) [] and evaluated it on two Nvidia graphics cards: a GeForce GTO and a high-end GeForce GTX. Tests have been conducted with these cards installed in a PC with an Intel Pentium 4 processor running Windows XP.

A performance measure commonly used in computational biology is cell updates per second (CUPS). A CUPS represents the time for a complete computation of one entry of the matrix H, including all comparisons, additions and maxima computations. We have scanned the Swiss-Prot protein databank for query sequences of various lengths using our implementation on the GeForce GTO and the GeForce GTX. Both cards can handle query sequences only up to a maximum length, a restriction imposed by the maximum texture buffer size of these graphics cards. However, this limitation is not severe, since the vast majority of the sequences in the Swiss-Prot database are shorter than this limit. Furthermore, it is expected that the texture
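The division of labor described above (a score-only alignment for every subject on the GPU, ranking and trace-back for the top hits on the host) can be sketched as follows. Here sw_score stands in for the GPU stage, the trace-back is omitted, and all names and scoring parameters are illustrative rather than taken from the paper.

```python
# Host-side sketch of database scanning: a score-only Smith-Waterman
# (standing in for the GPU stage) is applied to every subject sequence,
# and subjects are ranked by their maximum DP-matrix score. Only the
# top-ranked hits would then be passed to the CPU trace-back.

def sw_score(q, s, match=2, mismatch=-1, gap=1):
    """Minimal linear-gap Smith-Waterman score (illustrative)."""
    H = [[0] * (len(s) + 1) for _ in range(len(q) + 1)]
    best = 0
    for i in range(1, len(q) + 1):
        for j in range(1, len(s) + 1):
            diag = H[i - 1][j - 1] + (match if q[i - 1] == s[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] - gap, H[i][j - 1] - gap)
            best = max(best, H[i][j])
    return best

def scan_database(query, database, top=3):
    """Rank all subject sequences by local-alignment score, best first."""
    ranked = sorted(database, key=lambda s: sw_score(query, s), reverse=True)
    return [(s, sw_score(query, s)) for s in ranked[:top]]
```

Because only a single score per subject crosses back from the scoring stage, the host's ranking and trace-back work is negligible compared to the scan itself.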
buffer sizes will increase in next-generation graphics hardware.

Table 1. Runtime comparison for scanning the Swiss-Prot database for query sequences of various lengths. The query sequences are identified by their Swiss-Prot accession numbers.

Query sequence length | OSEARCH (sec) | SSEARCH (sec) | GeForce GTO (sec) | GeForce GTX (sec)

We have also compared the performance of our GPU implementation with a widely used CPU program for database scanning, FASTA []. FASTA stands for FAST-All, reflecting the fact that it can be used for a fast protein comparison or a fast nucleotide comparison between a query sequence and a large database of known sequences. OSEARCH and SSEARCH [] are two Smith-Waterman implementations in the FASTA package. OSEARCH is a straightforward Smith-Waterman implementation. SSEARCH [] is an optimized implementation of the Smith-Waterman algorithm; however, OSEARCH is more sensitive and accurate. We have run OSEARCH and SSEARCH on an Intel Pentium 4 CPU running Red Hat Linux. Table 1 reports the runtimes of our GPU implementation, OSEARCH, and SSEARCH for different query sequence lengths. Figure 10 compares the corresponding MCUPS performance values. As can be seen, our GPU implementation achieves significant speedups compared to both OSEARCH and SSEARCH while producing the same accuracy as OSEARCH, which has higher quality than SSEARCH.

7. Conclusions and Future Work

In this paper we have demonstrated that the streaming architecture of GPUs can be efficiently used for biological sequence database scanning. To derive an efficient mapping onto this type of architecture, we have reformulated the Smith-Waterman algorithm in terms of computer graphics primitives. The evaluation of our implementation on a high-end graphics card shows a speedup of almost sixteen compared to a Pentium IV.
Figure 10. Performance comparison (in MCUPS) for scanning the Swiss-Prot database: OSEARCH and SSEARCH on a Pentium 4 versus our GPU implementation on a GeForce GTO and a GeForce GTX, for query sequences of various lengths.

To our knowledge this is the first reported implementation of the Smith-Waterman algorithm on graphics hardware. The very rapid growth of genomic databases demands even more powerful parallel solutions in the future. Because the comparison and alignment algorithms favored by biologists are not fixed, programmable parallel solutions are required to speed up these tasks. GPUs provide such a flexible, commodity alternative to inflexible special-purpose architectures.