Abstract

We show a method for parallelizing top-down dynamic programs in a straightforward way by a careful choice of a lock-free shared hash table implementation and randomization of the order in which the dynamic program computes its subproblems. This generic approach is applied to dynamic programs for knapsack, shortest paths, and RNA structure alignment, as well as to a state-of-the-art solution for minimizing the maximum number of open stacks. Experimental results are provided on three different modern multicore architectures which show that this parallelization is effective and reasonably scalable. In particular, we obtain a speedup of over 10 times with 32 threads on the open stacks problem.

Introduction

Dynamic programming [2] is a powerful technique for solving any optimization problem for which an optimal solution can be efficiently computed from optimal solutions to its subproblems. The idea is to avoid recomputing the optimal solution to these subproblems by reusing previously computed values. Thus, for dynamic programming to be useful, the same subproblems must be encountered often enough while solving the original problem.
Dynamic programming can be easily implemented using either a “bottom-up” or “top-down” approach. In the “bottom-up” approach, the solution to every single subproblem is computed and stored in the dynamic programming matrix, starting from the smallest subproblems until the solution to the entire problem is finally computed. This approach is particularly simple to implement; it requires no recursion and no data structure more sophisticated than an array. It is also efficient if (a) the problem is small enough for the entire matrix to be stored in memory, and (b) the computation of unnecessary cells does not introduce too much overhead. The classic bioinformatics sequence alignment algorithms of Needleman and Wunsch [22] and Smith and Waterman [30] are generally implemented in this way, for example.
In contrast, the “top-down” approach starts from the function call to compute the solution to the original problem, and uses recursion to only compute the solution to those subproblems that are actually encountered when solving the original problem. Previously computed values are reused by applying a technique called memoization. In this technique each computed value is stored in an associative array (implemented, for example, by a hash table). Then, the recursive function tests if the value it is called for has been previously computed (and therefore exists in the associative array) and, if so, simply reuses the value rather than recomputing it. This approach to implementing dynamic programming avoids the computation of unnecessary values and is particularly effective when combined with branch-and-bound techniques to further reduce unnecessary computations [24].
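The top-down style described above can be sketched with the 0/1 knapsack problem, one of the dynamic programs the paper applies its technique to. The code below is an illustrative sketch of our own (names and key-packing scheme are assumptions, not the paper's implementation): the recursive function first consults an associative array, and only recurses when the subproblem has not been solved before.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch of top-down dynamic programming with memoization for 0/1 knapsack.
// Subproblem (i, c): best value achievable with items i..n-1 and capacity c.
struct Knapsack {
    std::vector<int> weight, value;
    std::unordered_map<uint64_t, int> memo;  // the associative array for memoization

    // Pack the subproblem identifier (i, c) into a single 64-bit key.
    uint64_t key(int i, int c) const { return (uint64_t(i) << 32) | uint32_t(c); }

    int solve(int i, int c) {
        if (i == (int)weight.size() || c == 0) return 0;
        auto it = memo.find(key(i, c));
        if (it != memo.end()) return it->second;      // reuse a stored result
        int best = solve(i + 1, c);                   // branch 1: skip item i
        if (weight[i] <= c)                           // branch 2: take item i
            best = std::max(best, value[i] + solve(i + 1, c - weight[i]));
        memo.emplace(key(i, c), best);                // store for later reuse
        return best;
    }
};
```

Only subproblems actually reached by the recursion are ever stored, which is what makes this style viable when filling the whole matrix bottom-up would be infeasible.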
Previous efforts at parallelizing dynamic programming have focused on the “bottom-up” style dynamic programming matrix, by computing in parallel cells known to have no data dependencies. For example, the Smith–Waterman algorithm has been accelerated by the parallel computation of cells in the matrix that can be computed independently by the use of SIMD vector instructions [36], [25] and [6], special-purpose hardware [23], general-purpose graphics processing units (GPGPUs) [14] and [16], or other parallel processors such as the Cell Broadband Engine [35]. More generally, Tan et al. [31] describe a parallel pipelined algorithm to exploit fine-grained parallelism in dynamic programs, and apply it to Zuker’s algorithm [40] and [15] for predicting RNA secondary structure. Subsequently, Xia et al. [37] implemented their own specific parallelization of the Zuker algorithm on FPGA hardware. Chowdhury and Ramachandran [5] describe tiling sequences (recursive decompositions) for several classes of dynamic programs for cache-efficient implementation on multicore architectures.
All these techniques require careful analysis of each particular algorithm to find the data dependencies in the dynamic programming matrix, resulting in a parallelization that is specific to each individual problem. Furthermore, they only work on the “bottom-up” approach and, therefore, can only be applied to problems for which computing every cell is feasible.
In this paper we describe a general technique for parallelizing dynamic programs in modern multicore processor architectures with shared memory. The contributions of our paper are:
• a generic approach to parallelizing “top-down” dynamic programming by using
  – a lock-free hash table for the memoization, where each thread computes the entire problem but shares results through the hash table;
  – randomization of the order in which the dynamic program computes its subproblems, to encourage divergence of the thread computations so that fewer subproblems are computed by more than one thread simultaneously;
• an effective algorithm for a lock-free hash table supporting only insertions and lookups; and
• experimental results showing that this approach can produce substantial speedups on a variety of dynamic programs.
The remainder of the paper is organized as follows. In the next section we describe our approach to the parallelization of top-down dynamic programs. In Section 3 we define our hash table implementations, and show their effectiveness in the case where the ratio of inserts to lookups is quite high. In Section 4 we give the results of experiments on four different dynamic programs on three different architectures, illustrating the effectiveness of the parallelization. Finally, in Section 5 we conclude.
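The randomized subproblem ordering from the contributions list can also be made concrete. The sketch below is our own construction under assumptions (sequential, with an ordinary map standing in for the shared table): in the knapsack recursion, each call picks at random which of its two branches to explore first, so threads seeded differently tend to diverge and duplicate fewer subproblems.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

// Sketch of randomizing the order in which a top-down knapsack explores
// its branches. In the parallel setting each thread would hold its own
// differently seeded generator and share `memo` through the lock-free table.
struct RandomizedKnapsack {
    std::vector<int> weight, value;
    std::unordered_map<uint64_t, int> memo;  // stands in for the shared table
    std::mt19937 rng;

    int solve(int i, int c) {
        if (i == (int)weight.size() || c == 0) return 0;
        uint64_t k = (uint64_t(i) << 32) | uint32_t(c);
        auto it = memo.find(k);
        if (it != memo.end()) return it->second;
        int skip = -1, take = -1;
        // Randomly choose which branch to recurse on first.
        bool takeFirst = (rng() & 1) && weight[i] <= c;
        if (takeFirst) take = value[i] + solve(i + 1, c - weight[i]);
        skip = solve(i + 1, c);
        if (!takeFirst && weight[i] <= c)
            take = value[i] + solve(i + 1, c - weight[i]);
        int best = std::max(skip, take);      // both branches still explored
        memo.emplace(k, best);
        return best;
    }
};
```

Every call still explores both branches, so the computed optimum is unchanged; only the visiting order, and hence which thread reaches a given subproblem first, is randomized.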

Conclusion

We have described a technique for parallelizing dynamic programs on shared memory multiprocessors, and demonstrated its application to dynamic programming formulations of the well-known knapsack and shortest paths problems, as well as the bioinformatics problem of RNA structural alignment, and the problem of minimizing the maximum number of open stacks. Our technique is applicable to any dynamic program, since it operates on the top-down (i.e., recursive) implementation of the dynamic program, which is a direct implementation of the recurrence relation (Bellman equation) expression of the problem. This is in contrast to previous work on parallelizing dynamic programs, which focuses on vectorizing the operations in filling in the dynamic programming matrix in the bottom-up technique.
Much greater speedups (orders of magnitude) can be achieved for specific dynamic programming problems by careful analysis of the problem structure and properties of optimal solutions, in order to apply, for example, bounding techniques. Although the parallelization technique we have described results in much more modest speedups, it can be applied immediately to any dynamic program, without the need for further analysis.
For dynamic programs that are too large to implement in the bottom-up manner (filling in every entry of a dynamic programming matrix), such as open stacks, vectorization approaches are inapplicable and a method, such as the one presented here, that is applicable to the top-down implementation is required. We have shown a speedup greater than 10 times (for 32 threads) by applying this method to a state-of-the-art dynamic programming algorithm for the problem of minimizing the maximum number of open stacks.
For problems that can be practically implemented with the bottom-up technique, such as sequence alignment and RNA structure prediction and alignment, vectorization techniques have been successful. However, these techniques require careful analysis of the data dependencies in the particular problem being parallelized and result in increased complexity of the implementation. Our method, in contrast, can be applied directly to a simple implementation of the recurrence relation defining the problem as a recursive function, without any analysis of the particular problem. In these cases an array can be used to store results, without even the need for a lock-free hash table.
In order to simplify our algorithm and implementation we made the assumption that dynamic resizing of the hash table is unnecessary, and simply allocated a very large hash table at initialization. This is clearly wasteful for dynamic programming problem instances that do not require a large number of entries. A more sophisticated implementation would use a dynamically resizable lock-free hash table, such as that provided by split-ordered lists [28]. Further experiments would have to be carried out to determine if the increase in complexity and use of indirection (resulting in reduced cache efficiency) is a winning trade-off for the advantage of not allocating unnecessarily large amounts of memory.