High-Level Synthesis (HLS)
High-level synthesis tools convert applications written in a high-level language such as C into hardware description languages (HDLs) such as Verilog and VHDL. Some of the advantages of using high-level synthesis are as follows:

It is easier to modify high-level code than HDL code; the application can thus be easily optimized

Functional behavior can be verified through software simulation

Applications developed in a high-level language can be easily ported to FPGAs

A hardware testbench can be generated directly from the software testbench

Development time is greatly reduced

This Blokus example used Impulse C to construct parallel computational blocks, called processes, which communicate through various interfaces. These communication interfaces, provided as library components with the tool, include streams, shared memory, semaphores, and messages.

The compiler also provides features, in the form of compiler pragmas, for fine-grained optimization of the design. This enables the developer to concentrate entirely on coarse-grained optimization such as load balancing among the various processes. Some of the fine-grained optimization features used in this project are as follows:

PIPELINE: This is used to pipeline loops where each iteration takes more than one clock cycle to complete

UNROLL: This is used to compute each iteration of a loop in parallel

FLATTEN: This is used to execute a large block of nested "if-else" statements in a single clock cycle

There are other compiler constructs that exploit the benefits of dual-ported RAM through multiple simultaneous read/write operations wherever possible

Various utilities provided with the development toolkit enable the designer to gauge the efficiency of the hardware generated from the high-level code. One such utility is the "Stage Master Explorer," which reports how the compiler has divided each process into pipeline stages and where the critical delays lie.

Performance analysis
The performance of the current implementation was compared against a software solution executing the same algorithm. Following are the implementation details, performance results, and analysis:

Algorithm: Minimax-based game tree search with alpha-beta pruning

Software solution: Executes on a single core of a 2.00 GHz Intel Xeon processor

When the game tree is searched to a depth of 2, the speedup achieved is around 11X. The speedup varies over the course of the game as the load pattern changes; the 11X figure is among the higher values, reached when the number of moves to search is at its maximum.

When the game tree is searched to a depth of 3, the speedup drops to around 4X. The decrease can be attributed to pruning efficiency: alpha-beta pruning works best when the most recent values of alpha and beta are used, as they are in the sequential software solution. In the distributed FPGA architecture, however, each process communicates only with its parent process, so various processes work with stale values of alpha and beta, leading to less effective pruning and redundant search overhead.

The code for this example is available for qualified researchers (please contact the author as discussed below).

Future work
Further acceleration is possible through algorithmic improvements to the minimax search, including techniques such as aspiration search, iterative deepening, and distributed tree search. Architectural improvements involve exploring the trade-offs between communication and computation.

About the author
Nilim Sarma is a Master's candidate in the Department of Electrical and Computer Engineering at Clemson University. He is part of Dr. Melissa Smith's "Future Computing Technologies" research group. He is currently working on optimizing algorithms for hardware accelerators such as FPGAs and GPGPUs. He has three years of industry experience in embedded systems programming. He can be reached via nsarma@g.clemson.edu.