We continue to develop and debug both Orion, our computational photography DSL, and our DPDA-to-hardware compiler. We are now able to produce DPDA (Stencil Engine assembly) code from Orion, and our hardware compiler can take DPDA code and create hardware. Additionally, the compiler output is functionally correct in both RTL simulation and when running on the FPGA, allowing us to evaluate resource utilization for FPGA and ASIC implementations. We are now improving the compiler and system to reduce resource utilization, and we are implementing features to support new stencil applications.

We have demonstrated the DPDA flow running on an FPGA platform with an attached USB camera and have made progress on integrating the VITA sensor into the flow. With the USB camera, DMA transfers each image first into DRAM and then into the processing engine. The VITA sensor, by contrast, will drive the processing pipeline directly, without touching either the memory or the software driver. We expect to finish this effort within the next two weeks.
The same direct-drive scheme will be used with our final 10MPx sensor upgrade.
We have received the Verilog interface for an Aptina sensor and started integrating it into our platform. This requires code adaptation, creation of a simulation environment and, later, a custom PCB that will host the sensor and interface to the FPGA.
We have also just received an upgraded Zynq ZC706 development board, which is based on a larger chip and is similar in size to the Zynq 7100 we plan to use in the future.
In the current design flow, using constrained placement, we have increased the clock rate from 100 MHz to 150 MHz. The clock rate is currently limited by the DMA block.

Last period we completed and taped out the design of a floating point test-chip. We are still waiting for the chip, but we haven't been given a final delivery date. The original expected delivery was in January, but it might be delayed. Meanwhile, we have decided to use a Leadless Chip Carrier (LCC) for packages, and have contacted several suppliers.

Our mantra in creating hardware is that engineers should create constructors and not instances. We call these hardware-oriented constructors “generators.” To hierarchically create bigger and bigger generators, we use Genesis2, an extension for SystemVerilog that includes a software-like elaboration layer. A generator is very different from high-level synthesis (HLS, e.g. C-to-Verilog flows). In HLS, a C program is synthesized to hardware such that a given set of inputs produces the same output in both the C code and the hardware. In a generator, however, the program is not a functional description of an algorithm but a procedural description of how a piece of hardware needs to be built. Put differently, the inputs to the generator program are architectural parameters, and its output is the hardware description.
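
To make the distinction concrete, the following is a minimal sketch of the generator idea in plain Python. It is not Genesis2 syntax, and the function name generate_adder and its parameters are hypothetical. The point it illustrates is that the program's inputs are architectural choices (bit width, whether the output is registered) and its output is hardware description text, not a computed sum.

```python
def generate_adder(name, width, registered):
    """Emit Verilog text for an adder (hypothetical generator, not Genesis2).

    The arguments are architectural parameters: they decide what hardware
    is built, not what data flows through it.
    """
    lines = [
        f"module {name} (",
        "  input  wire clk,",
        f"  input  wire [{width - 1}:0] a,",
        f"  input  wire [{width - 1}:0] b,",
        f"  output {'reg ' if registered else 'wire'} [{width}:0] sum",
        ");",
    ]
    if registered:
        # Parameter selects a pipelined (registered) implementation.
        lines.append("  always @(posedge clk) sum <= a + b;")
    else:
        # Parameter selects a purely combinational implementation.
        lines.append("  assign sum = a + b;")
    lines.append("endmodule")
    return "\n".join(lines)


if __name__ == "__main__":
    print(generate_adder("add8", width=8, registered=False))
    print(generate_adder("add16_reg", width=16, registered=True))
```

Calling the same generator with different parameters yields different instances, which is the sense in which the engineer writes a constructor rather than an instance.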

In August 2013 we taped out a test chip based on our floating-point generator. As mentioned earlier, we hope to see silicon back from the manufacturer in January, but will not be surprised to see further delays at their end.

This quarter, we continued our project of expanding the FP generator to include division. We added support for underflow and overflow error cases, and we can now compute all the bits before the least significant bit correctly. We have figured out how to get the last bit rounded correctly with the smallest hardware cost and are now implementing that.
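
For reference, the rounding problem can be stated with integer arithmetic. The sketch below is the textbook remainder-based formulation of round-to-nearest-even for the quotient of two normalized significands. It is not the low-cost hardware scheme mentioned above, whose details are not given here, and it ignores signs, exponents, subnormals, and the underflow and overflow cases. The function name divide_significands is hypothetical.

```python
def divide_significands(ma, mb, frac_bits):
    """Divide two normalized significands with round-to-nearest-even.

    ma and mb are integers in [2**frac_bits, 2**(frac_bits + 1)), i.e. the
    fixed-point encoding of values in [1, 2). Returns (q, exp_adjust) where
    q is the rounded significand in the same encoding and exp_adjust is the
    amount to add to the (separately computed) result exponent.
    """
    lo, hi = 1 << frac_bits, 1 << (frac_bits + 1)
    if not (lo <= ma < hi and lo <= mb < hi):
        raise ValueError("significands must be normalized")

    shift, exp_adjust = frac_bits, 0
    if ma < mb:
        # Quotient lies in (1/2, 1): take one extra bit to renormalize.
        shift += 1
        exp_adjust = -1

    # q holds every bit down to the least significant bit, truncated.
    q, r = divmod(ma << shift, mb)

    # The remainder decides the last bit: round up if the discarded part
    # is more than half an LSB, or exactly half with q odd.
    if 2 * r > mb or (2 * r == mb and q & 1):
        q += 1
    if q == hi:
        # Rounding carried out of the significand: renormalize.
        q >>= 1
        exp_adjust += 1
    return q, exp_adjust


if __name__ == "__main__":
    # 1.0 / 1.5 with 3 fraction bits: significand 11/8 with exponent -1.
    print(divide_significands(8, 12, 3))
```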

This quarter we started work on a Linear Algebra Generator, which will use the floating-point generator and will exploit a multi-level memory hierarchy. We expect it to be highly efficient in terms of both computation and memory. We are in the process of building RTL for general matrix-matrix multiplication (GEMM). The plan is to have 4x4 GEMM working by the end of the quarter. After that, we will move on to Cholesky matrix factorization and FFT.
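
As a software illustration of why GEMM pairs naturally with a memory hierarchy, the sketch below computes a matrix product in square tiles, so that each tile of the operands is reused many times once it has been fetched. This is a generic blocked GEMM in plain Python, not the generator's RTL. The function name and the default tile size of 4 (chosen to echo the 4x4 target) are assumptions made for illustration.

```python
def gemm_blocked(a, b, tile=4):
    """Return a @ b for square list-of-lists matrices, computed tile by tile.

    The three outer loops pick a tile of C and the matching tiles of A and B,
    which models staging operands in a small local memory. The three inner
    loops do the multiply-accumulate work on data already in that tile.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    n = 8
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    print(gemm_blocked(a, identity) == a)
```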

We continued to work on the design flow that turns Orion domain-specific language code into Stencil Engine hardware. We submitted a paper for publication that argues for embedding domain-specific information into a hardware design framework, based on our results with several different applications: a prototype image signal processor for a camera, a Canny edge detector, a Harris interest point detector, and a FAST interest point detector. Additionally, we have cleared many of the functional issues related to running these RTL designs on our Zynq-branded FPGA fabric. Some of our effort was spent improving the resource utilization of the FPGA result and the energy/area ratio of the ASIC result. We have also tested and begun optimization of algorithms for image segmentation, optical flow, deconvolution, and a new image signal processing pipeline.
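
For readers less familiar with the term, a stencil computes each output pixel from a fixed-size window of neighboring input pixels. The sketch below applies a 3x3 stencil in plain Python, using a Sobel horizontal-gradient kernel of the kind found in the gradient stage of Canny edge detection. It is neither Orion syntax nor DPDA code, and the function name stencil_3x3 is hypothetical.

```python
# Sobel kernel for the horizontal intensity gradient.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]


def stencil_3x3(image, kernel):
    """Apply a 3x3 stencil to every interior pixel of a 2D list image.

    Border pixels have no complete 3x3 neighborhood, so the output is two
    rows and two columns smaller than the input.
    """
    height, width = len(image), len(image[0])
    out = [[0] * (width - 2) for _ in range(height - 2)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += kernel[dy + 1][dx + 1] * image[y + dy][x + dx]
            out[y - 1][x - 1] = acc
    return out


if __name__ == "__main__":
    # A horizontal ramp has the same gradient everywhere.
    ramp = [[x for x in range(4)] for _ in range(4)]
    print(stencil_3x3(ramp, SOBEL_X))
```

Because every output depends only on a small, fixed neighborhood, the access pattern is known ahead of time, which is the domain-specific information a stencil-oriented hardware flow can exploit.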

We have started work on an initial prototype of the FPGA platform that will implement the Stencil Engine, and we plan to finish it this quarter with all major components working: a VITA sensor that directly drives processing logic on the Zynq FPGA board; results displayed through HDMI on a “viewfinder”; and a 4/3 lens mount. We plan to use this platform to implement several essential algorithms like auto-focus and auto-exposure. We will also work on integrating a new sensor such that, once the code is finished, we can swap out the current sensor's daughter card and switch to the new sensor.

Today an SoC designer inevitably runs into an interface mismatch between the various IP blocks in their system. While many bus standards have been developed to standardize IP interfaces, the diversity of standards causes a new problem: it is unlikely that all blocks in a system will use the same standard. This bus diversity is often caused by bus evolution. New, better buses are developed, but often one needs to use an older IP block that has the previous generation bus interface. This problem is fundamental to SoC design: the designer of a block cannot know about the environment where that block will be used because that environment does not exist when the block is created.

While this problem seems fundamental, the recent design trend towards building flexible hardware, or hardware generators, provides a possible solution. Last quarter, we proposed a parameterized bus interface generator, which can connect the interface signals in each IP block to the current bus abstraction. Since the mapping between the IP block and the system bus is done algorithmically, this mapping will be maintained as both the system bus and the generator evolve to support more complex operations. Thus this parameterized IP interface allows an IP block to connect to the system interconnect in existing systems, while at the same time offering the flexibility to adapt to new interfaces as bus standards evolve over time.

Specifying the interface through a set of architectural parameters comes with many benefits. By leaving bus implementation details flexible, IP designers achieve a high degree of reusability in their blocks. Parameterized interfaces also free system integrators from needing to know low-level details about the different IP interfaces. Instead, integrators can ask for the interface they need.

We spent this quarter building a proof-of-concept system capable of bridging across two interfaces described using our interface parameter set. To limit the complexity of our test system, for each parameter we developed an enumerated list of common implementation options, and require that any interface definition given to our system uses only these enumerated values. We examined several industry bus standards in developing our enumerations to ensure that our system still has the flexibility to represent a wide range of buses.
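
The idea of describing an interface with enumerated parameter values, and deriving the bridge from two such descriptions, can be sketched as follows. Every parameter name, enumeration value, and adapter name here is hypothetical and chosen only for illustration. None of them is taken from our actual parameter set or from any bus standard.

```python
from dataclasses import dataclass
from enum import Enum


class Handshake(Enum):  # hypothetical enumeration
    VALID_READY = "valid_ready"
    REQ_ACK = "req_ack"


class Burst(Enum):  # hypothetical enumeration
    NONE = "none"
    INCREMENTING = "incrementing"


ALLOWED_WIDTHS = (8, 16, 32, 64)  # hypothetical enumerated data widths


@dataclass(frozen=True)
class InterfaceSpec:
    data_width: int
    handshake: Handshake
    burst: Burst

    def __post_init__(self):
        # Only enumerated values are accepted, which keeps the space of
        # interface pairs the bridge generator must handle finite.
        if self.data_width not in ALLOWED_WIDTHS:
            raise ValueError(f"unsupported data width: {self.data_width}")


def bridge_plan(initiator, target):
    """List the adapters needed to connect initiator to target."""
    adapters = []
    if initiator.data_width != target.data_width:
        adapters.append("width_converter")
    if initiator.handshake is not target.handshake:
        adapters.append("handshake_adapter")
    if initiator.burst is not Burst.NONE and target.burst is Burst.NONE:
        adapters.append("burst_splitter")
    return adapters


if __name__ == "__main__":
    fast = InterfaceSpec(64, Handshake.VALID_READY, Burst.INCREMENTING)
    slow = InterfaceSpec(32, Handshake.REQ_ACK, Burst.NONE)
    print(bridge_plan(fast, slow))
```

The integrator supplies only the two specifications. Which adapters are instantiated follows mechanically from where the parameter values differ.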

We used our system to generate bridges among ARM's AMBA standards, including AXI, AHB-Lite, and APB, and IBM’s OPB standard. We also wrote a draft paper on this system that is currently under review for DAC 2014.

Previously we implemented a meta-DSL, Forge, for declaratively specifying new DSLs and then re-implemented our existing Delite DSLs using Forge, which resulted in a 3x to 6x reduction in the lines of code for implementing each DSL. Using Forge we were able to prototype two new DSLs relatively quickly in the past quarter.

OptiWrangler is a DSL for data cleaning and preparation, helping the user turn unstructured data-in-the-wild into well-structured data that can be consumed by another DSL to perform the next phase of computation. The DSL is based on the existing DSL DataWrangler (http://vis.stanford.edu/wrangler/), and by implementing the DSL using Delite it should scale to much larger input data sizes than is currently possible with the existing Python and JavaScript implementations. OptiGraph is a DSL for performing graph analyses, such as computing the PageRank or betweenness centrality of vertices, and is optimized for “small-world” (e.g., social network) graphs. We are currently working on expanding the set of operations these new DSLs provide.
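
As a point of reference for the kind of analysis OptiGraph targets, PageRank can be computed by power iteration over an edge list. The sketch below is plain sequential Python, not OptiGraph syntax. The function name, the damping factor of 0.85, and the fixed iteration count are conventional choices assumed for illustration.

```python
def pagerank(edges, num_vertices, damping=0.85, iterations=50):
    """Return PageRank scores for a directed graph given as (src, dst) pairs.

    Rank held by vertices with no outgoing edges is redistributed uniformly
    so that the scores always sum to 1.
    """
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        dangling = sum(r for r, d in zip(rank, out_degree) if d == 0)
        base = (1.0 - damping) / num_vertices + damping * dangling / num_vertices
        new_rank = [base] * num_vertices
        for src, dst in edges:
            # Each vertex splits its rank evenly among its out-neighbors.
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Vertices 1 and 2 both link to vertex 0, which links nowhere.
    print(pagerank([(1, 0), (2, 0)], num_vertices=3))
```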

We are also working on improving the coverage and efficiency of Delite’s parallel code generation for multiple hardware targets through more advanced static analysis. We are currently investigating how to automatically generate nested parallel computations to fully exploit hardware with nested levels of parallelism (e.g., GPUs). Our previous approach could exploit only a single dimension of parallelism, which is not always sufficient to saturate the available hardware and maximize performance. We are working on extending these same analyses and transformations to clusters of GPUs as well.

This past quarter, we submitted a paper to AISTATS 2014 on generating efficient MCMC code from probabilistic programs. We improved the exposition of the paper and the presentation of the results. Among the improvements made this quarter, we sped up MCMC kernels that employ Dirichlet processes by moving more of the mechanisms into the generated C++ code.

We are also working on a compiler for probabilistic programs that generates formulas for an automated reasoning tool such as a Satisfiability Modulo Theories (SMT) solver. The overall idea is that the systematic search techniques of the solver may prove more efficient for certain classes of inference problems, given the right encoding.

The main project this quarter has been exploring the viability of this idea in a specific setting: the generation of constrained layouts for use in computer graphics. In this project, we work with Solitaire, a restricted subset of stochastic lambda calculus that operates only over integer- and boolean-valued random variables. The Solitaire compiler generates an SMT formula that is likewise restricted to integer and boolean variables. Satisfying assignments of the formula represent executions of the probabilistic program in the support of the posterior distribution, and they are used to encode the parameters of a 3D model or layout. We have shown that this method can efficiently generate (in ~1 minute) arrangements of geometric puzzle pieces, platform/walkway arrangements, and tile-based strategy game maps. These are all domains that include many hard constraints over their parameters. We plan to submit this work to SIGGRAPH 2014.
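
The correspondence between satisfying assignments and valid layouts can be illustrated on a toy problem: placing tiles on a one-dimensional strip so that none overlap. The sketch below is not Solitaire and does not call an SMT solver. To stay self-contained, it substitutes exhaustive enumeration of integer assignments for the solver's search, and the function name and constraints are hypothetical.

```python
from itertools import product


def satisfying_layouts(strip_len, widths):
    """Enumerate integer tile positions that satisfy all hard constraints.

    Each tile i has one integer variable, its left edge. The constraints are
    that every tile fits on the strip and that no two tiles overlap. Every
    returned tuple is one satisfying assignment, i.e. one valid layout.
    """
    layouts = []
    count = len(widths)
    for pos in product(range(strip_len), repeat=count):
        fits = all(pos[i] + widths[i] <= strip_len for i in range(count))
        disjoint = all(
            pos[i] + widths[i] <= pos[j] or pos[j] + widths[j] <= pos[i]
            for i in range(count)
            for j in range(i + 1, count)
        )
        if fits and disjoint:
            layouts.append(pos)
    return layouts


if __name__ == "__main__":
    # Two tiles of width 2 on a strip of length 5.
    for layout in satisfying_layouts(5, (2, 2)):
        print(layout)
```

Enumeration scales exponentially with the number of variables, which is the motivation for handing the same constraints to a solver with systematic search.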

After the SIGGRAPH 2014 submission, we plan to resume investigation of non-reversible MCMC kernels for probabilistic programs. In particular, we would like to investigate probabilistic analogs of abstract machines. Our hope is that this leads to insights for both the development of non-reversible MCMC kernels and the design of probabilistic hardware.