Mark/Teresa: Title (TBD)

Steve: CURRENT PROGRESS AND PLANNED ACTIVITIES

Orion Darkroom, DPDA (TBD)

[This is intended to be a short paragraph, just a sentence or two briefly summarizing the current status of the Stencil Engine, with the more detailed information later in the "Individual tasks" section.]

We continue to develop and debug both Orion, our computational photography DSL, and our DPDA-to-hardware compiler (the Stencil Engine Generator). We have been mapping algorithms of higher complexity to the flow, resulting in increased running time. To address this issue we have instituted a hierarchical flow that splits the design into several smaller pieces and synthesizes them in parallel on a cluster. This, along with other continuing work, should help to extend the hardware generator to support image processing algorithms that operate in three dimensions or operate on pyramids. For example, SIFT is a common feature tracking algorithm that uses a pyramid for scale invariance. On the other hand, multi-frame HDR requires the registration of multiple frames and an integration of those frames in time. See the Stencil Engine Generator section below for more details.

Short summary/overview of FPGA Platform (TBD)

Our prototype FPGA platform now has a VITA sensor in place of the earlier USB camera. We continue to map algorithms of higher and higher complexity to the platform, resulting in necessary optimizations to the design flow; see the "FPGA platform" section below for details.

FPU-Generator Test Chip (TBD)

In August 2013 we taped out a test chip based on our floating-point generator. In mid-February of this quarter we received chip samples from the manufacturer, after which we completed chip packaging and test board assembly. We have run a full set of functional tests, which showed that the chip and board function correctly. The fastest FPU module (the single-precision CMA design) runs at up to 1.2GHz and consumes 25mW. As a next step, we plan to do performance debugging on the chip and then collect the energy and performance data we need for an in-depth analysis of the efficiency of the FPU design.

Jing: see the attachment sent to Steve.

Individual Tasks

We have also made progress on each of the individual tasks, which we describe below.

HARDWARE GENERATORS (Final)

Our mantra in creating hardware is that engineers should create constructors and not instances. We call these hardware-oriented constructors "generators". To hierarchically create bigger and bigger generators, we use Genesis2, an extension for SystemVerilog that includes a software-like elaboration layer. A generator is very different from high-level synthesis (HLS), e.g. C-to-Verilog flows. HLS turns a C program into hardware such that a given set of inputs produces the same output in both the C code and the hardware. In a generator, however, the program is not a functional description of an algorithm but rather a procedural description of how a piece of hardware needs to be built. Put differently, the inputs to the generator program are architectural parameters, and its output is the hardware description.
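To make the constructor-versus-instance distinction concrete, here is a minimal sketch written in Python rather than Genesis2. The function name, module naming scheme, and the choice of a multi-input adder are all hypothetical and chosen only for illustration. The point is that the program's inputs are architectural parameters and its output is hardware description text.

```python
def generate_adder(width: int, num_inputs: int) -> str:
    """Return SystemVerilog text for a combinational multi-input adder.

    Hypothetical example: 'width' and 'num_inputs' are architectural
    parameters, and the return value is the hardware description itself.
    """
    if width < 1 or num_inputs < 2:
        raise ValueError("need width >= 1 and num_inputs >= 2")
    # Summing n values of w bits needs w + ceil(log2(n)) bits to avoid overflow.
    out_width = width + (num_inputs - 1).bit_length()
    ports = ",\n".join(
        f"    input  logic [{width - 1}:0] in{i}" for i in range(num_inputs)
    )
    total = " + ".join(f"in{i}" for i in range(num_inputs))
    return (
        f"module adder_w{width}_n{num_inputs} (\n"
        f"{ports},\n"
        f"    output logic [{out_width - 1}:0] sum\n"
        f");\n"
        f"    assign sum = {total};\n"
        f"endmodule\n"
    )


if __name__ == "__main__":
    # Two different instances produced by the same constructor.
    print(generate_adder(width=8, num_inputs=4))
    print(generate_adder(width=16, num_inputs=2))
```

Each call yields a different instance. An HLS flow, by contrast, would start from a functional description of the addition itself.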

Floating Point Generator

Jing: Test Chip (DONE)

This quarter we continued testing the FP chip. We built a second version of the test board, which fixed some noise decoupling issues and improved the peak performance of the chip. Jing Pu then made more comprehensive measurements on different chip samples under multiple operating conditions. At the nominal operating condition (1.0V supply and zero body bias), the four FPUs on the test chip achieved maximum frequencies of 1.1GHz, 0.97GHz, 1.5GHz and 0.81GHz, respectively. Looking at the effect of body biasing, we observed 20% total energy savings compared to nominal when we maintained the same throughput but reduced the supply voltage and forward-biased the chip. The table below compares one of our FPUs to Intel’s variable-precision FMA. Our FPU is roughly 50% more energy efficient and occupies only about half the die area.

(See docx file for word-formatted version.)

Metric              FPChip: FPU2 (single precision, CMA)   Intel's variable-precision FMA (ISSCC 2012)
Technology          28nm SOI                               32nm HKMG
Voltage             1.05V (zero body bias)                 1.05V
Pipeline depth      6                                      3
Frequency           1.45GHz                                1.45GHz
Total power         37.8mW (2 FLOP/cycle)                  56mW (2 FLOP/cycle)
Leakage             0.5mW                                  0.44mW
Energy efficiency   77 GFLOPS/W                            52 GFLOPS/W
Area                0.023mm2 (mapped area: 0.017mm2)       0.045mm2
Transistor count    ~2000 cells                            120k
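The energy-efficiency figures in the comparison above follow directly from the frequency, FLOP/cycle, and total power entries. The short check below reproduces that arithmetic. The function and variable names are ours, and the inputs are simply the values listed in the table.

```python
def gflops_per_watt(freq_ghz: float, flop_per_cycle: int, power_mw: float) -> float:
    """Energy efficiency = throughput (GFLOPS) / power (W)."""
    gflops = freq_ghz * flop_per_cycle
    return gflops / (power_mw / 1000.0)


# Values copied from the comparison table.
ours = gflops_per_watt(freq_ghz=1.45, flop_per_cycle=2, power_mw=37.8)
intel = gflops_per_watt(freq_ghz=1.45, flop_per_cycle=2, power_mw=56.0)

if __name__ == "__main__":
    print(round(ours), round(intel))   # 77 52
    print(f"efficiency ratio: {ours / intel:.2f}")
    print(f"area ratio: {0.023 / 0.045:.2f}")
```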

Xuan: FP Generator (Done)

In ongoing FPGen work, we set up a synthesis flow for floating-point division and square root, and generated timing, area and power reports. We are in the process of merging the division and square root unit with the floating-point generator.

Xuan: Convolutional Neural Nets (CNN) (Final)

We have recently begun leveraging the group's work in stencils and convolution to start looking at producing a generator for Convolutional Neural Networks (CNN). Xuan Yang, in particular, started visualizing the hardware and memory needed for this problem. Sometime after winter quarter we realized that wire energy could be an obstacle to building an energy-efficient chip for this workload, so we repartitioned the problem in a way that addresses the wire energy issue. We also built a model to understand the trade-off between the memory storing the image data and the memory storing coefficients, and, based on this partition, the trade-off between memory and hardware.

Xuan: Linear Algebra Generator (TBD)

This quarter we built the core of the Linear Algebra Generator, with a micro-programmable controller inside each processing element (PE). We also designed a new architecture for the controller that can optimize the hardware for matrix multiplication, which will be implemented soon.

John: Stencil Engine Generator (DONE)

We continued work on the flow that turns Darkroom applications into Stencil Engines. (Darkroom, formerly called Orion, is a domain-specific language targeted for image-processing algorithms.)

We have completed the majority of work on a new line buffer meant to be more area efficient in ASIC designs, as well as requiring much lower resource utilization for FPGA-based designs. This line buffer reorganizes how data is stored in the SRAM or BRAM array in a way that eliminates additional structures otherwise required for the specific stencil data flow. We plan to explore the specific trade-off space of the original design against this new design in the upcoming quarter.
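For readers unfamiliar with line buffers, the sketch below shows the generic idea in software: pixels arrive in raster order, and the buffer keeps only enough history to form each stencil window. It models a conventional line buffer, not the new SRAM/BRAM organization described above, and the function name and parameters are hypothetical.

```python
from collections import deque


def stencil_windows(pixels, width, k=3):
    """Yield (x, y, window) for every fully valid k x k window of a
    raster-order pixel stream with 'width' pixels per row."""
    # (k - 1) full rows plus k pixels is the minimum history a k x k
    # stencil needs when pixels arrive one at a time.
    history = deque(maxlen=(k - 1) * width + k)
    for i, pixel in enumerate(pixels):
        history.append(pixel)
        y, x = divmod(i, width)
        if y >= k - 1 and x >= k - 1:
            # history[0] is now the top-left pixel of the window whose
            # bottom-right corner is (x, y).
            window = [
                [history[r * width + c] for c in range(k)] for r in range(k)
            ]
            yield x, y, window


if __name__ == "__main__":
    # 4x4 test image with pixel values 0..15; apply a 3x3 box sum.
    for x, y, win in stencil_windows(range(16), width=4):
        print((x, y), sum(sum(row) for row in win))
```

In hardware the history is typically held in SRAM or BRAM rows plus a small register window. The deque here only stands in for that storage.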

Additionally, we have completed an analysis of the design with an eye toward localizing energy and area overheads relative to lower bound estimates. We found a significant overhead caused by over-margining the timing. While provisioning more margin increased the likelihood that a functional design would yield well from the flow with little intervention, this margin also increased design cost. At the time of this writing we have completed most of the work needed to allow for much smaller timing margins while retaining a hands-off design flow for users.

Next quarter we plan to finalize design documentation and open-source the project in time for its presentation at SIGGRAPH. Additionally, we are looking to bring up new applications and features on the Darkroom-to-Stencil-Engine flow in order to make the system more attractive to other researchers. Specifically, we are working on a behavioral model of the design to facilitate design iteration, design cost prediction, and precision optimization.

Artem/Steven: FPGA Platform and Drivers (DONE)

The FPGA platform is an effort to quickly prototype the Stencil Engine design flow. It includes a VITA sensor with a 4/3 lens mount that directly drives processing logic on a Zynq board, with results displayed through HDMI on a “viewfinder”. The platform will be used to implement several essential algorithms such as auto-focus and auto-exposure. We also plan to integrate a new sensor so that, once the code is finished, we can swap out the current sensor's daughter card and switch to the new sensor.

This quarter we integrated the VITA-2000 image sensor controller, HDMI display controller, and Darkroom-generated hardware into a single hardware design. We extended our Linux driver software to support continuous processing with a set of large contiguous buffers, and stitched the drivers together to pass images between the camera and image processing hardware. The result is a full-HD (1080p) video stream running through a stencil path generated from Darkroom, with a live output display on an HDMI monitor.

To enable more complex hardware configurations, we ported our design to the larger Xilinx Zynq ZC706 FPGA platform, which has roughly 4x the FPGA resources of the previous ZC702 board.

Andrew: Generator methodology and the Driver Generator

As a practical application for our ongoing generator methodology research, we have put together an automatic driver generator for the Stencil Engine. In particular, IP generators like the Stencil Engine can be built with knowledge about the bandwidth requirements of each of their individual elements, and that knowledge can be used to automatically generate a Unix driver. Andrew Danowitz has been spearheading the work to build such a driver generator for the Stencil Engine. His implementation makes use of the Linux driver model. While it specifically targets the Stencil Engine architecture, the observations and mechanisms used apply to the broader problem of software generation for fixed hardware accelerators.

Team member Stephen Bell created a driver template for basic access patterns available in the Linux operating system. These features represent a fairly standard Linux driver. Using information produced from the Stencil Engine generator, we fill in the template and automatically build the necessary driver(s) for the generated Stencil Engine hardware. The features provided by the driver generator offer full software-level access to the ISP generator hardware. Programs can easily transfer images stored in memory to the accelerator, and have them processed in real time.
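The template-filling step can be pictured as follows. This is a minimal sketch under our own assumptions: the parameter names (name, width, height, bytes_per_pixel) and the generated macros are hypothetical stand-ins for whatever the Stencil Engine generator actually emits, and a real driver template would contain much more than a header fragment.

```python
from string import Template

# Hypothetical fragment of a driver template; placeholders are filled from
# parameters reported by the hardware generator.
DRIVER_TEMPLATE = Template(
    "/* Auto-generated for $name; do not edit by hand. */\n"
    "#define ${NAME}_IMAGE_WIDTH $width\n"
    "#define ${NAME}_IMAGE_HEIGHT $height\n"
    "#define ${NAME}_BYTES_PER_PIXEL $bpp\n"
    "#define ${NAME}_BUFFER_BYTES $buffer_bytes\n"
)


def render_driver_header(params: dict) -> str:
    """Fill the template from a dict of (assumed) generator outputs."""
    buffer_bytes = params["width"] * params["height"] * params["bytes_per_pixel"]
    return DRIVER_TEMPLATE.substitute(
        name=params["name"],
        NAME=params["name"].upper(),
        width=params["width"],
        height=params["height"],
        bpp=params["bytes_per_pixel"],
        buffer_bytes=buffer_bytes,
    )


if __name__ == "__main__":
    print(render_driver_header(
        {"name": "stencil0", "width": 1920, "height": 1080, "bytes_per_pixel": 2}
    ))
```

Because the buffer size is derived from the hardware parameters rather than typed in by hand, the software stays consistent with whatever instance the generator produced.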

While the specialized template currently works only with John Brunhaver’s Stencil Engine, the techniques used are fairly general. After building driver generators for a number of IP blocks, we will likely be able to identify commonalities and design patterns among the various types of driver functions implemented. This would allow us to create a more general template for creating efficient drivers across a wide range of IP blocks.

DOMAIN SPECIFIC LANGUAGES

Kunle/Kevin: Delite DSL infrastructure (DONE)

We are continuing our work studying the performance characteristics of graph analytics problems using our DSL OptiGraph as a platform for generating different low-level implementations. We implemented a dynamic load balancing scheduler in Delite to help combat load imbalance observed when performing parallel operations over skewed graphs, and have shown speedups of 2-5x over GraphLab for classic applications such as PageRank and triangle counting. We have also observed that the internal layout of the graph in memory can have a very large impact on overall application performance and that choosing a single layout for the entire graph (every set of neighbors) is often suboptimal. We are currently investigating which layouts work best in which situations as well as how to automatically select the best layout(s) within the DSL based on the graph structure.
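The effect of load imbalance on skewed graphs can be illustrated with a toy simulation. This is not Delite's scheduler. It only compares a static equal-sized partition with a generic "idle worker grabs the next chunk" policy, using a made-up per-vertex cost list in which a few high-degree vertices sit at the front.

```python
import heapq


def static_makespan(costs, workers):
    """Split the vertex list into equal-sized contiguous blocks, one per worker."""
    block = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + block]) for i in range(0, len(costs), block))


def dynamic_makespan(costs, workers, chunk=1):
    """Whichever worker becomes idle first takes the next chunk of vertices."""
    finish_times = [0] * workers  # a list of equal values is already a valid heap
    for i in range(0, len(costs), chunk):
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + sum(costs[i:i + chunk]))
    return max(finish_times)


if __name__ == "__main__":
    # Hypothetical skewed workload: 8 heavy vertices, 992 light ones.
    costs = [100] * 8 + [1] * 992
    print("static :", static_makespan(costs, workers=4))
    print("dynamic:", dynamic_makespan(costs, workers=4))
```

With this input the static split leaves one worker with all the heavy vertices, while the dynamic policy spreads them out. Real schedulers must also weigh chunk size against scheduling overhead.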

We are also working on ways to make Delite DSLs more user-friendly and easier to debug. As a first step in this process we have implemented a visual performance debugger that breaks down application execution time by kernel and thread. In addition to raw execution times it also computes multiple statistics such as thread percent idle time, communication costs, and memory usage over time. It also includes the application source code in the same window and automatically maps the execution of the generated code back to the corresponding source code. In order to accomplish this last piece we added the ability within the Delite compiler to automatically obtain the application source info (line number, etc.) for each call of a DSL operation and keep this information around through the various transformations and optimizations Delite performs.

Over the last quarter, I have sketched out block diagrams that implement CESK [1] machines using digital logic. The next steps are to implement a simulation, test, and identify opportunities for optimization. In particular, representing call stacks and environments can take up a lot of space. I anticipate that the ways to deal with this will mainly consist of using static analysis to identify a smaller representation of environments and call stacks at compile time, not unlike techniques used for optimizing functional programming languages for CPUs.
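As background, a CESK machine evaluates a program by stepping a state made of four components: Control, Environment, Store, and Kontinuation. The sketch below is a small software model for the call-by-value lambda calculus with integer literals, written under our own assumptions about term encoding. It is not the digital-logic design described above, but it shows where environments and continuations (the call stack) accumulate.

```python
def evaluate(expr):
    """Run a CESK machine on a term and return the resulting value.

    Terms (hypothetical encoding):
      ("num", n) | ("var", name) | ("lam", param, body) | ("app", fn, arg)
    Values: ("num", n) or a closure ("clo", param, body, env).
    """
    control, env, store, kont = expr, {}, {}, ("halt",)
    while True:
        tag = control[0]
        if tag == "app":
            # Evaluate the function first; remember the argument and its env.
            _, fn, arg = control
            control, kont = fn, ("arg", arg, env, kont)
            continue
        if tag == "var":
            value = store[env[control[1]]]  # env: name -> address, store: address -> value
        elif tag == "lam":
            value = ("clo", control[1], control[2], env)
        elif tag == "num":
            value = control
        else:
            raise ValueError(f"unknown term: {control!r}")

        # Hand the value to the current continuation.
        if kont[0] == "halt":
            return value
        if kont[0] == "arg":
            _, arg, saved_env, next_kont = kont
            control, env, kont = arg, saved_env, ("fun", value, next_kont)
        else:  # "fun": the argument value is ready, so enter the closure body
            _, fn_value, next_kont = kont
            if fn_value[0] != "clo":
                raise TypeError("attempt to apply a non-function")
            _, param, body, clo_env = fn_value
            addr = len(store)  # fresh address; nothing is ever freed here
            store[addr] = value
            control, env, kont = body, {**clo_env, param: addr}, next_kont


if __name__ == "__main__":
    # ((lambda x. lambda y. x) 1) 2
    k_comb = ("lam", "x", ("lam", "y", ("var", "x")))
    print(evaluate(("app", ("app", k_comb, ("num", 1)), ("num", 2))))
```

Note that every application copies an environment and grows the store, which is the space cost the paragraph above proposes to reduce through static analysis.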

Over the last quarter, I formulated an approach that realizes the notion of learning to sample efficiently. My approach is based on the fact that the steps of a systematic search algorithm themselves provide a lot of information about the landscape of the distribution; one only needs a way to competently extract such information, and then in the process of producing even one point in the support, much will be learned about the distribution. In a first cut, I used the fact that there are certain subgroups of the satisfaction-preserving automorphism group on assignments induced by a propositional probabilistic formula (a probabilistic program trace or graphical model) that can be computed very efficiently. This leads to each step of systematic search potentially ruling out (or in) a number of assignments that is proportional to the factorial (which is generally how big groups are) of the number of search steps taken so far.

So far, on a problem with a sum constraint, this approach took roughly three orders of magnitude fewer search steps to find an approximation of the normalizing constant than a naive enumeration approach.
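To give a feel for why symmetry helps on a sum constraint, the toy example below counts binary assignments whose entries sum to a target. It is only an illustration under strong simplifying assumptions (uniform weights, full permutation symmetry over the variables) and is not the search algorithm described above. Naive enumeration visits every assignment, while the symmetry-aware version visits one representative per orbit and multiplies by the orbit size.

```python
from itertools import product
from math import comb


def count_naive(n, target):
    """Enumerate all 2**n binary assignments; return (count, assignments visited)."""
    visited = satisfying = 0
    for bits in product((0, 1), repeat=n):
        visited += 1
        if sum(bits) == target:
            satisfying += 1
    return satisfying, visited


def count_with_symmetry(n, target):
    """Visit one representative per orbit under permutations of the variables.

    An orbit is determined by the number of ones, and its size is C(n, ones).
    """
    visited = satisfying = 0
    for ones in range(n + 1):
        visited += 1
        if ones == target:
            satisfying += comb(n, ones)
    return satisfying, visited


if __name__ == "__main__":
    print(count_naive(12, 4))          # (495, 4096)
    print(count_with_symmetry(12, 4))  # (495, 13)
```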

3. Shred.js: tracing probabilistic programs in JavaScript

Over the last quarter, I implemented the slicing and allocation removal semantics of the AISTATS 2014 paper.

[1] Matthias Felleisen. The Calculi of Lambda-v-CS Conversion: A Syntactic Theory of Control and State in Imperative Higher-Order Programming Languages. Ph.D. thesis, Indiana University, 1987.

Mark/Teresa: ADMINISTRATION (TBD)

Equipment

No major equipment purchases were made during the reported period.

Personnel Changes

Ofer Shacham left Stanford to take a position with Google. He will continue to consult with the project and support the Genesis2 tool he created.

Information from Trips

Conversations with many companies continued. We are working closely with NVIDIA and Google on various aspects of the hardware, and we discussed the imaging part of the project with both ST and Sony.