Overview

Please note that this page is not about what the Elk language imposes on compiler implementations. Instead, it describes our specific compiler implementation. Some design decisions may not be the best ones in the long term, but they were reasonable for quickly prototyping a research compiler. The main purpose of this page is to give readers enough information to decide whether our compiler framework would work for their purposes.

Output C++ Code

The Elk compiler is a source-to-source compiler: instead of assembly code, it generates C++ code that is designed to be architecture independent. A separate C++ file, each with its own main function, is generated for each core. This is in contrast to spawn-join concurrent programming models such as pthreads, where control begins in a single thread and additional execution contexts are spawned on demand. In our implementation, each core has its own execution context from the beginning and its own address space. In some respects, our model is similar to that of MPI. This is not a coincidence: Elk is a stream programming system whose major communication primitives are DMA operations, which can be viewed as a form of message passing. If your target architecture and tool chain have better support for pthread-like concurrency models, StreamIt may be a better tool for you, since one of the main output formats generated by StreamIt uses pthreads. However, if your target architecture and tool chain have better support for DMA operations with software-managed local memory, we think that Elk is the better choice.

In the generated C++ code, architecture dependencies mainly come from how streaming (i.e., DMA) operations are expressed. We abstract these architecture-dependent parts at two levels: a concurrent queue library and intrinsic functions. Most architecture-dependent parts can be expressed through our concurrent queue library interface, the higher level of the two abstractions. The lower-level abstraction, intrinsic functions, also supports primitives that can implement shared-memory semantics: although each per-core C++ file has its own address space (i.e., variables with the same name in different per-core files do not reference the same address), we can share variables by using certain intrinsic functions.

Concurrent Queue Interface

The main abstraction for architecture-dependent parts is the concurrent queue interface. There are three types of queues: localQueue, sendQueue, and receiveQueue. A localQueue is a queue for actors that are mapped to the same core. A sendQueue is a queue through which an actor sends data to its consumer(s) on remote cores. A receiveQueue is a queue through which an actor receives data from its producer(s) on remote cores. Each queue has a name, and a sendQueue and a receiveQueue with the same name are logically a single concurrent queue: e.g., a datum enqueued to a sendQueue named "foo" will be dequeued from a receiveQueue named "foo". These same-named queues are the primary communication mechanism in the C++ code output by the Elk compiler. The following code snippets show the interface of the single-producer, single-consumer (SPSC) sendQueue and receiveQueue.

Intrinsic Functions

Content will be added shortly.

Front-end

We use the ANTLR parser generator to parse .elk files and build an abstract syntax tree (AST) of the whole program. We then generate three-address code (TAC) and a control flow graph (CFG) for each actor's code as an intermediate representation (IR).

Important Transformations

Bundle Flattening

We flatten the actor hierarchy expressed by bundle constructs in the Elk language. The bundle construct exists to facilitate code reuse and modularization; during compilation, it is often more convenient to flatten the hierarchy.

Constant Propagation

During bundle flattening, we propagate constants through the hierarchy and within actor computation code. On many occasions, we use the Janino embedded Java compiler to fold a complex expression into a constant value.

Rate Analysis

We propagate real-time constraints associated, via the rate construct, with streams or actors. For example, if a low-pass filter actor is followed by a high-pass filter, and the real-time constraint on the low-pass filter's input stream is 1 kHz, we propagate the constraint to the high-pass filter and set the high-pass filter's real-time constraint to 1 kHz as well (we assume that both the low-pass filter and the high-pass filter consume and produce one stream token per firing).

Estimate Actor Computation Rates

We estimate the number of instructions per actor execution using static analysis. We aggressively apply compiler optimizations such as conditional constant propagation, dead code elimination, and partial redundancy elimination to account for optimizations that will be performed by the C++ compiler. When we encounter an if-else branch, we select the longest path. When we encounter a loop whose iteration count cannot be determined at compile time, we mark the actor as a "dynamic" actor (to override this, the programmer can manually specify the computation rates of certain actors). By combining the results of the previous rate analysis phase with those of this phase, we can associate each actor with the computation requirement needed to satisfy its real-time constraints. For example, if the propagated rate for a DCT actor is 1 kHz and the estimated actor computation rate is 400 Hz, the computation requirement of the DCT actor is 2.5 cores.

Actor Fission/Fusion

We fuse and fission actors so that we use the minimum number of cores that satisfies the real-time constraints while maintaining good load balance. If real-time constraints are specified by the programmer, we fission stateless actors until their computation requirements are satisfied. For example, if a DCT actor requires 2.5 cores, we replicate the DCT actor three times. After actor fission, we feed the modified application graph with replicated actors to the graph partitioning software METIS, which fuses actors with small computation requirements onto the same core to achieve good load balance.

Team Scheduling (Actor Aggregation and Amortization)

After actor fission and fusion, we aggregate certain actors that are mapped to the same core into a single actor so that the synchronization overhead between them can be eliminated. In certain cases, we also multiply the number of actor executions per synchronization (queue empty or full check) to amortize the overhead associated with synchronizations and DMA initiations. This optimization significantly improves throughput while satisfying the buffer-space constraints imposed by limited local memory. The key novelty of our approach is a generic buffer-space computation method that finds the minimum buffer capacities that avoid serialization or deadlock, which allows us to flexibly apply actor aggregation and amortization in an arbitrary order. The details of this phase are described in our SPAA10 paper.

Back-end

Physical Core Assignment

In the actors-to-cores mapping produced by the actor fission/fusion phase, the core numbers are virtual core numbers. In this phase, we find a virtual-cores-to-physical-cores mapping that minimizes communication distances, using a simulated annealing method. For more details, refer to David Black-Schaffer's PhD thesis.