Accelerator Centric Computing

Introduction

Below is a set of notes that describe some design choices and ideas for the ACC architecture. There are more parameters and options which are not addressed here, and those which are, are not necessarily the result of consensus. Please contribute to this document by correcting/adding/removing at will, but maintain the structure. Don't worry about losing changes, the wiki maintains a history of this document.

Conventions

MMA: Matrix Multiply Accelerator. In this text, we simplify to the following operation: A[16x16] * B[16x16] = C[16x16], where A, B, and C are arrays of doubles

TQ: Task Queue

TQM: Task Queue Manager

RMS: Resistive Memory Substrate

Accelerator tile

Contains

Multiple accelerators

of different type. Allows for data sharing across different types of accelerators, which implies better bandwidth utilization and a fast accelerator-pipeline implementation. However, it would result in a non-uniform access pattern to scratchpad memory, which could lead to higher parallelization requirements (number of ports) and, perhaps, suboptimal accelerator memory controllers.

of the same type. Data sharing across accelerators of the same type does not always make sense, and the base+bound address resolution approach does not make data sharing at this level easy anyway.

One or more accelerators should be usable by different GPCs at the same time.

Scratchpad memory - highly parallel, so that it can effectively support double buffering and the bandwidth requirements for full accelerator utilization.

Communication interface. Can be used for direct, accelerator-to-accelerator communication.

Multiple memory controllers, maintaining direct connection to the resistive memory substrate, so as to accommodate the high-bandwidth requirements of the accelerators. This memory can serve as another communication medium between accelerators, especially if a tile contains solely accelerators of a certain type: in a pipeline of accelerators the preceding accelerator would have to write data back to this RMS, before the succeeding accelerator could read them (from RMS, into its own scratchpad)

For the following, whenever an assumption is made about what accelerator tiles look like, it is assumed that all tiles are of the same type.

Resistive Memory Substrate

Shared among tiles, hence among all accelerators, and the general purpose cores.

Allocated as chunks of memory on a per-process basis: a process that needs to use the accelerators should get sufficient memory at this level.

Appears as an explicitly managed L3 cache. It is read and written, using DMA, by the main memory controller (initiator of DMA is the task queue manager) or the accelerators' memory controllers (initiator of DMA is the accelerator).

Every process can have multiple chunks of memory at this level, but an accelerator can only access contiguous chunks, using a {base, bound} pair of physical RMS addresses.

As part of a more complete description, this revision of the document suggests that this level of memory be managed by the TQM. However, it might be beneficial if it were instead managed by the operating system, as part of a unified memory space. In that case, writing to RMS memory could be something as simple as an ordinary assignment, which might be (automatically) replaced by memcpy calls and initiated DMA whenever possible.

Scratchpad Memory

In close proximity (physically) to the accelerators.

Shared among the accelerators of a tile - not unique for every accelerator.

Allocated by the task queue manager before the accelerator can execute.

The only memory addressable by accelerators. Every accelerator has a {base, bound} pair of physical scratchpad addresses, used for safe and fast address resolution - addresses outside the [base, base+bound) region are not accessible by the accelerator.

Limited in size, but large enough to support at least the equivalent of two input-widths for every accelerator (double buffering). For example, the MMA needs 2 KB (16x16xsizeof(double)) before it can even execute, and 4 KB to enable double buffering. A multiple of this must be available in scratchpad to sustain the accelerator bandwidth requirements.

Populated through DMA requests from RMS. Requests are served by the accelerator memory controllers.

Task queues

There exists one accelerator task queue for every type of accelerator. Every task queue is unique and can be identified by its id.

Whether the task queue appears as a blocking FIFO queue or some other data structure is a matter of policy implementation. Assume a FIFO queue for now; later, software running on the TQ controller can dynamically schedule tasks under various constraints (power, performance, priority). Note that the scheduling algorithm runs in software.

The contents of the task queue lie in main memory, in the kernel address space. This allows it to expand without limitations and be protected from arbitrary user writes.

The TQ can only be read or written by the task queue manager. The only way it is visible to the rest of the system is through expand/shrink requests on (hopefully) rare occasions.

The TQ contains task descriptors, which are described through a software/hardware contract. Only pointers to task descriptors are stored in the TQ, which makes all TQs structurally identical.

Example: for the MMA, a task descriptor could be { &A, &B, &C, n, m, flag, &callback }, where

&A, &B are the virtual addresses, in user space, of the matrices to be multiplied, and &C is the address of the matrix where results will be saved

Matrices are of sizes A[n x m] * B[m x n] = C[n x n]

The flag can have four states: NOT_EXECUTING, EXECUTING, EXECUTED, GAVE_UP.

The callback pointer can be

a routine that will execute on the general purpose core, and should have the same result with the accelerator, should the task queue manager decide this would benefit the system

a follow-up accelerator instruction which indicates a pipeline-build-request (dedicated accelerator use mode)

null, if none of the previous applies

A task descriptor is automatically associated with a unique task id.

Task queue manager

Flexible, programmable, dedicated general purpose core.

Loaded with the task queue management policy at boot time, or through some on-the-fly policy selection mechanism.

Responsible for maintaining

accelerator usage statistics (which and how many accelerators are available),

scratchpad usage statistics (how much on every tile is available),

resistive memory usage statistics (how much is available)

Upon receiving an en-queue task request, the TQM

reads, from user address space, the task descriptor into the queue,

sets the flag to NOT_EXECUTING,

gives the task a unique id.

Upon selecting a task from the queue to execute, the TQM

Sets the flag of the task to EXECUTING.

Requests the proper number of accelerators. What is proper is a matter of policy implementation, e.g. max-available if targeting performance

Allocates sufficient RMS memory chunks, if possible contiguous pages (some bin-packing algorithm should apply). If the management of memory at this level is left to the OS, it might be reused and page-able, with a wider page than the rest of the system. Also/alternatively, it can be user managed.

If the TQM realizes that the policy heuristic for accelerator use cannot be met under the current usage constraints, it sets the flag to GAVE_UP, calls the callback function on the GPC, and moves on to the next task; otherwise ...

The TQM initiates DMA requests to copy data from the virtual address space of the process to the RMS. Should an address resolution error occur during this time (i.e., a page fault), it would need to be handled by the OS. A blocking task queue implementation would automatically try to work on another task in the meantime. Moreover, an address resolution table could be maintained by the TQM, on a per-process basis, for every process that wants to use the task queue.

Reads the task descriptor and instructs the memory controller to initiate DMA requests from user address space to the RMS.

The task queue manager is responsible for defining how large the chunks of memory transferred to the RMS must be before an accelerator is actually started. The minimum is evidently one accelerator-input-width, but two or more input-widths should be present in RMS so that the accelerator can overlap computation with DMA requests to RMS.

As soon as the RMS requirement is met, the actual accelerator call is initiated.

Once the accelerator is done, the results are written from scratchpad to RMS by the accelerator memory controllers.

Results are written back from RMS to the user address space (main memory). Page faults could occur here as well. This write-back can be left to the library implementation, and could be different if RMS appears as part of a unified address space.

Once the accelerating function execution is done, the status of the task is set to EXECUTED and it can be removed from the queue.

General Purpose Core

A process does not have to trap to the operating system to insert work into a queue.

Upon decoding an accelerator instruction, a GPC sends the task queue manager a request to read from the process' address space the respective task descriptor.

The GPC can quiesce until the flag is set to EXECUTED before proceeding to the next instruction in its pipeline (blocking mode).

A simple hardware locking mechanism can be acquired by the GPC before pushing anything to the task queue. More complicated, concurrent queues might be tried instead.

Programmability

When used by the same process, the multiple accelerators of the same type in a tile can appear as a single, very wide accelerator.

There is an added programmability benefit in maintaining unified addressing across the different levels of the memory hierarchy: building accelerators that can work fast with pointers.

It helps to think of the task queues as something similar to the queue employed to hold floating point operations in the UltraSparc T1 (Niagara) FPU (shared among 8 GPCs).

The callback function description in the task descriptor above is a very rough idea of how we can support fall-back-to-GPC and pipelines of accelerators. For the latter, the task queue manager would not set an accelerator call to EXECUTED before its chain of dependencies (callback accelerator calls) has executed as well. The task ids and queue ids can be used to build up this chain.