Application Defined Processors

By rebuilding a system's logic on the fly, this project can make one FPGA do the work of tens or hundreds of ordinary processors.

But Can a DEL Processor Run Linux?

DEL-based processors could run Linux, but do they
need to? Code segments within the Linux kernel
certainly might benefit from running on a DEL
processor, and applications within Linux
distributions also could achieve
higher performance. However, the role of an operating
system, and the kernel in particular, is to manage the
hardware such that applications achieve their required
performance levels. In other words, the OS is supposed
to stay out of the way and let applications consume
the hardware.

Applications do a lot more than intense
computation. They interact with users, read and write
files, display results and communicate with the world
through Internet connections. Thus, applications
require both computational resources and the services
of an operating system. Heavy computation with
high parallelism benefits from DEL processors. Although
serial code could run on a DEL processor, it is
better served by a traditional microprocessor.

The best combination of hardware for running most
applications is a mix of microprocessor and DEL
processors. This combination allows applications to
achieve orders of magnitude performance gains while
still running in a standard Linux environment with all
of the OS services and familiar support tools. The
portion of an application that is predominantly
sequential or that requires OS services can run on
the traditional microprocessor portion of a system,
while applications, and even portions of the OS, that
benefit from DEL parallelism run on a closely
coupled DEL processor.

SRC Computers, Inc.'s RC System

SRC has created systems that are composed of DEL processors and
microprocessors. SRC systems run Linux as the OS, provide a programming
environment called Carte for creating applications composed of both
microprocessor instructions and DEL, and support microprocessor and DEL
processor hardware in a single system.

The DEL Processor—MAP

The patented MAP processor is SRC's high-performance
DEL processor. MAP uses reconfigurable components
to implement control functions along with user-defined
compute, data prefetch and data access functions. This
compute capability is teamed with very high on- and
off-board interconnect bandwidth. MAP's multiple
banks of dual-ported On-Board Memory provide
11.2GB/sec of local memory bandwidth. MAP is
equipped with separate input and output ports with
each port sustaining a data payload bandwidth of 1.4GB/sec. Each MAP also
has two general-purpose I/O
(GPIO) ports, sustaining an additional data payload
of 4.8GB/sec for direct MAP-to-MAP connections
or data source input. Figure 3 presents the block
diagram of the MAP processor.

Figure 3. Block Diagram of MAP

Microprocessor with SNAP

The Dense Logic Devices (DLDs) used in these products are dual-processor
boards built around Intel's IA-32 line of microprocessors. These third-party commodity
boards are then equipped with the SRC-developed SNAP interface. SNAP
allows commodity microprocessor boards to connect to, and share memory
with, MAPs and Common Memory nodes that make up the rest of the SRC
system.

The SNAP interface is designed to plug directly into the microprocessors'
memory subsystem, rather than their I/O subsystem, allowing SRC systems to
sustain significantly higher interconnect bandwidths. SNAP uses separate
input and output ports with each port currently sustaining a data payload
bandwidth of 1.4GB/sec.

The intelligent DMA controller on SNAP is capable of performing complex
DMA prefetch and data access functions, such as data packing, strided
access and scatter/gather, to maximize the efficient use of the system
interconnect bandwidth. Interconnect efficiencies more than ten times
greater than a cache-based microprocessor using the same interconnect
are common for these operations.

SNAP can connect either directly to a single MAP or to SRC's
Hi-Bar switch for system-wide access to multiple MAPs,
microprocessors or Common Memory nodes.

SRC-6 System-Level Architectural Implementation

System-level configurations implement either a cluster of MAPstations
or a crossbar switch-based topology. Cluster-based systems, as shown
in Figure 4, utilize the microprocessor and DEL processor previously
discussed in a direct-connected configuration. Although this topology
fixes the affinity between a microprocessor and its DEL processor, it has
the benefit of using standards-based clustering technology to create very
large systems.

Figure 4. Block Diagram of Clustered SRC-6 System

When more flexibility is desired, Hi-Bar switch-based systems can
be employed. Hi-Bar is SRC's proprietary scalable, high-bandwidth,
low-latency switch. Each Hi-Bar supports 64-bit addressing and has 16
input and 16 output ports to connect to 16 nodes. Microprocessors,
MAPs and Common Memory nodes can all be connected to the Hi-Bar in any
configuration, as shown in Figure 5. Each input or output port sustains a
yielded data payload of 1.4GB/sec for an aggregate yielded bisection
data bandwidth of 22.4GB/sec per 16 ports. Port-to-port latency is
180ns with Single Error Correction and Double Error Detection (SECDED)
implemented on each port.

Hi-Bar switches also can be interconnected in multitier configurations,
allowing two tiers to support 256 nodes. Each Hi-Bar switch is housed
in a 2U-high, 19-inch-wide rackmount chassis, along with its power
supplies and cooling, for easy inclusion in rack-based servers.

Figure 5. Block Diagram of SRC-6 with Hi-Bar Switch

SRC servers that use the Hi-Bar crossbar switch interconnect can
incorporate Common Memory nodes in addition to microprocessors and
MAPs. Each of these Common Memory nodes contains an intelligent
DMA controller and up to 8GB of DDR SDRAM. The SRC-6 MAPs, SNAPs
and Common Memory (CM) nodes support 64-bit virtual addressing of all
memory in the system, allowing a single flat address space to be used
within applications. Each node sustains memory reads and writes with
1.4GB/sec of yielded data payload bandwidth.

The CM's intelligent DMA controller is capable of performing
complex DMA functions such as data packing, strided access and
scatter/gather to maximize the efficient use of the system interconnect
bandwidth. Interconnect efficiencies more than ten times greater than a
cache-based microprocessor using the same interconnect are common for
these operations.

In addition, SRC Common Memory nodes have dedicated semaphore circuitry
that is accessible by all MAP processors and microprocessors for
synchronization.