* Numerical Format
o Signed magnitude, fixed-point, 16-bit
o Complex elements, 32 bits/element

* Experimental steps
o No high-speed input to system, so data must
be pre-loaded
o XModem over UART provides file transfer
between testbed and user workstation
o User prepares measurement equipment,
initiates processing after data is loaded
through UART interface
o Processing completes relatively quickly;
output file is transferred back to user
o Post-analysis of output data and/or
performance measurements
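The XModem transfer mentioned above can be sketched in C. This is an illustrative framing routine for the original 128-byte checksum variant of the protocol, not the testbed's actual UART driver (which is not shown on the slides):

```c
#include <stdint.h>
#include <stddef.h>

#define SOH 0x01          /* start-of-header for 128-byte packets */
#define XMODEM_DATA 128

/* Frame one XModem packet (original checksum variant) into
 * out[XMODEM_DATA + 4]. blk is the 1-based, wrapping block number;
 * short payloads are padded with 0x1A (SUB), per convention. */
static void xmodem_frame(uint8_t blk, const uint8_t *data, size_t len,
                         uint8_t out[XMODEM_DATA + 4])
{
    uint8_t sum = 0;
    out[0] = SOH;
    out[1] = blk;
    out[2] = (uint8_t)(255 - blk);   /* complement of block number */
    for (size_t i = 0; i < XMODEM_DATA; i++) {
        uint8_t b = (i < len) ? data[i] : 0x1A;
        out[3 + i] = b;
        sum = (uint8_t)(sum + b);    /* 8-bit arithmetic checksum */
    }
    out[3 + XMODEM_DATA] = sum;
}
```

The fixed 132-byte packet and simple checksum are what make XModem practical over a bare UART on an embedded board with no file system.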

* Processing starts when all data is buffered
* No inter-processor communication during processing
* Double-buffering maximizes co-processor efficiency
* For each kernel, processing is done along one dimension
* Multiple "processing chunks" may be buffered at a time:
o CFAR co-processor has 8 KB buffers, all others have 4 KB
buffers
o CFAR works along range dimension (1024 elements or 4 KB)
o Implies 2 "processing chunks" processed per buffer by CFAR
engine
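The chunk arithmetic above can be written out explicitly, assuming the 32-bit complex element format from the numerical-format slide (16-bit real + 16-bit imaginary):

```c
/* Buffer-sizing arithmetic from the slide. */
enum {
    BYTES_PER_ELEM  = 4,                          /* 32 bits/element   */
    RANGE_DIM       = 1024,                       /* elements per line */
    CHUNK_BYTES     = RANGE_DIM * BYTES_PER_ELEM, /* one chunk = 4 KB  */
    CFAR_BUF_BYTES  = 8 * 1024,                   /* CFAR buffers      */
    OTHER_BUF_BYTES = 4 * 1024                    /* all other engines */
};
```

One 1024-element range line occupies exactly 4 KB, so CFAR's 8 KB buffer holds two processing chunks per fill while every other engine's 4 KB buffer holds one.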
* Single co-processing engine kernel execution times for an
entire data cube
o CFAR only 15% faster than Doppler processing, despite 39%
faster buffer execution time
o Loss of performance for CFAR due to under-utilization
o Equation to lower right models execution time of an individual
kernel to process an entire cube (using double-buffering)
o Kernel execution time can be capped by both processing time as
well as memory bandwidth
o After a certain point, higher co-processor frequencies or more
engines per node become pointless
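A hypothetical form of such a double-buffered time model is sketched below; the slide's actual equation (in its lower-right figure) is not reproduced here. With double-buffering, each buffer's SDRAM transfer overlaps the previous buffer's compute, so the steady-state cost per buffer is whichever of the two is larger:

```c
/* Hypothetical double-buffered execution-time model (assumption,
 * not the slide's exact equation). */
static double kernel_time(int n_buffers,
                          double t_compute,   /* engine time per buffer */
                          double t_transfer)  /* SDRAM time per buffer  */
{
    double t_max = (t_compute > t_transfer) ? t_compute : t_transfer;
    /* The first buffer's input transfer has nothing to overlap with. */
    return t_transfer + n_buffers * t_max;
}
```

Once `t_compute` falls below `t_transfer`, the kernel is memory-bound: raising the engine clock (or adding engines) no longer changes the total, which is the sense in which higher frequencies become pointless.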

* Novel node architecture introduced and demonstrated
o All processing performed in hardware co-processor engines
o Apps developed in Xilinx's EDK environment using C; custom API enables control of hardware resources through software
* External memory (SDRAM) throughput at each node is critical for system performance in systems with hardware
processing engines and integrated high-performance network
* Pipelined decomposition may be better for this system, due to co-processor (under)utilization
o If co-processor engines sit idle most of the time, why have them all in each node?
o With sufficient memory bandwidth, multiple engines could be used concurrently
* Parallel data paths are a nice feature, at cost of more complex control logic and higher potential development cost
o Multiple request ports to SDRAM controller improve concurrency, but do not remove bottleneck
Different modules within design can request and begin transfers concurrently through FIFOs
SDRAM controller can still only service one request at a time (assuming one external bank of SDRAM)
Benefit of parallel data paths decreases with larger transfer sizes or more frequent transfers
o Parallel state machines/control logic take advantage of FPGA's affinity for parallelism
o Custom design, not standardized like buses (e.g. CoreConnect, AMBA, etc.)
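The request-port bottleneck described above can be modeled in software. This is a hypothetical sketch (the testbed's controller logic is not shown): each module enqueues into its own FIFO concurrently, but a single drain routine services one request per pass, round-robin, mirroring one external bank of SDRAM:

```c
#include <stddef.h>

#define NUM_PORTS 4
#define FIFO_DEPTH 8

/* One request FIFO per requesting module; head/tail are free-running
 * counters, indexed modulo FIFO_DEPTH. */
struct req_fifo {
    int req[FIFO_DEPTH];   /* transfer sizes, for illustration */
    size_t head, tail;
};

static int fifo_push(struct req_fifo *f, int r)
{
    if (f->tail - f->head == FIFO_DEPTH) return 0;  /* full */
    f->req[f->tail++ % FIFO_DEPTH] = r;
    return 1;
}

/* Service exactly one request across all ports per call -- the
 * single-controller bottleneck. Returns the serviced transfer
 * size, or -1 if every FIFO is empty. *rr holds round-robin state. */
static int sdram_service_one(struct req_fifo ports[NUM_PORTS], size_t *rr)
{
    for (int i = 0; i < NUM_PORTS; i++) {
        struct req_fifo *f = &ports[(*rr + i) % NUM_PORTS];
        if (f->head != f->tail) {
            *rr = (*rr + i + 1) % NUM_PORTS;
            return f->req[f->head++ % FIFO_DEPTH];
        }
    }
    return -1;
}
```

Pushes from different modules never block each other, but completions are serialized through the one service call, which is why the benefit shrinks as transfers get larger or more frequent.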
* Some co-processor engines could be run at slower clock rates to conserve power without loss of performance
* 32-bit fixed-point numbers (possibly larger) required if not using floating-point processors
o Notable error can be seen in processed data simply by visually comparing to reference outputs
o Error will compound as data propagates through each kernel in a full GMTI application
o Larger precision means more memory and logic resources required, not necessarily slower clock speeds

* Future Research
o Enhance testbed with more nodes, more stable boards, Serial RapidIO
o Complete Beamforming and STAP co-processor engines, demonstrate and analyze full GMTI application
o Enhance architecture with direct data path between processing SRAM and network interface
o More in-depth study of precision requirements and error, along with performance/resource implications