Streaming: Area Optimization

This example shows how to use the subsystem-level streaming optimization in HDL Coder.

Introduction

Streaming is a subsystem-wide optimization supported by HDL Coder for implementing area-efficient hardware. By default, the coder implements hardware that is bit-accurate and cycle-accurate to the Simulink model. As a result, vector datapaths in Simulink can map inefficiently to hardware. Consider a Product block in Simulink that operates on two 64-element vector inputs and generates a 64-element vector output. This block executes 64 multiplications in a single Simulink time step. To remain cycle-accurate, HDL Coder maps this block to 64 parallel multipliers in the generated HDL code. Given that multipliers are expensive on FPGAs, this is an inefficient hardware implementation.

Streaming is an optimization that converts a vector datapath into either a scalar datapath or a narrower vector datapath. The idea is to serialize the execution of parallel hardware, so that resources can be shared and the vector data can be time-multiplexed over the shared resources.

Consider the following example model that operates on a 24-element vector datapath. This model contains 3 vector gains and 2 vector adds, resulting in a hardware implementation containing 72 multipliers (3 x 24) and 48 adders (2 x 24). This can be confirmed from the resource utilization report produced during HDL code generation.

Streaming to Scalarize the Datapath

An area-efficient implementation of the same model can be realized by setting the 'StreamingFactor' implementation parameter on the subsystem to a positive integer. This parameter specifies the extent to which the datapath is scalarized: the higher the value, the greater the area savings. In this example, we have a 24-element vector datapath; to fully scalarize it, specify a 'StreamingFactor' value of 24. This can be done either through the HDL block properties dialog (opened by right-clicking on the 'Controller' subsystem) or through the command 'hdlset_param'.

Generating HDL code with 'StreamingFactor' set to 24 produces HDL that uses only 3 multipliers and 2 adders (see the resource report after HDL code generation). The code-generation model explicitly reflects the streaming architecture. The elements of the vector datapath are streamed at a faster rate (in this case 24x faster, denoted in red), and all computations operate on a scalar datapath. At the outputs, the vector is reconstructed using a tapped delay, and the output is sampled back at the slower rate (in green).
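The time-multiplexing idea can be illustrated with a small behavioral model. This is only a sketch in Python, not generated HDL and not the coder's algorithm. The dataflow y = g1*a + g2*b + g3*c, the gain values, and the function names are assumptions chosen to give 3 gains and 2 adds; the actual topology of the example model is not described here. The sketch also ignores the extra cycle of latency added by the rate transitions.

```python
def parallel_step(a, b, c, gains):
    """Reference behavior: one slow-rate step computes all elements at once."""
    g1, g2, g3 = gains
    return [g1 * x + g2 * y + g3 * z for x, y, z in zip(a, b, c)]


def streamed_step(a, b, c, gains):
    """Same result using one scalar datapath over len(a) fast-rate cycles."""
    g1, g2, g3 = gains
    tapped_delay = []                      # collects one scalar result per fast cycle
    for cycle in range(len(a)):            # fast clock runs len(a) times per slow step
        x, y, z = a[cycle], b[cycle], c[cycle]   # serializer selects one element
        result = g1 * x + g2 * y + g3 * z        # 3 shared multipliers, 2 shared adders
        tapped_delay.append(result)
    return tapped_delay                    # read out once, back at the slow rate


n = 24                                     # vector length from the example
a = list(range(n))
b = [2 * i for i in range(n)]
c = [i - 5 for i in range(n)]
gains = (3, -1, 2)                         # hypothetical gain values

reference = parallel_step(a, b, c, gains)
streamed = streamed_step(a, b, c, gains)
print(streamed == reference)
```

The scalar datapath is reused on every fast cycle, which is why the operator count drops from one set per element to a single set.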

Delay Balancing and Functional Equivalence

The rate transitions that implement time-multiplexing in the streaming architecture introduce a cycle of additional latency. To maintain functional fidelity, this delay must be balanced across every cut-set that contains this path. When the streaming option is turned on, the coder also turns on the delay balancing option ('BalanceDelays') so that this additional delay is balanced automatically (see the example, Delay Balancing and Validation Model Workflow In HDL Coder™, for more details). The coder also turns on validation model generation so that the user can verify that functional equivalence with the original model is maintained.
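The need for balancing can be seen in a minimal sketch, again in Python and under stated assumptions: two sample sequences feed an adder, and only one of them passes through a streamed region that adds one cycle of latency. The signal names, the sample values, and the `delay` helper are hypothetical and are not part of HDL Coder.

```python
def delay(signal, cycles, initial=0):
    """Delay a sample sequence by `cycles` steps, filling with `initial`."""
    padded = [initial] * cycles + list(signal)
    return padded[:len(signal)]


streamed_path = [1, 2, 3, 4, 5]        # path that picks up one extra cycle
bypass_path = [10, 20, 30, 40, 50]     # parallel path with no extra latency

# Original behavior: both paths arrive at the adder in the same cycle.
reference = [s + b for s, b in zip(streamed_path, bypass_path)]

# Without balancing, samples from different time steps are added together.
unbalanced = [s + b for s, b in zip(delay(streamed_path, 1), bypass_path)]

# With a matching delay on the bypass path, the adder sees aligned samples,
# and the output equals the original output delayed by one cycle.
balanced = [s + b for s, b in zip(delay(streamed_path, 1), delay(bypass_path, 1))]

print(balanced == delay(reference, 1))
print(unbalanced == delay(reference, 1))
```

The balanced output matches the reference shifted by one cycle, which is the kind of equivalence the validation model is meant to check.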

Parameterizability for More Flexibility

By tuning the 'StreamingFactor' parameter, one can explore the design space along the datapath size dimension. A value of 0 (the default) implies no streaming (a fully parallel implementation), and a value of 24 (the full vector length) implies maximal streaming (a fully serial implementation). Values between these two extremes yield implementations that trade area against the speed of the faster clock.

If we set 'StreamingFactor' to 6 in this example model, we get a 4-element vector datapath in the generated HDL. This results in the use of 12 multipliers and 8 adders as shown in the resource report.
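The figures quoted for this model follow a simple pattern that can be tabulated. The sketch below is a Python estimate, not output from HDL Coder: it assumes that the streaming factor divides the vector length evenly and that each gain or add needs one operator per element of the remaining datapath. The function name and its parameters are hypothetical. The resource report remains the authoritative source.

```python
def streamed_resources(vector_len, streaming_factor, num_gains=3, num_adds=2):
    """Estimate datapath width and operator counts for a streaming factor."""
    if streaming_factor < 1 or vector_len % streaming_factor != 0:
        raise ValueError("streaming factor must evenly divide the vector length")
    width = vector_len // streaming_factor      # elements processed in parallel
    return {
        "datapath_width": width,
        "multipliers": num_gains * width,       # one multiplier per gain per lane
        "adders": num_adds * width,             # one adder per add per lane
    }


vector_len = 24
# Candidate streaming factors: divisors of the vector length greater than 1.
table = {
    sf: streamed_resources(vector_len, sf)
    for sf in range(2, vector_len + 1)
    if vector_len % sf == 0
}

for sf, res in table.items():
    print(f"StreamingFactor={sf:2d}  width={res['datapath_width']:2d}  "
          f"multipliers={res['multipliers']:2d}  adders={res['adders']:2d}")
```

For factors of 6 and 24, the estimate reproduces the counts given above (12 multipliers and 8 adders, and 3 multipliers and 2 adders, respectively).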