However, in zipWiths there is no loop—the two input
streams are consumed until either both streams yield
a value, in which case a value is yielded on the output
stream, or until one of the input streams is done producing values. The internal state of the stream associated
with zip Withs contains the state of the two input streams
and a one-item buffer for the value produced by the first
input stream:

Given these definitions, GHC’s call-pattern specialization in concert with the standard inliner suffice to transform
dotp into a single loop that does not produce an intermediate structure. If there is any doubt that this results in efficient machine code, we give the actual assembly language
inner loop output by GHC using the LLVM back end. Stream
fusion preserves the ability to write compositionally without
sacrificing performance:

Though stream fusion does well for the examples we haveshown, it still does not produce efficient implementationsfor many other operations. In particular, the inadequacy ofthe single-value-at-a-time nature of streams becomes par-ticularly problematic when attempting to opportunisticallyutilize the SIMD instructions available on many currentarchitectures, for example, SSE on x86 and NEON on ARM.These instructions operate in parallel on data values thatcontains two (or four or eight, depending on the hardwarearchitecture) floating point numbers at a time. To avoid nota-tional confusion, we call these multi-values, or sometimesjust multis.

To enable sum to use SIMD instructions, we would like a
stream representation that yields multi-values (rather than
scalars), with perhaps a bit of scalar “dribble” at the end of
the stream when the number of scalar values is not divisible
by the size of a multi.

Although a stream of scalar values is useless for SIMD
computation, a stream of multi-values is not quite right
either, because of the “dribble” problem. Perhaps, we could
get away with a stream that yielded either a scalar or a multi
at each step, but this would force all scalar-only operations
to handle an extra case, complicating the implementations
of all operations and making them less efficient. There is a
better way!

3. GENERALIZED STREAM FUSION

We have seen that different stream operations work best
with different stream representations. In this section, we
describe how to incorporate multiple stream representations into the stream fusion framework, elaborate on the
details of a representation that enables SIMD computation with vectors, and show how to use our framework to
transparently take advantage of SIMD instructions in Data
Parallel Haskell programs.

The idea underlying generalized stream fusion is straightforward but its effects are wide-ranging: instead of transforming a function over vectors into a function over streams,
transform it into a function over a bundle of streams. A bundle is a collection of streams, each semantically identical but
with a different cost model, allowing each stream operation
to choose the most advantageous stream representation in
the bundle. We give a simplified version of the Bundle data
type here:

data Bundle a = Bundle

{sSize :: Size

, sElems :: Stream a

, sChunks :: Stream (Chunk a)
, sMultis :: Multis a

}The sElems field of the Bundle data type contains thefamiliar stream of scalar values that we saw in Section 2.The stream of Chunks contained in the sChunks field ofthe record enables the efficient use of bulk memory opera-tions, like vector append, which we do not describe here. Wenext describe the representation contained in the sMultisfield of the record, which enables the efficient use of SSEinstructions.