operations from Haskell and its impact on performance is a deficiency that was brought to our attention
by Lippmeier and Keller. 11 The first step we took was to
write a small number of simple C functions utilizing SSE
intrinsics to serve as benchmarks. This gave us a very
concrete goal—to generate machine code from Haskell
that was competitive with these C implementations. It
is not a coincidence that one of the first such C functions that we wrote was an implementation of the vector
dot product, in both a scalar version and a version using
compiler intrinsics for manual SSE support. We omit
the C versions, but repeat the definition of the Haskell
implementation here:
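(The listing itself appears to have been lost in extraction. The following is a minimal sketch of the one-line definition, written against the standard Data.Vector.Unboxed API; the paper's actual version uses the multi-value SIMD operations of generalized stream fusion, whose names are not shown in this excerpt.)

```haskell
import qualified Data.Vector.Unboxed as V

-- Dot product as pairwise multiplication followed by summation.
-- Stream fusion eliminates the intermediate vector built by zipWith.
ddotp :: V.Vector Double -> V.Vector Double -> Double
ddotp v w = V.sum (V.zipWith (*) v w)
```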

Though not exactly onerous, the C version with SSE support is already unpleasantly more complex than the scalar
version. The Haskell version, consisting of a single line of
code (not including the optional type signature), is certainly
the simplest. Also note that the Haskell programmer can
think compositionally—it is natural to think of dot product
as pairwise multiplication followed by summation. The C
programmer, on the other hand, must manually fuse the
two loops into a single multiply-add. Furthermore, as well
as being constructed compositionally, the Haskell implementation can itself be used compositionally. That is, if the
input vectors to ddotp are themselves the results of vector
computations, generalized stream fusion will potentially
fuse all operations in the chain—not just the dot product’s
zip and fold—into a single loop. In contrast, the C programmer must manifest the input to the C implementation of
ddotp as concrete vectors in memory—there is no potential
for automatic fusion with other operations in the C version.
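To make the compositional claim concrete, consider a hedged sketch (again using standard Data.Vector.Unboxed names rather than the paper's SIMD variants, and with an illustrative function name not taken from the paper) in which one input to the dot product is itself the result of a vector computation:

```haskell
import qualified Data.Vector.Unboxed as V

ddotp :: V.Vector Double -> V.Vector Double -> Double
ddotp v w = V.sum (V.zipWith (*) v w)

-- Hypothetical client of ddotp: scale one input before taking the
-- dot product. Stream fusion removes the intermediate vector that
-- V.map would otherwise allocate, so the map, zipWith, and sum all
-- compile to a single loop over the input.
weightedNorm :: Double -> V.Vector Double -> Double
weightedNorm c v = ddotp (V.map (c *) v) v
```

A C implementation of the same computation would either allocate the scaled vector in memory before calling ddotp, or require the programmer to fuse the loops by hand.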

Figure 2 compares the single-threaded performance of several implementations of the dot product, including C and Haskell versions that only use scalar operations, as well as the implementation provided by GotoBLAS2 1.13.5, 6 Times were measured on a 3.40 GHz Intel i7-2600K processor, averaged over 100 runs. To make the relative performance of the various implementations clearer, we show the execution time of each implementation relative to the scalar C version, which is normalized to 1.0, in Figure 3.

Surprisingly, both the naive scalar C implementation
and the version written using SSE intrinsics perform
approximately the same. This is because GCC automatically vectorizes the scalar implementation. However, the
Haskell implementation is almost always faster than both
C versions; it is 5–20% slower for very short vectors (those
with fewer than about 16 elements) and 1–2% slower just at
the point where the working set size exceeds the capacity of
the L1 cache. Not only does Haskell outperform C on this
benchmark, but it outperforms GCC’s vectorizer. Once the
working set no longer fits in L3 cache, the Haskell implementation is even neck-and-neck with the implementation of
ddotp from GotoBLAS, a collection of highly tuned BLAS routines hand-written in assembly language that is generally considered one of the fastest BLAS implementations available.

5.1. Prefetching and loop unrolling

Why is Haskell so fast? Because in addition to leveraging
loop fusion and a careful choice of representation, we have
also exploited the high-level stream-fusion framework to
embody two additional optimizations: loop unrolling and
prefetching.

The generalized stream fusion framework allowed us
to implement the equivalent of loop unrolling by adding
under 200 lines of code to the vector library. We changed
the MultisC data type to incorporate a leap, which is a Step
that contains multiple values of type Multi a. We chose Leap
to contain four values—so loops are unrolled four times—
since on x86-64 processors this tends not to put too much