Status of DPH Benchmarks

Overview of the benchmark programs

SumSq: Computes the sum of the squares from 1 to N using Int. There are two variants of this program: (1) "primitives" is coded directly against the array primitives from package dph and (2) "vectorised" is a high-level DPH program transformed by GHC's vectoriser. As a reference implementation, we have a sequential C program denoted by "ref C".
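
For reference, the high-level program that the vectoriser consumes is essentially a one-liner. The following is only a sketch: pragma and module names follow later spellings of GHC and dph and may differ from the exact benchmark source.

    {-# LANGUAGE ParallelArrays #-}
    {-# OPTIONS_GHC -fvectorise #-}
    module SumSq (sumSq) where

    import Data.Array.Parallel
    import Data.Array.Parallel.Prelude.Int

    -- Sum of the squares of 1..n; the vectoriser transforms this into
    -- calls to the flat array primitives of the dph backend packages.
    sumSq :: Int -> Int
    sumSq n = sumP (mapP (\x -> x * x) (enumFromToP 1 n))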

DotP: Computes the dot product of two vectors of Doubles. There are two variants of this program: (1) "primitives" is coded directly against the array primitives from package dph and (2) "vectorised" is a high-level DPH program transformed by GHC's vectoriser. In addition to these two DPH variants of the dot product, we also have two non-DPH reference implementations: (a) "ref Haskell" is a Haskell program using imperative, unboxed arrays and (b) "ref C" is a C implementation using pthreads.
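
The high-level DPH variant is equally small; the following sketch uses sumP and zipWithP as in the DPH prelude, with arithmetic on Doubles coming from Data.Array.Parallel.Prelude.Double:

    -- Dot product of two parallel arrays of Doubles.
    dotp :: [:Double:] -> [:Double:] -> Double
    dotp xs ys = sumP (zipWithP (*) xs ys)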

SMVM: Multiplies a dense vector with a sparse matrix represented in the compressed sparse row format (CSR). There are two variants of this program: (1) "primitives" is coded directly against the array primitives from package dph and (2) "vectorised" is a high-level DPH program transformed by GHC's vectoriser. As a reference implementation, we have a sequential C program denoted by "ref C".
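
The vectorised variant is the textbook nested-data-parallel formulation of CSR: the matrix is a parallel array of rows, each row a parallel array of (column index, value) pairs. A sketch (combinator names as in the DPH prelude; the benchmark source may differ in details):

    type SparseMatrix = [:[:(Int, Double):]:]   -- one inner array per matrix row

    -- For each row, multiply the stored values with the corresponding
    -- vector elements and sum; the vectoriser flattens the nesting.
    smvm :: SparseMatrix -> [:Double:] -> [:Double:]
    smvm m v = [: sumP [: x * (v !: i) | (i, x) <- row :] | row <- m :]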

Primes: The Sieve of Eratosthenes using parallel writes into a sieve structure represented as an array of Bools. We currently don't have a proper parallel implementation of this benchmark, as we are missing a parallel version of default backpermute. The problem is that we need to make the representation of parallel arrays of Bool dependent on whether the hardware supports atomic writes of bytes. We still need to investigate whether any of the architectures relevant for DPH actually have trouble with atomic writes of bytes (i.e., of Word8 values).
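
To make the missing piece concrete, the following sketch shows the sieve step we would like to write. The combinator bpermuteDefP is hypothetical (it is exactly the parallel default backpermute we are missing); the remaining names follow the DPH prelude and the array enumeration syntax of the DPH papers.

    -- Hypothetical combinator (the missing parallel default backpermute):
    -- start from the first array and, in parallel, write each element of
    -- the third array at the corresponding index from the second.
    --   bpermuteDefP :: [:e:] -> [:Int:] -> [:e:] -> [:e:]

    -- One sieve step: knock out all multiples of the known primes ps in a
    -- sieve of size n. Every write stores a single Bool (one byte), which
    -- is where the question of atomic byte writes arises.
    sieveStep :: Int -> [:Int:] -> [:Bool:]
    sieveStep n ps = bpermuteDefP (replicateP n True) is
                                  (replicateP (lengthP is) False)
      where
        is = concatP [: [: 2*p, 3*p .. n-1 :] | p <- ps :]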

QuickHull: Given a set of points in the plane, compute the convex hull, i.e., the sequence of points that encloses all points in the set. There is only a vectorised version. It currently doesn't work due to bugs in dph-par, and it still needs a wrapper using the new benchmark framework to generate test input and time execution.
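
For orientation, the core of the program is the recursive hsplit function; the sketch below follows the formulation in the DPH papers (maxIndexP and friends as in the DPH prelude) rather than the exact benchmark source. The hull is obtained by applying hsplit to the two extreme points along the x-axis, once in each direction.

    type Point = (Double, Double)

    -- All points strictly to the left of the directed line p1-p2,
    -- split recursively at the point farthest from that line.
    hsplit :: [:Point:] -> Point -> Point -> [:Point:]
    hsplit points p1 p2
      | lengthP packed == 0 = [:p1:]
      | otherwise = concatP [: hsplit packed p1 pm, hsplit packed pm p2 :]
      where
        cs     = [: cross p1 p2 p | p <- points :]
        packed = [: p | (p, c) <- zipP points cs, c > 0 :]
        pm     = points !: maxIndexP cs

    -- Twice the signed area of the triangle (p1, p2, p); positive iff
    -- p lies to the left of the directed line p1-p2.
    cross :: Point -> Point -> Point -> Double
    cross (x1, y1) (x2, y2) (x, y) = (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)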

ConComp: Implementation of the Awerbuch-Shiloach and Hybrid algorithms for finding connected components in undirected graphs. There is only a version coded directly against the array primitives. It still needs to be adapted to the new benchmark framework.

Execution on LimitingFactor

All results are in milliseconds, and the triples report best/average/worst execution time (wall clock) of three runs. The column marked "sequential" reports times when linked against dph-seq, and the columns marked "P=n" report times when linked against dph-par and run in parallel using the specified number of OS threads.

Comments regarding SumSq

The versions compiled against dph-par are slower by a factor of two than the ones linked against dph-seq.

However, we found a number of more general problems while working on this example:

We need to pass an extra -funfolding-use-threshold flag to GHC. We don't really want users to have to worry about such flags.
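
For illustration, building the vectorised benchmark currently requires an invocation along these lines; the threshold value is indicative only, and the exact flag spellings vary between GHC versions:

    ghc -O2 -fvectorise -fdph-par -funfolding-use-threshold=30 --make SumSq.hs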

mapP (\x -> x * x) xs essentially turns into zipWithU (*) xs xs, which no longer fuses with enumFromTo. We have a rewrite rule in the library to fix that (see the sketch below), but it is not general enough. We would really rather not vectorise the lambda abstraction at all.
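
The rule in question has roughly the following shape. This is a sketch only; the rule actually shipped in the library is phrased in terms of the stream-based versions of these combinators and may differ:

    {-# RULES
    "zipWithU/same-array" forall f xs.
      zipWithU f xs xs = mapU (\x -> f x x) xs
      #-}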

enumFromTo doesn't fuse due to excessive dictionaries in the unfolding of zipWithUP.

Finally, to achieve the current result, we needed an analysis that avoids vectorising subcomputations that don't need to be vectorised and that, to make matters worse, fusion would otherwise have to turn back into their original form; in this case, the lambda abstraction \x -> x * x. This is currently implemented in a rather limited and ad-hoc way. We should implement it on the basis of a more general analysis.

Comments regarding DotP

Performance is memory bound: each vector element is loaded once and used in only a single multiply-add, so the benchmark stops scaling once the memory bus is saturated. As a consequence, the wall-clock execution times of the Haskell programs and of the C reference implementation are the same when all available parallelism is exploited. In this benchmark, the parallel DPH library delivers the same single-core performance as the sequential one.

Comments regarding SMVM

"SMVM, vectorised" needs a lot of tinkering in the form of special rules at the moment and forcing particular inlines. We need more expressive rewrite rules; in particular, we need these more expressive rules to express important rewrites for the replicate combinator in its various forms and to optimise shape computations that enable other optimisations.

Moreover, "SMVM, primitives" & "SMVM, vectorised" exhibit a strange behaviour from 2 to 4 threads with the matrix of density 0.001. This might be a scheduling problem.

Execution on greyarea (1x UltraSPARC T2)

Software spec: GHC 6.11 (from the first week of March 2009) with gcc 4.1.2 for Haskell code; gccfss 4.0.4 (gcc front-end with Sun compiler back-end) for C code, as it generates code that is more than twice as fast for numeric computations as vanilla gcc.

Program            | Problem size               | sequential | P=1       | P=2       | P=4       | P=8     | P=16    | P=32    | P=64
-------------------+----------------------------+------------+-----------+-----------+-----------+---------+---------+---------+--------
SumSq, primitives  | 10M                        | 212/212    | 254/254   | 127/127   | 64/64     | 36/36   | 25/25   | 17/17   | 10/10
SumSq, vectorised  | 10M                        | 212/212    | 254/254   | 128/128   | 64/64     | 32/32   | 25/25   | 17/17   | 10/10
SumSq, ref C       | 10M                        | 120        | –         | –         | –         | –       | –       | –       | –
DotP, primitives   | 100M elements              | 937/937    | 934/934   | 474/474   | 238/238   | 120/120 | 65/65   | 38/38   | 28/28
DotP, vectorised   | 100M elements              | 937/937    | 942/942   | 471/471   | 240/240   | 118/118 | 65/65   | 43/43   | 29/29
DotP, ref Haskell  | 100M elements              | –          | 934       | 467       | 238       | 117     | 61      | 65      | 36
DotP, ref C        | 100M elements              | –          | 554       | 277       | 142       | 72      | 37      | 22      | 20
SMVM, primitives   | 10kx10k @ density 0.1      | 1102/1102  | 1112/1112 | 561/561   | 285/285   | 150/150 | 82/82   | 63/70   | 54/100
SMVM, vectorised   | 10kx10k @ density 0.1      | 1784/1784  | 1810/1810 | 910/910   | 466/466   | 237/237 | 131/131 | 96/96   | 87/87
SMVM, ref C        | 10kx10k @ density 0.1      | 580        | –         | –         | –         | –       | –       | –       | –
SMVM, primitives   | 100kx100k @ density 0.001  | 1112/1112  | 1299/1299 | 684/684   | 653/653   | 368/368 | 294/294 | 197/197 | 160/160
SMVM, vectorised   | 100kx100k @ density 0.001  | 1824/1824  | 2008/2008 | 1048/1048 | 1010/1010 | 545/545 | 426/426 | 269/269 | 258/258
SMVM, ref C        | 100kx100k @ density 0.001  | 600        | –         | –         | –         | –       | –       | –       | –

All results are in milliseconds, and the pairs report best/worst execution time (wall clock) of three runs. The column marked "sequential" reports times when linked against dph-seq, and the columns marked "P=n" report times when linked against dph-par and run in parallel using the specified number of OS threads.

Comments regarding SumSq

As on LimitingFactor.

Comments regarding DotP

The benchmark scales nicely up to the maximum number of hardware threads; memory latency is largely covered by the excess parallelism. It is unclear why the Haskell reference implementation "ref Haskell" falls off at 32 and 64 threads. See also a comparison graph between LimitingFactor and greyarea.

Comments regarding SMVM

As on LimitingFactor, but the benchmark scales much more nicely here, improving up to four threads per core. This suggests that memory bandwidth is again a critical factor in this benchmark, which fits well with earlier observations on other architectures.

On this machine, "SMVM, primitives" and "SMVM, vectorised" also show the quirk when going from 2 to 4 threads. This reinforces the suspicion that it is a scheduling problem.

Summary

The speedup relative to a sequential C program for SumSq, DotP, and SMVM on both architectures is illustrated by this graph. In all cases, the data-parallel Haskell program outperforms the sequential C program by a large margin on 8 cores.