x86 assembly code

The extra memory operations are there because the LLVM optimiser doesn't know that the two ByteArrays I'm using to fake imperative variables don't alias. This should turn into optimal code once it uses pure accumulators.

Wednesday, May 1, 2013

Repa 4 will include a GHC plugin that performs array fusion using a version of Richard Waters's series expressions system, extended to support the segmented operators we need for Data Parallel Haskell.

The plugin converts GHC core code to Disciple Core, performs the fusion transform, and then converts back to GHC core. We're using Disciple Core for the main transform for two reasons: it has a simple (and working) external core format, and the core language AST, along with most of the core-to-core transforms, is parametric in the type used for names. This second feature is important because we use a version of Disciple Core where the main array combinators (map, fold, filter, etc.) are primitive operators rather than regular function bindings.
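
To make the point about name-parametric ASTs concrete, here is a minimal sketch. It is not the actual Disciple Core AST: the types `Exp` and `Name` and the functions `detect` and `detectPrims` are hypothetical names invented for illustration, and the sketch ignores shadowing of combinator names by local binders.

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- | A toy lambda-calculus AST, parametric in the name type @n@.
--   (Hypothetical; the real Disciple Core AST is much richer.)
data Exp n
  = XVar n
  | XApp (Exp n) (Exp n)
  | XLam n (Exp n)
  deriving (Show, Eq, Functor)

-- | A hypothetical name type in which array combinators are primitives
--   rather than ordinary variables.
data Name
  = NameVar String
  | NameFold
  | NameMap
  deriving (Show, Eq)

-- | Promote known combinator names to primitives; leave the rest as variables.
detect :: String -> Name
detect "fold" = NameFold
detect "map"  = NameMap
detect v      = NameVar v

-- | Because the AST is a Functor in its names, switching name types is fmap.
detectPrims :: Exp String -> Exp Name
detectPrims = fmap detect
```

For instance, `detectPrims (XApp (XVar "fold") (XVar "s"))` yields `XApp (XVar NameFold) (XVar (NameVar "s"))`. Transforms written against `Exp n` can then be reused for either name type.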

The fusion transform is almost (almost) working for some simple examples. Here is the compilation sequence:

Haskell Source

    process :: Stream k Int -> Int
    process s
     = fold (+) 0 s + fold (*) 1 s

In this example we're performing two separate reductions over the same stream. None of the existing short-cut fusion approaches can handle this properly. Stream fusion in Data.Vector, Repa-style delayed array fusion, and build/foldr fusion will all read the stream elements from memory twice (if they are already manifest) or compute them twice (if they are delayed). We want to compute both reductions in a single pass.
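
As a rough illustration of the goal, the sketch below contrasts the two-pass program with a hand-fused single pass. It uses plain lists as a stand-in for `Stream k Int`, and the names `processTwoPass` and `processOnePass` are made up for this example; it shows the shape of the result we want, not how the plugin produces it.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- | Two separate reductions: the list is traversed twice.
processTwoPass :: [Int] -> Int
processTwoPass xs = foldl' (+) 0 xs + foldl' (*) 1 xs

-- | Hand-fused version: one traversal that carries both accumulators.
processOnePass :: [Int] -> Int
processOnePass = go 0 1
  where
    -- s is the running sum, p is the running product.
    go !s !p []       = s + p
    go !s !p (y : ys) = go (s + y) (p * y) ys
```

Both functions agree on every input; for `[1, 2, 3, 4]` the sum is 10 and the product is 24, so each returns 34.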

Raw GHC core converted to DDC core

    repa_process_r2 : [k_aq0 : Data].Stream_r0 k_aq0 Int_3J -> Int_3J
      = \(k_c : *_34d).\(s_aqe : Stream_r0 k_c Int_3J).
        $fNumInt_$c+_rjF
          (fold_r34 [k_c] [Int_3J] [Int_3J] $fNumInt_$c+_rjF
            (I#_6d 0i#) s_aqe)
          (fold_r34 [k_c] [Int_3J] [Int_3J] $fNumInt_$c*_rjE
            (I#_6d 1i#) s_aqe)

Detect array combinators from GHC core, and convert to DDC primops

    repa_process_r2 : [k_aq0 : Rate].Stream# k_aq0 Int# -> Int#
      = /\(k_c : Rate).
         \(s_aqe : Stream# k_c Int#).
        add# [Int#]
          (fold# [k_c] [Int#] [Int#] (add# [Int#]) 0i# s_aqe)
          (fold# [k_c] [Int#] [Int#] (mul# [Int#]) 1i# s_aqe)

Normalize and shift array combinators to top-level

All array combinators are used in their own binding.

    repa_process_r2 : [k_aq0 : Rate].Stream# k_aq0 Int# -> Int#
      = /\(k_c : Rate).
         \(s_aqe : Stream# k_c Int#).
        let x0 = add# [Int#] in
        let x1 = fold# [k_c] [Int#] [Int#] x0 0i# s_aqe in
        let x2 = mul# [Int#] in
        let x3 = fold# [k_c] [Int#] [Int#] x2 1i# s_aqe in
        add# [Int#] x1 x3

Inline and eta-expand worker functions

This puts the program in the correct form for the next phase.

    repa_process_r2 : [k_aq0 : Rate].Stream# k_aq0 Int# -> Int#
      = /\(k_c : Rate).
         \(s_aqe : Stream# k_c Int#).
        let x1
              = fold# [k_c] [Int#] [Int#]
                  (\(x0 x1 : Int#). add# [Int#] x0 x1) 0i# s_aqe in
        let x3
              = fold# [k_c] [Int#] [Int#]
                  (\(x2 x3 : Int#). mul# [Int#] x2 x3) 1i# s_aqe in
        add# [Int#] x1 x3

Do the lowering transform

This is the main pass that performs array fusion. Note that we've introduced a single loop# that computes both of the fold# results.

    repa_process_r2 : [k_c : Rate].Stream# k_c Int# -> Int#
      = /\(k_c : Rate).
         \(s_aqe : Stream# k_c Int#).
        let x1_acc : Ref# Int# = new# [Int#] 0i# in
        let x3_acc : Ref# Int# = new# [Int#] 1i# in
        let _ : Unit
              = loop# (lengthOfRate# [k_c])
                  (\(x0 : Nat#).
                   let x1 : Int# = next# [Int#] [k_c] s_aqe x0 in
                   let x0 : Int# = read# [Int#] x1_acc in
                   let _ : Void#
                         = write# [Int#] x1_acc (add# [Int#] x0 x1) in
                   let x2 : Int# = read# [Int#] x3_acc in
                   let _ : Void#
                         = write# [Int#] x3_acc (mul# [Int#] x2 x1) in
                   ()) in
        let x1 : Int# = read# [Int#] x1_acc in
        let x3 : Int# = read# [Int#] x3_acc in
        add# [Int#] x1 x3
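
For readers more comfortable with source Haskell than with core, the lowered program has roughly the shape sketched below. This is an analogy only: it uses a list in place of the stream, `STRef` in place of `Ref#`, and `forM_` in place of `loop#` and `next#`, and the name `processLowered` is invented for the example.

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (runST)
import Data.STRef (newSTRef, readSTRef, writeSTRef)

-- | One loop updating two mutable accumulators, then combining them.
processLowered :: [Int] -> Int
processLowered xs = runST $ do
    sumAcc  <- newSTRef 0              -- plays the role of x1_acc
    prodAcc <- newSTRef 1              -- plays the role of x3_acc
    forM_ xs $ \x -> do                -- plays the role of loop# plus next#
        s <- readSTRef sumAcc
        writeSTRef sumAcc $! s + x     -- accumulate the sum
        p <- readSTRef prodAcc
        writeSTRef prodAcc $! p * x    -- accumulate the product
    s <- readSTRef sumAcc
    p <- readSTRef prodAcc
    pure (s + p)
```

Each element is read once and feeds both accumulators, which is the single-pass behaviour the two `fold#` calls could not provide on their own.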

Assign imperative variable storage to arrays

We need to convert the code back to GHC core, but we don't want to use IORefs because they can't hold unboxed values (of types like Int#). Instead, we allocate new arrays to hold these values.

    repa_process_r2 : [k_c : Rate].Stream# k_c Int# -> Int#
      = /\(k_c : Rate).
         \(s_aqe : Stream# k_c Int#).
        let x1_acc : Array# Int# = newArray# [Int#] 8# in
        let _ : Void# = writeArray# [Int#] x1_acc 0# 0i# in
        let x3_acc : Array# Int# = newArray# [Int#] 8# in
        let _ : Void# = writeArray# [Int#] x3_acc 0# 1i# in
        let _ : Unit
              = loop# (lengthOfRate# [k_c])
                  (\(x0 : Nat#).
                   let x1 : Int# = next# [Int#] [k_c] s_aqe x0 in
                   let x0 : Int# = readArray# [Int#] x1_acc 0# in
                   let _ : Void#
                         = writeArray# [Int#] x1_acc 0#
                             (add# [Int#] x0 x1) in
                   let x2 : Int# = readArray# [Int#] x3_acc 0# in
                   let _ : Void#
                         = writeArray# [Int#] x3_acc 0#
                             (mul# [Int#] x2 x1) in
                   ()) in
        let x1 : Int# = readArray# [Int#] x1_acc 0# in
        let x3 : Int# = readArray# [Int#] x3_acc 0# in
        add# [Int#] x1 x3
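
The same idea can be tried in ordinary Haskell with unboxed mutable arrays. The sketch below assumes the `array` package that ships with GHC and uses a one-element `STUArray` per accumulator, so the values are stored unboxed rather than behind a pointer. The names `processArrays` and `newCell` are invented for the example, and a list again stands in for the stream.

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newArray, readArray, writeArray)

-- | Accumulators live in one-element unboxed arrays, always at index 0.
processArrays :: [Int] -> Int
processArrays xs = runST $ do
    sumAcc  <- newCell 0
    prodAcc <- newCell 1
    forM_ xs $ \x -> do
        s <- readArray sumAcc 0
        writeArray sumAcc 0 (s + x)
        p <- readArray prodAcc 0
        writeArray prodAcc 0 (p * x)
    s <- readArray sumAcc 0
    p <- readArray prodAcc 0
    pure (s + p)
  where
    -- Allocate a one-element unboxed array holding the initial value.
    newCell :: Int -> ST s (STUArray s Int Int)
    newCell = newArray (0, 0)
```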

Thread state token through effectful primops

The lowered code is naturally imperative, and GHC uses state threading to represent this.
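
To show what state threading looks like, here is a hand-written sketch using GHC's primops from `GHC.Exts`. It is not the plugin's output. It assumes a 64-bit platform (so an `Int#` fits in 8 bytes), uses a list in place of the stream, and the names `processThreaded`, `body` and `loop` are invented for the example. Each effectful primop takes the current `State#` token and returns a new one, which fixes the order of the reads and writes.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import Control.Monad.ST (runST)
import GHC.Exts
  ( Int (I#), MutableByteArray#, State#
  , newByteArray#, readIntArray#, writeIntArray#, (+#), (*#) )
import GHC.ST (ST (ST))

processThreaded :: [Int] -> Int
processThreaded xs = runST (ST body)
  where
    -- Allocate both accumulators, run the loop, then read the results.
    body :: State# s -> (# State# s, Int #)
    body s0 =
      case newByteArray# 8# s0 of { (# s1, sumAcc #) ->
      case writeIntArray# sumAcc 0# 0# s1 of { s2 ->
      case newByteArray# 8# s2 of { (# s3, prodAcc #) ->
      case writeIntArray# prodAcc 0# 1# s3 of { s4 ->
      case loop sumAcc prodAcc xs s4 of { s5 ->
      case readIntArray# sumAcc 0# s5 of { (# s6, a #) ->
      case readIntArray# prodAcc 0# s6 of { (# s7, b #) ->
      (# s7, I# (a +# b) #) }}}}}}}

    -- One pass over the input, threading the state token through each primop.
    loop :: MutableByteArray# s -> MutableByteArray# s
         -> [Int] -> State# s -> State# s
    loop _ _ [] s = s
    loop sumAcc prodAcc (I# x : rest) s =
      case readIntArray# sumAcc 0# s of { (# s1, a #) ->
      case writeIntArray# sumAcc 0# (a +# x) s1 of { s2 ->
      case readIntArray# prodAcc 0# s2 of { (# s3, p #) ->
      case writeIntArray# prodAcc 0# (p *# x) s3 of { s4 ->
      loop sumAcc prodAcc rest s4 }}}}
```

Because every step consumes the token produced by the previous one, the compiler cannot reorder or duplicate the memory operations, even though the code is written as nested pure `case` expressions.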