Numeric Haskell: A Repa Tutorial

Repa is a Haskell library for
high performance, regular, multi-dimensional parallel arrays. All
numeric data is stored unboxed and functions written with the Repa
combinators are automatically parallel (provided you supply "+RTS -N" on
the command line when running the program).

This document provides a tutorial on array programming in Haskell using the Repa package and is based on version 3.2 of the library. Repa versions earlier than 3.0 use a different API and are not compatible with this tutorial.

Type indexes

Repa 3 introduced a notion of type index:

data family Array rep sh e

In the above definition, rep is the type index: a data type name that determines the underlying array representation. This makes it possible to specify an array's representation explicitly at the type level. Repa distinguishes ten different representations, which fall into three categories: manifest, delayed and meta.

Manifest arrays hold concrete values stored in memory, e.g. as unboxed vectors (type index U) or strict bytestrings (B). A delayed array (D) is a function from an index to a value. Every time an element is requested from a delayed array it is calculated anew, which means that delayed arrays are inefficient when the same data is needed multiple times. On the other hand, delaying an array enables optimisation by fusion. Meta arrays provide additional information about the underlying array data, which can be used for load balancing or for speeding up the calculation of border effects. See the paper Guiding Parallel Array Fusion with Indexed Types for a detailed explanation of each representation.
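To make this concrete, here is a minimal sketch (assuming the repa package is installed) that moves a small array between the U and D representations:

```haskell
import Data.Array.Repa

-- a manifest array: concrete unboxed values in memory
u :: Array U DIM1 Int
u = fromListUnboxed (Z :. (3 :: Int)) [1, 2, 3]

-- mapping produces a delayed array: just a function from index to value
d :: Array D DIM1 Int
d = Data.Array.Repa.map (* 2) u

-- computing evaluates every element, giving a manifest array again
m :: Array U DIM1 Int
m = computeS d

main :: IO ()
main = print (toList m)   -- [2,4,6]
```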

Array shapes

Before we can get started manipulating arrays, we need a grasp of repa's notion of array shape. Much like the classic 'array' library in Haskell, repa-based arrays are parameterized via a type which determines the dimension of the array, and the type of its index. However, while classic arrays take tuples to represent multiple dimensions, Repa arrays use a richer type language for describing multi-dimensional array indices and shapes (technically, a heterogeneous snoc list).

Shape types are built somewhat like lists. The constructor Z corresponds
to a rank zero shape, and is used to mark the end of the list. The :. constructor adds additional dimensions to the shape. So, for example, the type:

Array U (Z :. Int :. Int) Double

is the type of a two-dimensional array of unboxed doubles, indexed via Int keys, while

Array U Z Double

is a zero-dimension object (i.e. a point) holding an unboxed Double.

Many operations over arrays are polymorphic in the shape / dimension
component. Others require operating on the shape itself, rather than
the array. A typeclass, Shape, lets us operate uniformly
over arrays with different shapes.

Building shapes

To build values of shape type, we can use the Z and :. constructors. Open the ghci and import Repa:

ghci> :m +Data.Array.Repa
ghci> Z    -- the zero-dimension
Z

For arrays of non-zero dimension, we must give a size. Note: a common error is to leave off the type of the size.

ghci> :t Z :. 10
Z :. 10 :: Num head => Z :. head

leading to annoying type errors about unresolved instances, such as:

No instance for (Shape (Z :. head0))

To select the correct instance, you will need to annotate the size literals with their concrete type:

ghci> :t Z :. (10 :: Int)
Z :. (10 :: Int) :: Z :. Int

is the shape of 1D arrays of length 10, indexed via Ints.

Given an array, you can always find its shape by calling extent.

Additional convenience types for selecting particular parts of a shape are also provided (All, Any, Slice etc.) which are covered later in the tutorial.
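As a small taste ahead of that section, here is a sketch of slicing a row out of a 2D array with Any and All (assuming repa is installed; slice is re-exported by Data.Array.Repa):

```haskell
import Data.Array.Repa

x :: Array U DIM2 Int
x = fromListUnboxed (Z :. (2 :: Int) :. (3 :: Int)) [1 .. 6]

-- Any matches the leading dimensions, the Int literal fixes the row,
-- and All keeps the column dimension, so the result is one-dimensional.
row1 :: Array U DIM1 Int
row1 = computeS (slice x (Any :. (1 :: Int) :. All))

main :: IO ()
main = print (toList row1)   -- [4,5,6]
```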

Working with shapes

That one key operation, extent, gives us many attributes of an array:

-- Extract the shape of the array
extent :: (Shape sh, Source r e) => Array r sh e -> sh

So, given a 3x3x3 array, of type Array U DIM3 Int, we can:

-- build an array
ghci> let x :: Array U DIM3 Int; x = fromListUnboxed (Z :. (3::Int) :. (3::Int) :. (3::Int)) [1..27]
ghci> :t x
x :: Array U DIM3 Int

-- query the extent
ghci> extent x
((Z :. 3) :. 3) :. 3

-- compute the rank (number of dimensions)
ghci> let sh = extent x
ghci> rank sh
3

-- compute the size (total number of elements)
ghci> size sh
27

-- extract the elements of the array as a flat vector
ghci> toUnboxed x
fromList [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27] :: Data.Vector.Unboxed.Base.Vector Int

Generating arrays

New repa arrays ("arrays" from here on) can be generated in many ways, and we always begin by importing the Data.Array.Repa module:

import Data.Array.Repa

The v function creates a vector filled with pixel data (RGBA values); ptr2repa takes a Ptr, converts it to a ForeignPtr, and uses it to copy the data into a 3D Repa array (the first two dimensions correspond to the pixel location, the third to the colour channel). In the main function we use the unsafeWith function on storable vectors: it passes a Ptr to the vector's internal data to a function operating on that Ptr (which must not modify the data). In our case that function is ptr2repa, which returns the image data wrapped in an array. That array is then passed to the writeImage function. RGBA is a data constructor of the Image type, which allows writeImage to interpret the array data correctly. runIL is a wrapper around the IO monad that guarantees the IL library is properly initialised. The result is then written to disk as an image.
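The original code is not reproduced here, but a minimal sketch of a ptr2repa-style conversion might look like the following. fromForeignPtr comes from repa's Data.Array.Repa.Repr.ForeignPtr module; the function name and the fixed height/width/channel layout are assumptions for illustration, and note that fromForeignPtr wraps the existing memory rather than copying it:

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Marshal.Array (withArray)
import Foreign.Ptr (Ptr)
import Data.Array.Repa
import Data.Array.Repa.Repr.ForeignPtr (F, fromForeignPtr)

-- Wrap raw RGBA pixel data as a 3D Repa array: height x width x 4 channels.
-- newForeignPtr_ attaches no finaliser, so the caller must keep the
-- memory alive for as long as the array is used.
ptr2repa :: Int -> Int -> Ptr Word8 -> IO (Array F DIM3 Word8)
ptr2repa h w p = do
    fp <- newForeignPtr_ p
    return $ fromForeignPtr (Z :. h :. w :. 4) fp

main :: IO ()
main =
    -- a tiny 2x2 "image" of consecutive byte values, for demonstration
    withArray [0 .. 15 :: Word8] $ \p -> do
        arr <- ptr2repa 2 2 p
        print (arr ! (Z :. 1 :. 1 :. 3))   -- the last byte, 15
```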

Indexing arrays

To access elements in repa arrays, you provide an array and a shape, to access the element:

(!) :: (Shape sh, Source r e) => Array r sh e -> sh -> e

Indices start with 0. So:

ghci> let x = fromListUnboxed (Z :. (10::Int)) [1..10]
ghci> x ! (Z :. 2)
3.0

Note that we can't give just a bare literal as the shape, even for one-dimensional arrays:

ghci> x ! 2

No instance for (Num (Z :. Int)) arising from the literal `2'

as the Z type isn't in the Num class, and Haskell's numeric literals are overloaded.

To display the array y, it must be turned into a manifest array by computing all of its values. This can be done either with computeP, which evaluates the array in parallel, or computeS, which evaluates it sequentially. Looking at the type of computeP:

computeP :: (Monad m, Source r2 e, Load r1 sh e, Target r2 e)
         => Array r1 sh e -> m (Array r2 sh e)

we can see that the result is enclosed within a monad. The reason is that monads provide a well-defined notion of sequence, so computeP enforces completion of the parallel evaluation at a particular point of the monadic computation. This prevents nested data parallelism, which is not supported by Repa. We don't really care about the monadic effect here, so we can take the manifest array out of the monad after computeP is done.

There are four fold operations in the Repa library: foldP, foldS, foldAllP and foldAllS. The first two reduce the inner dimension of the array; the last two reduce the whole array to a single value. The P and S suffixes stand for parallel and sequential evaluation. The first parameter of the parallel folds must be an associative binary operator, and the starting element must be the neutral element of that operator. This is required to ensure the correctness of parallel evaluation.
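A small sketch of the sequential variants (assuming repa is installed):

```haskell
import Data.Array.Repa

x :: Array U DIM2 Int
x = fromListUnboxed (Z :. (2 :: Int) :. (3 :: Int)) [1 .. 6]

-- foldS reduces the inner (rightmost) dimension: one sum per row
rowSums :: Array U DIM1 Int
rowSums = foldS (+) 0 x

-- foldAllS reduces the whole array to a single value
total :: Int
total = foldAllS (+) 0 x

main :: IO ()
main = do
    print (toList rowSums)   -- [6,15]
    print total              -- 21
```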

If the extent of the two array arguments differ, then the resulting array's extent is their intersection.

Mapping, with indices

A very powerful operator is traverse, an array transformation which also supplies the current index:

traverse :: (Shape sh', Shape sh, Source r a)
         => Array r sh a            -- Source array
         -> (sh -> sh')             -- Function to produce the extent of the result.
         -> ((sh -> a) -> sh' -> b) -- Function to produce elements of the result.
                                    --   It is passed a lookup function to
                                    --   get elements of the source.
         -> Array D sh' b

This is quite a complicated type, because it is very general. Let's take it apart. The first argument is the source array, which is obvious. The second argument is a function that transforms the shape of the input array to yield the shape of the output array. If the arrays are the same size, this function is id; it might also grow or shrink the shape.

Finally, the third argument is where the magic is. It is a function (often an anonymous one) that takes two parameters: a function that looks up elements of the original array, and the location of the currently processed element in the new array. Based on these two parameters, the value of the element in the resulting array is determined. Note that traverse goes through all elements of the resulting array, and it is up to you how to obtain values of the original array from an index into the new array.

traverse returns a delayed array, which means we will use computeP to evaluate its values and display them on the screen.

So we see this generalizes map to support indexes, and optional inspection of the current element. Let's try some examples:

ghci> let x = fromListUnboxed (Z :. (3::Int) :. (3::Int) :. (3::Int)) [1..27] :: Array U DIM3 Int
ghci> x
AUnboxed (((Z :. 3) :. 3) :. 3) (fromList [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27])

-- Keeping the shape the same, and just overwriting elements.
-- Use `traverse` to set all elements to their `x` axis:
ghci> computeP $ traverse x id (\_ (Z :. i :. j :. k) -> i) :: IO (Array U DIM3 Int)
AUnboxed (((Z :. 3) :. 3) :. 3) (fromList [0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2])

-- Shuffle elements around, based on their index.
-- Rotate elements by swapping elements from rotated locations:
ghci> computeP $ traverse x id (\f (Z :. i :. j :. k) -> f (Z :. j :. k :. i)) :: IO (Array U DIM3 Int)
AUnboxed (((Z :. 3) :. 3) :. 3) (fromList [1,4,7,10,13,16,19,22,25,2,5,8,11,14,17,20,23,26,3,6,9,12,15,18,21,24,27])

-- Take a slice of one dimension. The resulting shape of the array changes:
ghci> computeP $ traverse x (\(e :. _) -> e) (\f (Z :. i :. j) -> f (Z :. i :. j :. 0)) :: IO (Array U DIM2 Int)
AUnboxed ((Z :. 3) :. 3) (fromList [1,4,7,10,13,16,19,22,25])

Numeric operations: negation, addition, subtraction, multiplication

To perform basic numeric operations on Repa arrays you must use the specialised operators (+^) (addition), (-^) (subtraction), (*^) (multiplication) and (/^) (division). All of these operators work element-wise and produce delayed arrays (they must be computed if you want the results as unboxed arrays).
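A brief sketch (assuming repa is installed):

```haskell
import Data.Array.Repa

a, b :: Array U DIM1 Int
a = fromListUnboxed (Z :. (3 :: Int)) [1, 2, 3]
b = fromListUnboxed (Z :. (3 :: Int)) [10, 20, 30]

-- (+^) is element-wise and produces a delayed array;
-- computeS forces it back into an unboxed one
c :: Array U DIM1 Int
c = computeS (a +^ b)

main :: IO ()
main = print (toList c)   -- [11,22,33]
```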

The swap function reorders the index space of the array.
To do this, we extract the current shape of the array, and write a function
that maps the index space from the old array to the new array. That index space function is then passed to backpermute which actually constructs the new array from the old one.

backpermute generates a new array from an old, when given the new shape, and a function that translates between the index space of each array (i.e. a shape transformer).
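Putting those two paragraphs together, here is a sketch of such a swap for 2D arrays (repa's built-in transpose does the same job for the two innermost dimensions):

```haskell
import Data.Array.Repa

-- Transpose a 2D array by reordering its index space with backpermute.
swap2D :: Source r e => Array r DIM2 e -> Array D DIM2 e
swap2D arr = backpermute newExtent swapIndex arr
  where
    Z :. rows :. cols = extent arr
    newExtent = Z :. cols :. rows
    -- maps an index of the *new* array to an index of the old one
    swapIndex (Z :. i :. j) = Z :. j :. i

x :: Array U DIM2 Int
x = fromListUnboxed (Z :. (2 :: Int) :. (3 :: Int)) [1 .. 6]

main :: IO ()
main = print (toList (computeS (swap2D x) :: Array U DIM2 Int))   -- [1,4,2,5,3,6]
```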

Example: matrix-matrix multiplication

A more advanced example from the Repa paper is matrix-matrix multiplication. The result of matrix multiplication is a matrix whose elements are found by multiplying the elements of each row from the first matrix by the associated elements of the same column from the second matrix and summing the result.

if A = [a_ij] is an m×n matrix and B = [b_ij] is an n×p matrix,

then C = AB is the m×p matrix whose elements are

    c_ij = a_i1 * b_1j + a_i2 * b_2j + ... + a_in * b_nj

So we take two 2D arrays and generate a new array, using our transpose
function from earlier:

The idea is to expand both 2D argument arrays into 3D arrays by
replicating them across a new axis using the extend function. The front face of the resulting cuboid represents the array a, which we replicate as often as b has columns (colsB), producing aRepl.

The top face represents t (the transposed b), which we
replicate as often as a has rows (rowsA), producing
bRepl. The two replicated arrays have the same extent,
which corresponds to the index space of matrix multiplication.
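The original code is not shown here, but a sequential sketch following that description might look like this (the name mmMult and the matrix sizes are illustrative; sumS reduces the innermost axis, which here is the summation index):

```haskell
import Data.Array.Repa

-- Matrix-matrix multiplication by replication along a new axis,
-- following the description above (sequential, unoptimised).
mmMult :: Array U DIM2 Double -> Array U DIM2 Double -> Array U DIM2 Double
mmMult a b = sumS (aRepl *^ bRepl)
  where
    t     = transpose b                           -- t[j,k] = b[k,j]
    aRepl = extend (Z :. All :. colsB :. All) a   -- aRepl[i,j,k] = a[i,k]
    bRepl = extend (Z :. rowsA :. All :. All) t   -- bRepl[i,j,k] = b[k,j]
    Z :. rowsA :. _     = extent a
    Z :. _     :. colsB = extent b

ma, mb :: Array U DIM2 Double
ma = fromListUnboxed (Z :. (2 :: Int) :. (2 :: Int)) [1, 2, 3, 4]
mb = fromListUnboxed (Z :. (2 :: Int) :. (2 :: Int)) [5, 6, 7, 8]

main :: IO ()
main = print (toList (mmMult ma mb))   -- [19.0,22.0,43.0,50.0]
```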

Optimized implementations of this function are available in the
repa-algorithms package.

Example: parallel image desaturation

To convert an image from color to greyscale, we can use the luminosity method to average RGB pixels into a common grey value, where the average is weighted for human perception of green.

import System.Environment
import Data.Word
import Data.Array.Repa hiding ((++))
import Data.Array.Repa.IO.DevIL
import Data.Array.Repa.Repr.ForeignPtr

--
-- Read an image, desaturate, write out with new name.
--
main = do
    [f] <- getArgs
    runIL $ do
        (RGB a) <- readImage f
        b <- (computeP $ traverse a id luminosity) :: IL (Array F DIM3 Word8)
        writeImage ("grey-" ++ f) (RGB b)

And now the luminosity transform itself, which averages the 3 RGB colors based on perceived weight:
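A sketch consistent with that description, using the standard luminosity weights 0.21/0.71/0.07 (the clause passing a fourth, alpha channel through unchanged is an assumption for RGBA data):

```haskell
import Data.Word (Word8)
import Data.Array.Repa

-- Average the R, G and B channels of a pixel, weighted for human
-- perception of green; `f` looks up elements of the source array.
luminosity :: (DIM3 -> Word8) -> DIM3 -> Word8
luminosity _ (Z :. _ :. _ :. 3) = 255   -- alpha channel stays opaque
luminosity f (Z :. i :. j :. _) = ceiling (0.21 * r + 0.71 * g + 0.07 * b)
  where
    r = fromIntegral (f (Z :. i :. j :. 0)) :: Double
    g = fromIntegral (f (Z :. i :. j :. 1)) :: Double
    b = fromIntegral (f (Z :. i :. j :. 2)) :: Double

main :: IO ()
main = print (luminosity (const 0) (Z :. 0 :. 0 :. 0))   -- 0
```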

Optimising Repa programs

Note: This section is based on an older version of Repa (2.x). It seems that it is not entirely accurate for Repa 3. For example, there is no force function, and the array type (manifest or delayed) is now explicitly defined using a type index. I do not feel sufficiently competent to update this section so I am leaving it "as is" for the time being. --killy9999 09:12, 11 October 2012 (UTC)

Fusion, and why you need it

Repa depends critically on array fusion to achieve fast code. Fusion is a fancy name for the combination of inlining and code transformations performed by GHC when it compiles your program. The fusion process merges the array-filling loops defined in the Repa library with the "worker" functions that you write in your own module. If the fusion process fails, the resulting program will be much slower than it needs to be, often 10x slower than an equivalent program using plain Haskell lists. On the other hand, provided fusion works, the resulting code will run as fast as an equivalent cleanly written C program. Making fusion work is not hard once you understand what's going on.

The force function has the loops

Suppose we have the following binding:

arr' = R.force $ R.map (\x -> x + 1) arr

The right-hand side of this binding will compile down to code that first allocates the result array arr', then iterates over the source array arr, reading each element in turn, adding one to it, and writing to the corresponding element in the result.

Importantly, the code that does the allocation, iteration and update is defined as part of the force function. This forcing code has been written to break up the result into several chunks, and evaluate each chunk with a different thread. This is what makes your code run in parallel. If you do not use force then your code will be slow and not run in parallel.

Delayed and Manifest arrays

In the example from the previous section, think of the R.map (\x -> x + 1) arr expression as a specification for a new array. In the library, this specification is referred to as a delayed array. A delayed array is represented as a function that takes an array index, and computes the value of the element at that index.

Applying force to a delayed array causes all elements to be computed in parallel. The result of a force is referred to as a manifest array. A manifest array is a "real" array represented as a flat chunk of memory containing array elements.

All Repa array operators will accept both delayed and manifest arrays. However, if you index into a delayed array without forcing it first, then each indexing operation costs a function call. It also recomputes the value of the array element at that index.

Shells and Springs

Here is another way to think about Repa's approach to array fusion. Suppose we write the following binding:

arr' = R.force $ R.map (\x -> x * 2) $ R.map (\x -> x + 1) arr

Remember from the previous section, that the result of each of the applications of R.map is a delayed array. A delayed array is not a "real", manifest array, it's just a shell that contains a function to compute each element. In this example, the two worker functions correspond to the lambda expressions applied to R.map.

When GHC compiles this example, the two worker functions are fused into a fresh unfolding of the parallel loop defined in the code for R.force. Imagine holding R.force in your left hand, and squashing the calls to R.map into it, like a spring. Doing this breaks all the shells, and you end up with the worker functions fused into an unfolding of R.force.

INLINE worker functions

During compilation, we need GHC to fuse our worker functions into a fresh unfolding of R.force. In this example, fusion includes inlining the definition of the worker function, f. If f is not inlined, then the performance of the compiled code will be atrocious: it will perform a function call for each application of f, where it really only needs a single machine instruction to increment the x value.

Now, in general, GHC tries to avoid producing binaries that are "too big". Part of this is a heuristic that controls exactly what functions are inlined. The heuristic says that a function may be inlined only if it is used once, or if its definition is less than some particular size. If neither of these apply, then the function won't be inlined, killing performance.

For Repa programs, as fusion and inlining have such a dramatic effect on performance, we absolutely should not rely on heuristics to control whether this inlining takes place. If we rely on a heuristic, then even if our program runs fast today, functions that used to be inlined may no longer be if that heuristic is ever altered.

The moral of the story is to attach INLINE pragmas to all of your client functions that compute array values. This ensures that these critical functions will be inlined now, and forever.
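For example (step is a hypothetical worker; the pragma is standard GHC syntax):

```haskell
-- Without the pragma, GHC's size heuristic decides whether `step` is
-- inlined into the unfolding of the array-filling loop;
-- with it, the inlining is guaranteed.
{-# INLINE step #-}
step :: Double -> Double
step x = x * 2 + 1

main :: IO ()
main = print (map step [1, 2, 3])   -- [3.0,5.0,7.0]
```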