Programming Haskell: string processing (with a dash of concurrency)

Today we’ll look more into the basic tools at our disposal in the Haskell language, in particular, operations for doing IO and playing with files and strings.

Administrivia

Before we get started, I should clarify a small point raised by yesterday’s article. One issue I forgot to mention was that there are slight differences between running Haskell in ghci, the bytecode interpreter, and compiling it to native code with GHC.

Haskell programs are executed by evaluating the special ‘main’ function.
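The example source file itself is missing from this copy; a minimal ‘A.hs’ consistent with the ‘7’ printed in the transcripts below might be:

```haskell
-- A.hs (reconstructed example: any program whose 'main' prints 7
-- would match the transcript below)
main :: IO ()
main = print (3 + 4)
```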

To compile this to native code, we would feed the source file to the compiler:

$ ghc A.hs
$ ./a.out
7

For a faster turnaround, we can run the code directly through the bytecode interpreter, GHCi, using the ‘runhaskell’ program:

$ runhaskell A.hs
7

GHCi, the interactive Haskell environment, is a little bit different. As it is an interactive system, GHCi must execute your code sequentially, as you define each line. This is different to normal Haskell, where the order of definition is irrelevant. GHCi effectively executes your code inside a do-block. Therefore you can use the do-notation at the GHCi prompt to define new functions:
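The prompt example is missing from this copy; a sketch of what defining and running an IO action with do-notation in GHCi might look like (the ‘greet’ function and the session output are my own illustration):

```
Prelude> let greet = do { putStr "Who? " ; n <- getLine ; putStrLn ("Hello, " ++ n) }
Prelude> greet
Who? world
Hello, world
```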

IO

As the Camel Book says:

Unless you’re using artificial intelligence to model a solipsistic philosopher, your program needs some way to communicate with the outside world.

In yesterday’s tutorial, I briefly introduced ‘readFile’, for reading a String from a file on disk. Let’s now consider IO in more detail. The most common IO operations are defined in the System.IO library.

For the most basic stdin/stdout Unix-style programs in Haskell, we can use the ‘interact’ function:

interact :: (String -> String) -> IO ()

This higher-order function takes, as an argument, some function for processing a string (of type String -> String). It runs this function over the standard input stream, printing the result to standard output. A surprisingly large number of useful programs can be written this way. For example, we can write the Unix ‘cat’ program as:

main = interact id

Yes, that’s it! Let’s compile and run this program:

$ ghc -O A.hs
$ cat A.hs | ./a.out
main = interact id

How does this work? Firstly, ‘interact’ is defined as:

interact f = do
    s <- getContents
    putStr (f s)

So it reads a string from standard input, and writes to standard output the result of applying its argument function to that string. The ‘id’ function itself has the type:

id :: a -> a

‘id’ is a function of one argument, of any type (the lowercase ‘a’ in the type means any type can be used in that position, i.e. it is a polymorphic function, also called a generic function in some languages). ‘id’ takes a value of some type ‘a’, and returns a value of the same type. There is only one (non-trivial) function of this type:

id a = a

So ‘interact id’ will print the input string to standard output unmodified.

Let’s now write the ‘wc’ program:

main = interact count

count s = show (length s) ++ "\n"

This will print the length of the input string, that is, the number of characters:

$ runhaskell A.hs < A.hs
57

Line oriented IO

Only a small number of programs operate on unstructured input streams. It is far more common to treat an input stream as a list of lines. So let’s do that. To break a string up into lines, we’ll use the … ‘lines’ function, defined in the Data.List library:

lines :: String -> [String]

The type, once again, tells the story. ‘lines’ takes a String, and breaks it up into a list of strings, splitting on newlines. To join a list of strings back into a single string, inserting newlines, we’d use the … ‘unlines’ function:

unlines :: [String] -> String

There are also similar functions for splitting on words, namely ‘words’ and ‘unwords’. Now, an example. To count the number of lines in a file:

main = interact (count . lines)

We can run this as:

$ ghc -O A.hs
$ ./a.out < A.hs
3

Here we reuse the ‘count’ function from above, by composing it with the lines function.

On composition

This nice code reuse via composition is achieved using the (.) function, pronounced ‘compose’. Let’s look at how that works. (Feel free to skip this section, if you want to just get things done).

The (.) function is just a normal everyday Haskell function, defined as:

(.) f g x = f (g x)

This looks a little like magic (or line noise), but it’s pretty easy. The (.) function simply takes two functions as arguments, along with another value. It applies the ‘g’ function to the value ‘x’, and then applies ‘f’ to the result, returning this final value. The functions may be of any type. The type of (.) is actually:

(.) :: (b -> c) -> (a -> b) -> a -> c

which might look a bit hairy, but it essentially specifies what types of arguments make sense to compose. That is, only those where:

f :: b -> c
g :: a -> b
x :: a

can be composed, yielding a new function of type:

(f . g) :: a -> c

The nice thing is that this composition makes sense (and works) for all types a, b and c.

How does this relate to code reuse? Well, since our ‘count’ function is polymorphic, it works equally well counting the length of a string, or the length of a list of strings. Our little ‘wc’ program is the epitome of the phrase: “higher order + polymorphic = reusable”. That is, functions which take other functions as arguments, when combined with functions that work over any type, produce great reusable ‘glue’. You need only vary the argument function to gain terrific code reuse (and the strong type checking ensures you can only reuse code in ways that work).
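To see the polymorphism at work, here is the very same ‘count’ applied unchanged both to a string of characters and to a list of lines (a small demonstration of my own):

```haskell
-- 'count' works on a list of anything: [a]
count :: [a] -> String
count s = show (length s) ++ "\n"

main :: IO ()
main = do
    putStr (count "hello")            -- counts characters: 5
    putStr (count (lines "one\ntwo")) -- counts lines: 2
```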

More on lines

Another little example, let’s reverse each line of a file (like the unix ‘rev’ command):

main = interact (unlines . map reverse . lines)

Which when run, reverses the input lines:

$ ./a.out < B.hs
rahC.ataD tropmi
ebyaM.ataD tropmi
tsiL.ataD tropmi

So we take the input string, split it into lines, and then loop over that list of lines, reversing each of them, using the ‘map’ function. Finally, once we’ve reversed each line, we join them back into a single string with unlines, and print it out.

The ‘map’ function is a fundamental control structure of functional programming, similar to the ‘foreach’ keyword in a number of imperative languages. ‘map’ however is just a function on lists, not built-in syntax, and has the type:

map :: (a -> b) -> [a] -> [b]

That is, it takes some function, and a list, and applies that function to each element of the list, returning a new list as a result. Since loops are so common in programming, we’ll be using ‘map’ a lot. Just for reference, ‘map’ is implemented as:

map _ []     = []
map f (x:xs) = f x : map f xs
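A couple of illustrative uses (my own examples, using the standard Prelude ‘map’):

```haskell
main :: IO ()
main = do
    print (map (\x -> x * x) [1, 2, 3, 4 :: Int]) -- squares each element
    print (map reverse ["abc", "xyz"])            -- reverses each string
```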

File IO

Operating on stdin/stdout is good for scripts (and this is how tools like sed or perl -p work), but for ‘real’ programs we’ll at least need to do some file IO. The basic operations of files are:

readFile  :: FilePath -> IO String
writeFile :: FilePath -> String -> IO ()

‘readFile’ takes a file name as an argument, does some IO, and returns the file’s contents as a string. ‘writeFile’ takes a file name, a string, and does some IO (writing that string to the file), before returning the void (or unit) value, ().

Since we’re doing IO (the types of readFile and writeFile enforce this), the code runs inside a do-block, using the IO monad. “Using the IO monad” just means that we wish to use an imperative, sequential order of evaluation. (As an aside, a wide range of other monads exist, for programming different program evaluation strategies, such as Prolog-style backtracking, or continuation-based evaluation. All of imperative programming is just one subset of possible evaluation strategies you can use in Haskell, via monads).

In do-notation, whenever you wish to run an action, for its side effect, and save the result to a variable, you would write:

v <- action

For example, to run the ‘readFile’ action, which has the side effect of reading a file from disk, we say:

s <- readFile f

Finally, we can use the ‘appendFile’ function to append to an existing file.
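Putting the three together (the file name ‘greeting.txt’ is my own invention for this sketch):

```haskell
main :: IO ()
main = do
    writeFile  "greeting.txt" "hello\n"  -- create (or truncate) the file
    appendFile "greeting.txt" "world\n"  -- add a line to the existing file
    s <- readFile "greeting.txt"         -- read the whole contents back
    putStr s
```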

File Handles

The most generic interface to files is provided via Handles. Sometimes we need to keep a file open, for multiple reads or writes. To do this we use Handles, an abstraction much like the underlying system’s file handles.

To open up a file, and get its Handle, we use:

openFile :: FilePath -> IOMode -> IO Handle

So to open a file for reading only, in GHCi:

Prelude System.IO> h <- openFile "A.hs" ReadMode
{handle: A.hs}

Which returns a Handle onto the file “A.hs”. We can read a line from this handle:
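(The rest of the session is missing from this copy.) Reading a line uses ‘hGetLine’, and ‘hClose’ releases the handle when we’re done; a sketch of how the session might continue:

```
Prelude System.IO> s <- hGetLine h
Prelude System.IO> hClose h
```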

An example: spell checking

Here is a small example of combining the Data.Set and List data structures from yesterday’s tutorial, with more IO operations. We’ll implement a little spell checker, building the dictionary in a Set data type. First, some libraries to import:
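The program itself is missing from this copy. Here is a minimal sketch in its spirit; the tiny inline dictionary is my own simplification (the original presumably built its Set from a real word-list file):

```haskell
import qualified Data.Set as Set

-- The words of the input that do not appear in the dictionary.
spellErrors :: Set.Set String -> String -> [String]
spellErrors dict s = [ w | w <- words s, not (w `Set.member` dict) ]

main :: IO ()
main = do
    -- A toy dictionary; a real checker would load e.g. a system word list.
    let dict = Set.fromList ["the", "cat", "sat", "on", "mat"]
    interact (unlines . spellErrors dict)
```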

Extension: using SMP parallelism

Finally, just for some bonus fun … and hold on to your hat ’cause I’m going to go fast … we’ll add some parallelism to the mix.

Haskell was designed from the start to support easy parallelisation, and since GHC 6.6, multithreaded code will run transparently on multicore systems using as many cores as you specify. Let’s look at how we’d parallelise our little program to exploit multiple cores. We’ll use an explicit threading model, via Control.Concurrent. You can also make your code implicitly parallel, using Control.Parallel.Strategies, but we’ll leave that for another tutorial.

The ‘run’ function sets up a channel (‘errs’) between the main thread and all the child threads, and prints spelling errors as they arrive on the channel from the children. It then forks off ‘n’ child threads, one for each piece of the work list:

run n dict work = do
    chan <- newChan
    errs <- getChanContents chan -- errors returned back to main thread
    mapM_ (forkIO . thread chan dict) (zip [1..n] work)
    wait n errs 0

The main thread then just waits on all the threads to finish, printing any spelling errors they pass up:
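The definitions of ‘thread’ and ‘wait’ never made it into this copy; here is a self-contained sketch consistent with the ‘run’ function above. The Maybe-on-the-channel protocol (each worker sends its errors as ‘Just’ values and a ‘Nothing’ when done), the helper ‘misspelled’, and the toy data in ‘main’ are my assumptions:

```haskell
import Control.Concurrent
import qualified Data.Set as Set

-- Pure part: the misspellings in one chunk of words.
misspelled :: Set.Set String -> [String] -> [String]
misspelled dict ws = [ w | w <- ws, not (w `Set.member` dict) ]

-- Each worker reports its errors as 'Just' values, then a 'Nothing'
-- to signal completion.
thread :: Chan (Maybe String) -> Set.Set String -> (Int, [String]) -> IO ()
thread chan dict (_me, ws) = do
    mapM_ (writeChan chan . Just) (misspelled dict ws)
    writeChan chan Nothing

-- Print errors as they arrive, until all 'n' workers have finished.
wait :: Int -> [Maybe String] -> Int -> IO ()
wait n (Just e  : xs) done = putStrLn e >> wait n xs done
wait n (Nothing : xs) done
    | done + 1 == n = return ()
    | otherwise     = wait n xs (done + 1)
wait _ []             _    = return ()

main :: IO ()
main = do
    chan <- newChan
    errs <- getChanContents chan
    let dict = Set.fromList ["cat", "dog"]
    mapM_ (forkIO . thread chan dict) (zip [1..2] [["cat", "dgo"], ["dog"]])
    wait 2 errs 0
```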

Now, we can run ‘n’ worker threads (lightweight Haskell threads), mapped onto ‘m’ OS threads. Since I’m using a 4-core Linux server, we’ll play around with 4 OS threads. First, running everything in a single thread:

Ok. Good… A little bit faster, uses a little bit more CPU. It turns out, however, that the program is currently bound by the time spent in the main thread building the initial dictionary. Actual searching time is only some 10% of the running time. Nonetheless, it was fairly painless to break up the initial simple program into a parallel version.

If the program’s running time were extended (as would be the case for a server), the parallelism would be a win. Additionally, should we buy more cores for the server, all we need to do is change the +RTS -N argument to the program, to start utilising these extra cores.

Next week

In upcoming tutorials we’ll look more into implicitly parallel programs, and the use of the new high performance ByteString data type for string processing.