So far, we've been talking mostly about high-level concepts. Haskell can
also be used for lower-level systems programming. It is quite possible
to write programs that interface with the operating system at a low level
using Haskell.

In this chapter, we are going to attempt something ambitious: a Perl-like
"language" that is valid Haskell, implemented in pure Haskell, that makes
shell scripting easy. We are going to implement piping, easy command
invocation, and some simple tools to handle tasks that might otherwise be
performed with grep or sed.

Specialized modules exist for different operating systems. In this
chapter, we will use generic OS-independent modules as much as possible.
However, we will be focusing on the POSIX environment for much of the
chapter. POSIX is a standard for Unix-like operating systems such as
Linux, FreeBSD, Mac OS X, or Solaris. Windows does not support POSIX by
default, but the Cygwin environment provides a POSIX compatibility layer
for Windows.

Running External Programs

It is possible to invoke external commands from Haskell. To do that,
we suggest using rawSystem from the
System.Cmd module. This will invoke a specified
program, with the specified arguments, and return the exit code from
that program. You can play with it in ghci:
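A minimal sketch of such a call follows. It imports rawSystem from System.Process, which exports the same function; that import is an assumption on our part, made because some newer installations no longer ship System.Cmd. The name listUsr is ours.

```haskell
import System.Exit (ExitCode(..))
import System.Process (rawSystem)

-- Run "ls -l /usr" directly, without a shell, and return its exit code.
-- Each argument is a separate list element.
listUsr :: IO ExitCode
listUsr = rawSystem "ls" ["-l", "/usr"]
```

Evaluating listUsr in ghci prints the directory listing and then shows the ExitCode value.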

Here, we run the equivalent of the shell command ls -l /usr.
rawSystem does not parse arguments from a string or
expand wildcards.[43] Instead, it expects every argument to be contained
in a list. If you don't want to pass any arguments, you can simply
pass an empty list like this:
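A short sketch of the no-argument case, again assuming rawSystem is imported from System.Process; listCurrentDir is a name chosen for illustration.

```haskell
import System.Exit (ExitCode)
import System.Process (rawSystem)

-- With an empty argument list, ls lists the current working directory.
listCurrentDir :: IO ExitCode
listCurrentDir = rawSystem "ls" []
```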

Directory and File Information

The System.Directory module contains quite a few
functions that can be used to obtain information from the filesystem.
You can get a list of files in a directory, rename or delete files,
copy files, change the current working directory, or create new
directories. System.Directory is portable and works
on any platform where GHC itself works.

The library
reference for System.Directory provides a
comprehensive list of the functions available. Let's use ghci to
demonstrate a few of them. Most of these functions are straightforward
equivalents to C library calls or shell commands.

getDirectoryContents returns a list of every item
in a given directory. Note that on POSIX systems, this list normally
includes the special values "." and
"..". You will usually want to filter these out
when processing the contents of the directory, perhaps like this:
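One way to sketch that filtering is shown here. The helper name visibleEntries is ours, not part of System.Directory.

```haskell
import System.Directory (getDirectoryContents)

-- List a directory, dropping the special "." and ".." entries.
visibleEntries :: FilePath -> IO [FilePath]
visibleEntries dir = do
  entries <- getDirectoryContents dir
  return (filter (`notElem` [".", ".."]) entries)
```

For example, visibleEntries "/" returns the names found in the root directory without the two special entries.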

Is the filter (`notElem` [".", ".."]) part
confusing? That could also be written as filter
(\c -> not $ elem c [".", ".."]). The backticks in
this case effectively let us pass the second argument to
notElem; see the section called “Infix functions” for
more information on backticks.

You can also query the system about the location of certain
directories. This query will ask the underlying operating system for
the information.
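As a hedged illustration, the following sketch asks for three well-known locations using functions from System.Directory. The wrapper reportDirectories and its labels are our own.

```haskell
import System.Directory
  (getCurrentDirectory, getHomeDirectory, getTemporaryDirectory)

-- Ask the operating system for a few well-known directories.
reportDirectories :: IO [(String, FilePath)]
reportDirectories = do
  home <- getHomeDirectory
  tmp  <- getTemporaryDirectory
  cwd  <- getCurrentDirectory
  return [("home", home), ("temporary", tmp), ("current", cwd)]
```

The results depend on the machine and on the user running the program.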

Program Termination

Developers often write individual programs to accomplish particular
tasks. These individual parts may be combined to accomplish larger
tasks. A shell script or another program may execute them. The
calling script often needs a way to discover whether the program was
able to complete its task successfully. Haskell automatically
indicates a non-successful exit whenever a program is aborted by an
exception.

However, you may need more fine-grained control over the
exit code than that. Perhaps you need to return different codes for
different types of errors.
The System.Exit module provides a way to exit the
program and return a specific exit status code to the caller. You can
call exitWith ExitSuccess to return a code
indicating a successful termination (0 on POSIX systems). Or, you can
call something like exitWith (ExitFailure 5), which
will return code 5 to the calling program.
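The idea of one code per kind of error might be sketched as follows. The Failure type, its constructors, and the chosen numbers are purely illustrative assumptions; only exitWith, ExitSuccess, and ExitFailure come from System.Exit.

```haskell
import System.Exit (ExitCode(..), exitWith)

-- Hypothetical error categories for some program.
data Failure = BadInput | MissingFile

-- Map each category to its own exit code.
failureCode :: Failure -> ExitCode
failureCode BadInput    = ExitFailure 2
failureCode MissingFile = ExitFailure 5

-- Terminate the program, reporting the outcome to the caller.
finish :: Either Failure () -> IO a
finish (Right ()) = exitWith ExitSuccess
finish (Left err) = exitWith (failureCode err)
```

A calling shell script could then branch on the exit status to tell the two failures apart.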

Dates and Times

Everything from file timestamps to business transactions involves dates
and times. Haskell provides ways for manipulating dates and times, as
well as features for obtaining date and time information from the
system.

ClockTime and CalendarTime

In Haskell, the
System.Time module is primarily responsible for
date and time handling. It defines two types: ClockTime and
CalendarTime.

ClockTime is the Haskell version of
the traditional POSIX epoch. A ClockTime represents a time
relative to midnight the morning of January 1, 1970, UTC. A
negative ClockTime represents a number of seconds prior to that date, while a
positive number represents a count of seconds after it.

ClockTime is convenient for computations. Since it tracks
Coordinated Universal Time (UTC), it doesn't have to adjust for local
timezones, daylight saving time, or other special cases in time
handling. Every day is exactly (60 * 60 * 24) or 86,400
seconds[44], which makes time
interval calculations simple. You can, for instance, check the
ClockTime at the start of a long task, again at the end, and simply
subtract the start time from the end time to determine how much time
elapsed. You can then divide by 3600 and display the elapsed time as a
count of hours if you wish.
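The elapsed-time calculation can be sketched as below. It assumes System.Time is available (on current installations it lives in the separate old-time package) and pattern-matches on the TOD constructor that is described later in this chapter. The names elapsedSeconds, timed, and toHours are ours.

```haskell
import System.Time (ClockTime(..), getClockTime)

-- Whole seconds between two clock readings; picoseconds are ignored.
elapsedSeconds :: ClockTime -> ClockTime -> Integer
elapsedSeconds (TOD start _) (TOD end _) = end - start

-- Run an action and report how many whole seconds it took.
timed :: IO a -> IO (a, Integer)
timed action = do
  before <- getClockTime
  result <- action
  after  <- getClockTime
  return (result, elapsedSeconds before after)

-- Express a count of seconds as hours.
toHours :: Integer -> Double
toHours secs = fromIntegral secs / 3600
```

Because both readings are offsets from the same epoch, a plain subtraction is all that is needed.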

ClockTime is ideal for answering questions such as these:

How much time has elapsed?

What will be the ClockTime 14 days ahead of this
precise instant?

When was the file last modified?

What is the precise time right now?

These are good uses of ClockTime because they refer to precise,
unambiguous moments in time. However, ClockTime is not as easily
used for questions such as:

Is today Monday?

What day of the week will May 1 fall on next
year?

What is the current time in my local timezone,
taking the potential presence of Daylight Saving Time (DST) into
account?

CalendarTime stores a time the way humans do: with a year, month,
day, hour, minute, second, timezone, and DST information. It's easy
to convert this into a conveniently-displayable string, or to answer
questions about the local time.

You can convert between ClockTime and CalendarTime at will.
Haskell includes functions to convert a ClockTime to a
CalendarTime in the local timezone, or to a CalendarTime
representing UTC.

Using ClockTime

ClockTime is defined in System.Time like this:

data ClockTime = TOD Integer Integer

The first Integer represents the number of seconds since the
epoch. The second Integer represents an additional number of
picoseconds. Because ClockTime in Haskell uses the unbounded
Integer type, it can effectively represent a date range limited only
by computational resources.

Let's look at some ways to use ClockTime. First, there is the
getClockTime function that returns the current
time according to the system's clock.
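A minimal sketch, assuming System.Time from the old-time package; showNow is a name we introduce here.

```haskell
import System.Time (getClockTime)

-- Read the system clock and render it with ClockTime's Show instance.
showNow :: IO String
showNow = fmap show getClockTime
```

In ghci, typing getClockTime by itself has the same visible effect, since ghci shows the result of an IO action.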

If you wait a second and run getClockTime again,
you'll see it returning an updated time. Notice that the output
from this command was a nice-looking string, complete with
day-of-week information. That's due to the Show instance for
ClockTime. Let's look at the ClockTime at a lower level:
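The two lower-level operations might look like this sketch, which assumes System.Time from the old-time package. The names earlyMoment and currentSeconds are ours.

```haskell
import System.Time (ClockTime(..), getClockTime)

-- A ClockTime 1000 seconds (and 0 picoseconds) after the epoch.
earlyMoment :: ClockTime
earlyMoment = TOD 1000 0

-- Pull the seconds component out of the current time.
currentSeconds :: IO Integer
currentSeconds = do
  TOD secs _ <- getClockTime
  return secs
```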

Here we first construct a ClockTime representing the point in
time 1000 seconds after midnight on January 1, 1970, UTC.
That moment in time is known as the epoch.
Depending on your timezone, this moment in time may correspond to
the evening of December 31, 1969, in your local timezone.

The second example shows us pulling the number of seconds out of
the value returned by getClockTime. We can now
manipulate it, like so:
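One possible manipulation is sketched here: shifting a ClockTime by a number of seconds. The helper addSeconds is hypothetical, and System.Time is assumed to come from the old-time package.

```haskell
import System.Time (ClockTime(..))

-- Shift a ClockTime by a (possibly negative) number of seconds,
-- leaving the picoseconds untouched.
addSeconds :: Integer -> ClockTime -> ClockTime
addSeconds n (TOD secs picos) = TOD (secs + n) picos
```

For instance, addSeconds 86400 moves a time forward by exactly one day.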

Using CalendarTime

There are a few things about the CalendarTime structure, and the Month
and Day types used in its fields, that should be highlighted:

ctWDay, ctYDay, and ctTZName are generated by the library
functions that create a CalendarTime, but are not used
in calculations. If you are creating a CalendarTime by hand,
it is not necessary to put accurate values into these fields,
unless your later calculations will depend upon them.

All three of these types (CalendarTime, Month, and Day) are members of
the Eq, Ord, Read, and Show typeclasses. In addition,
Month and Day are declared as members of the
Enum and Bounded typeclasses. For more information on
these typeclasses, refer to
the section called “Important Built-In Typeclasses”.

You can generate CalendarTime values in several ways. You could
start by converting a ClockTime to a CalendarTime, like
this:
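A sketch of that conversion, assuming System.Time from the old-time package; localAndUtcNow is our own name.

```haskell
import System.Time
  (CalendarTime(..), getClockTime, toCalendarTime, toUTCTime)

-- Convert the current ClockTime to a CalendarTime twice:
-- once in the local timezone (an IO action) and once in UTC (pure).
localAndUtcNow :: IO (CalendarTime, CalendarTime)
localAndUtcNow = do
  now   <- getClockTime
  local <- toCalendarTime now
  return (local, toUTCTime now)
```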

We used getClockTime to obtain the current
ClockTime from the system's clock. Next,
toCalendarTime converts the ClockTime to a
CalendarTime representing the time in the local timezone.
toUTCTime performs a similar conversion,
but its result is in the UTC timezone instead of the local
timezone.

Notice that toCalendarTime is an IO
function, but toUTCTime is not. The reason
is that toCalendarTime returns a different
result depending upon the locally-configured timezone, but
toUTCTime will return the exact same result
whenever it is passed the same source ClockTime.
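Going the other way, from CalendarTime back to ClockTime, can be sketched like this. It assumes System.Time from the old-time package; inYear1960 and compareYears are illustrative names.

```haskell
import System.Time

-- The same calendar fields, with only the year replaced.
inYear1960 :: CalendarTime -> CalendarTime
inYear1960 ct = ct { ctYear = 1960 }

-- Convert both the unmodified and the modified value back to ClockTime.
compareYears :: IO (ClockTime, ClockTime)
compareYears = do
  now <- getClockTime
  cal <- toCalendarTime now
  return (toClockTime cal, toClockTime (inYear1960 cal))
```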

In this example, we first took the CalendarTime value from
earlier and simply switched its year to 1960. Then, we used
toClockTime to convert the unmodified value
to a ClockTime, and then the modified value, so you can see
the difference. Notice that the modified value shows a
negative number of seconds once converted to ClockTime.
That's to be expected, since a ClockTime is an offset from
midnight on January 1, 1970, UTC, and this value is in 1960.
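A CalendarTime can also be built by hand. The sketch below, assuming System.Time from the old-time package, fills in January 15, 2010 while deliberately leaving the weekday and day-of-year fields inaccurate; the name handMade is ours.

```haskell
import System.Time

-- A hand-built CalendarTime. ctWDay and ctYDay are deliberately wrong.
handMade :: CalendarTime
handMade = CalendarTime
  { ctYear = 2010, ctMonth = January, ctDay = 15
  , ctHour = 12, ctMin = 30, ctSec = 0, ctPicosec = 0
  , ctWDay = Sunday      -- not the real weekday
  , ctYDay = 0           -- not the real day of the year
  , ctTZName = "UTC", ctTZ = 0, ctIsDST = False
  }
```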

Note that even though January 15, 2010, isn't a Sunday -- and
isn't day 0 in the year -- the system was able to process this
just fine. In fact, if we convert the value to a ClockTime
and then back to a CalendarTime, you'll find those fields
properly filled in:
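The round trip might be sketched as follows, assuming System.Time from the old-time package. The value is rebuilt here with positional arguments so the block stands alone; roundTrip and handMade are our names.

```haskell
import System.Time

-- January 15, 2010, 12:30 UTC, with placeholder weekday and day-of-year.
handMade :: CalendarTime
handMade = CalendarTime 2010 January 15 12 30 0 0 Sunday 0 "UTC" 0 False

-- Convert to ClockTime and back; the library recomputes ctWDay and ctYDay.
roundTrip :: CalendarTime -> CalendarTime
roundTrip = toUTCTime . toClockTime
```

Inspecting ctWDay and ctYDay of roundTrip handMade shows the recomputed values.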

TimeDiff for ClockTime

Because it can be difficult to manage differences between
ClockTime values in a human-friendly way, the
System.Time module includes a TimeDiff type.
TimeDiff can be used, where convenient, to handle these
differences. It is defined like this:

addToClockTime takes a TimeDiff and a ClockTime and handles the
calculation internally by converting to
a CalendarTime in UTC, applying the differences, and converting
back to a ClockTime. Its counterpart, diffClockTimes, takes two
ClockTime values and returns the TimeDiff between them.
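The following sketch exercises these functions, assuming System.Time from the old-time package. The leading comment lists the TimeDiff fields as that package declares them; utcMidnight, oneMonthLater, and gap are helper names of our own.

```haskell
import System.Time

-- TimeDiff is a record with these fields:
--   tdYear, tdMonth, tdDay, tdHour, tdMin, tdSec :: Int
--   tdPicosec :: Integer
-- noTimeDiff is the value with every field set to zero.

-- Midnight UTC on the given date (weekday and day-of-year are placeholders).
utcMidnight :: Int -> Month -> Int -> ClockTime
utcMidnight y m d =
  toClockTime (CalendarTime y m d 0 0 0 0 Sunday 0 "UTC" 0 False)

-- Add one calendar month to a ClockTime.
oneMonthLater :: ClockTime -> ClockTime
oneMonthLater = addToClockTime (noTimeDiff { tdMonth = 1 })

-- The difference between two times, reformatted for human readers.
gap :: ClockTime -> ClockTime -> TimeDiff
gap later earlier = normalizeTimeDiff (diffClockTimes later earlier)
```

Applying toUTCTime to the result of oneMonthLater shows the new date in UTC rather than in the local timezone.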

We started by generating a ClockTime representing midnight
February 5, 2008 in UTC. Note that, unless your timezone is the
same as UTC, when this time is printed out on the display, it may
show up as the evening of February 4 because it is formatted for
your local timezone.

Next, we add one month to it by calling
addToClockTime. 2008 is a leap year, but the
system handled that properly and we get a result that has the same
date and time in March. By using toUTCTime, we
can see the effect on this in the original UTC timezone.

For a second experiment, we set up a time representing midnight
on January 30, 2009 in UTC. 2009 is not a leap year, so we might
wonder what will happen when trying to add one month to it. We can
see that, since neither February 29 nor February 30 exists in 2009,
we wind up with March 2.

Finally, we can see how diffClockTimes turns two
ClockTime values into a TimeDiff, though only the seconds and
picoseconds are filled in. The
normalizeTimeDiff function takes such a
TimeDiff and reformats it as a human might expect to see it.

File Modification Times

Many programs need to find out when particular files were last
modified. Programs such as ls or graphical file
managers typically display the modification time of files.
The System.Directory module contains a
cross-platform getModificationTime function. It
takes a filename and returns a ClockTime representing the time the
file was last modified. For instance:
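A minimal sketch follows. Depending on the installed version of the directory package, getModificationTime returns either a ClockTime or a UTCTime from the time package; both have Show instances, so this sketch only renders the result as text. The name showModificationTime is ours.

```haskell
import System.Directory (getModificationTime)

-- Render a file's last-modification time as a string.
showModificationTime :: FilePath -> IO String
showModificationTime path = fmap show (getModificationTime path)
```

For example, showModificationTime "/etc/passwd" works on most POSIX systems.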

POSIX platforms maintain not just a modification time (known as
mtime), but also the time of last read or write access (atime) and
the time of last status change (ctime). Since this information is
POSIX-specific, the cross-platform
System.Directory module does not provide access to
it. Instead, you will need to use functions in
System.Posix.Files. Here is an example function
to do that:
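Such a function might be sketched as below. It assumes the unix package for System.Posix.Files and the old-time package for ClockTime. The name getTimes is ours, and this toct is our own reconstruction of a helper with that name.

```haskell
import System.Posix.Files
  (accessTime, getFileStatus, modificationTime, statusChangeTime)
import System.Posix.Types (EpochTime)
import System.Time (ClockTime(..))

-- Return the (atime, mtime, ctime) of a file as ClockTime values.
getTimes :: FilePath -> IO (ClockTime, ClockTime, ClockTime)
getTimes path = do
  stat <- getFileStatus path
  return ( toct (accessTime stat)
         , toct (modificationTime stat)
         , toct (statusChangeTime stat) )

-- Convert a POSIX EpochTime (whole seconds) to a ClockTime.
toct :: EpochTime -> ClockTime
toct et = TOD (truncate (toRational et)) 0
```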

Notice the call to getFileStatus. That call maps
directly to the C function stat(). Its return
value stores a vast assortment of information, including file
type, permissions, owner, group, and the three time values we're
interested in. System.Posix.Files provides
various functions, such as accessTime, that
extract the information we're interested in out of the opaque
FileStatus type returned by
getFileStatus.

Functions such as accessTime return
data in a POSIX-specific type called
EpochTime, which we convert to a
ClockTime using the toct function.
System.Posix.Files also provides a
setFileTimes function to set the atime and mtime
for a file.[45]

Extended Example: Piping

We've just seen how to invoke external programs. Sometimes we need
more control than that. Perhaps we need to obtain the output from
those programs, provide input, or even chain together multiple external
programs. Piping can help with all of these needs.
Piping is often used in shell
scripts. When you set up a pipe in the shell, you run multiple
programs. The output of the first program is sent to the input of the
second. Its output is sent to the third as input, and so on. The last
program's output normally goes to the terminal, or it could go to a
file. Here's an example session with the POSIX
shell to illustrate piping:
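To keep the examples in one language, this sketch hands a pipeline to the shell from Haskell using system from System.Process. The command text is our reconstruction from the description that follows, not a quotation of the original session, and shellPipeline is our own name.

```haskell
import System.Exit (ExitCode)
import System.Process (system)

-- Let the shell build a three-stage pipeline:
-- list /etc, keep lines starting with 'm' that later contain "ap",
-- and convert the survivors to uppercase.
shellPipeline :: IO ExitCode
shellPipeline = system "ls /etc | egrep '^m.*ap' | tr a-z A-Z"
```

The output depends on what happens to be in /etc on the machine where it runs.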

This command runs three programs, piping data between them. It starts
with ls /etc, which outputs a list of all files or
directories in /etc. The output of
ls is sent as input to grep. We
gave grep a regular expression that will cause it to
output only the lines that start with 'm' and then
contain "ap" somewhere in the line. Finally, the
result of that is sent to tr. We gave
tr options to convert everything to uppercase. The
output of tr isn't sent anywhere in particular, so it
is displayed on the screen.

In this situation, the shell handles setting up all the pipelines
between programs. By using some of the POSIX tools in Haskell, we can
accomplish the same thing.

Before describing how to do this, we should first warn you that the
System.Posix modules expose a very low-level
interface to Unix systems. The interfaces can be complex and their
interactions can be complex as well, regardless of the programming
language you use to access them. The full nature of these low-level
interfaces has been the topic of entire books themselves, so in this
chapter we will just scratch the surface.

Using Pipes for Redirection

POSIX defines a function that creates a pipe. This function returns
two file descriptors (FDs), which are similar in concept to a Haskell
Handle. One FD is the reading end of the pipe, and the other is the
writing end. Anything that is written to the writing end can be read
by the reading end. The data is "shoved through a pipe".
In Haskell, you call createPipe to access this
interface.
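As a small sketch of createPipe within a single process, the function below writes a string into one end and reads it back from the other. It assumes the unix package; pipeRoundTrip is our own name, and the message must be small enough to fit in the kernel's pipe buffer, since nothing reads while we write.

```haskell
import System.IO (hClose, hGetContents, hPutStr)
import System.Posix.IO (createPipe, fdToHandle)

-- Push a short string through a pipe and read it back.
pipeRoundTrip :: String -> IO String
pipeRoundTrip msg = do
  (readFd, writeFd) <- createPipe      -- reading end first, then writing end
  writeH <- fdToHandle writeFd
  readH  <- fdToHandle readFd
  hPutStr writeH msg
  hClose writeH                        -- closing signals end-of-file to the reader
  result <- hGetContents readH
  length result `seq` hClose readH     -- force the lazy read before closing
  return result
```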

Having a pipe is the first step to being able to pipe data between
external programs. We must also be able to redirect the output of a
program to a pipe, and the input of another program from a pipe. The
Haskell function dupTo accomplishes this. It takes
a FD and makes a copy of it at another FD number. POSIX FDs for
standard input, standard output, and standard error have the predefined
FD numbers of 0, 1, and 2, respectively. By renumbering an endpoint of
a pipe to one of those numbers, we effectively can cause programs to
have their input or output redirected.

There is another piece of the puzzle, however. We can't just use
dupTo before a call such as
rawSystem because this would mess up the standard
input or output of our main Haskell process. Moreover,
rawSystem blocks until the invoked program finishes executing,
leaving us no way to start multiple processes running in parallel. To
make this happen, we must use forkProcess.
This is a very special function. It actually makes a copy of the
currently-running program and you wind up with two copies of the
program running at the same time. Haskell's
forkProcess function takes a function to execute in
the new process (known as the child). We have that function call
dupTo. After it has done that, it calls
executeFile to actually invoke the command. This is
also a special function: if all goes well, it never
returns. That's because executeFile
replaces the running process with a different program. Eventually, the
original Haskell process will call getProcessStatus
to wait for the child processes to terminate and learn of their exit
codes.
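The sequence described above can be sketched for a single child whose standard output is captured. This is a simplified illustration under the assumption that the unix package is available, not the chapter's full implementation; captureOutput is our own name.

```haskell
import System.IO (hClose, hGetContents)
import System.Posix.IO (closeFd, createPipe, dupTo, fdToHandle, stdOutput)
import System.Posix.Process
  (ProcessStatus(..), executeFile, forkProcess, getProcessStatus)

-- Run a command in a child process and collect what it writes to stdout.
captureOutput :: FilePath -> [String] -> IO (String, Maybe ProcessStatus)
captureOutput cmd args = do
  (readFd, writeFd) <- createPipe
  pid <- forkProcess $ do
    _ <- dupTo writeFd stdOutput       -- child: make the pipe its stdout
    closeFd readFd                     -- child: drop FDs it no longer needs
    closeFd writeFd
    executeFile cmd True args Nothing  -- replace the child with the command
  closeFd writeFd                      -- parent: close the child's end promptly
  h   <- fdToHandle readFd
  out <- hGetContents h
  -- Read all output first, then wait for the child to exit.
  status <- length out `seq` getProcessStatus True False pid
  hClose h
  return (out, status)
```

The True passed to executeFile asks it to search the PATH for the command, and Nothing leaves the environment unchanged.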

Whenever you run a command on POSIX systems, whether you've just typed
ls on the command line or used
rawSystem in Haskell, under the hood,
forkProcess, executeFile, and
getProcessStatus (or their C equivalents) are always
being used. To set up pipes, we are duplicating the process that the
system uses to start up programs, and adding a few steps involving
piping and redirection along the way.

There are a few other housekeeping things we must be careful about.
When you call forkProcess, just about everything
about your program is cloned.[46] That includes
the set of open file descriptors (handles). Programs detect when
they're done receiving input from a pipe by checking the end-of-file
indicator. When the process at the writing end of a pipe closes the
pipe, the process at the reading end will receive an end-of-file
indication. However, if the writing file descriptor exists in more
than one process, the end-of-file indicator won't be sent until all
processes have closed that particular FD. Therefore, we must keep
track of which FDs are opened so we can close them all in the child
processes. We must also close the child ends of the pipes in the
parent process as soon as possible.

We start by running a simple command, pwd, which
just prints the name of the current working directory. We pass
[] for the list of arguments, because
pwd doesn't need any arguments. Due to the
typeclasses used, Haskell can't infer the type of
[], so we specifically mention that it's a
String.

Then we get into more complex commands. We run
ls, sending it through grep.
At the end, we set up a pipe to run the exact same command that we
ran via a shell-built pipe at the start of this section. It's not
yet as pleasant as it was in the shell, but then again our program is
still relatively simple when compared to the shell.

Let's look at the program. The very first line has a special
OPTIONS_GHC clause. This is the same as passing
-fglasgow-exts to ghc or ghci. We are using a
GHC extension that permits us to use a (String,
[String]) type as an instance of a
typeclass.[47] By putting
it in the source file, we don't have to remember to specify it every
time we use this module.

After the import lines, we define a few types.
First, we define type SysCommand = (String,
[String]) as an alias. This is the type a command to be
executed by the system will take. We used data of this type for each
command in the example execution above. The
CommandResult type represents the result from
executing a given command, and the CloseFDs type
represents the list of FDs that we must close upon forking a new
child process.

Next, we define a class named CommandLike. This
class will be used to run "things", where a "thing" might be a
standalone program, a pipe set up between two or more programs, or in
the future, even pure Haskell functions. To be a member of this
class, only one function -- invoke -- needs to be
present for a given type. This will let us use
runIO to start either a standalone command or a
pipeline. It will also be useful for defining a pipeline, since we
may have a whole stack of commands on one or both sides of a given
command.
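The types and class just described might be declared along these lines. This is a sketch under assumptions: the field name cmdOutput, the use of an MVar to hold the FD list, and the exact signature of invoke are our guesses, and PureCommand is a hypothetical instance added only to show how a pure Haskell function could join the class.

```haskell
import Control.Concurrent.MVar (MVar, newMVar)
import System.Exit (ExitCode(..))
import System.Posix.Process (ProcessStatus(..))
import System.Posix.Types (Fd)

-- A program name together with its arguments.
type SysCommand = (String, [String])

-- What running a command yields: its output and a way to await its status.
data CommandResult = CommandResult
  { cmdOutput     :: IO String
  , getExitStatus :: IO ProcessStatus
  }

-- FDs that must be closed in every newly forked child (assumed to be an MVar).
type CloseFDs = MVar [Fd]

-- Anything that can be run with some input.
class CommandLike a where
  invoke :: a -> CloseFDs -> String -> IO CommandResult

-- Hypothetical instance: a pure function treated as a command.
newtype PureCommand = PureCommand (String -> String)

instance CommandLike PureCommand where
  invoke (PureCommand f) _ input =
    return (CommandResult (return (f input)) (return (Exited ExitSuccess)))
```

A caller would create the FD list with newMVar [] and pass it to invoke along with the input string.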

Our piping infrastructure is going to use strings as the way of
sending data from one process to another. We can take advantage of
Haskell's support for lazy reading via hGetContents while reading
data, and use forkIO to let writing occur in the
background. This will work well, although not as fast as connecting
the endpoints of two processes directly together.[48] It makes implementation quite
simple, however. We need only take care to do nothing that would
require the entire String to be buffered, and let Haskell's
laziness do the rest.

Next, we define an instance of
CommandLike for SysCommand. We
create two pipes: one to use for the new process's standard input,
and the other for its standard output. This creates four endpoints,
and thus four file descriptors. We add the parent file descriptors
to the list of those that must be closed in all children. These
would be the write end of the child's standard input, and the read
end of the child's standard output. Next, we fork the child process.
In the parent, we can then close the file descriptors that correspond
to the child. We can't do that before the fork, because then they
wouldn't be available to the child. We obtain a handle for the
stdinwrite file descriptor, and start a thread via
forkIO to write the input data to it. We then
define waitfunc, which is the action that the
caller will invoke when it is ready to wait for the called process to
terminate. Meanwhile, the child uses dupTo,
closes the file descriptors it doesn't need, and executes the
command.

Next, we define some utility functions to manage the list of file
descriptors. After that, we define the tools that help set up
pipelines. First, we define a new type
PipeCommand that has a source and destination.
Both the source and destination must be members of
CommandLike. We also define the
-|- convenience operator. Then, we make
PipeCommand an instance of
CommandLike. Its invoke
implementation starts the first command with the given input, obtains
its output, and passes that output to the invocation of the second
command. It then returns the output of the second command, and
causes the getExitStatus function to wait for and
check the exit statuses from both commands.
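The pipeline instance can be sketched in isolation as follows. To keep the block compilable on its own, it repeats a simplified CommandLike class whose invoke omits the list of FDs to close; that simplification, the helper bothStatuses, and the PureCommand instance are all assumptions made for illustration.

```haskell
import System.Exit (ExitCode(..))
import System.Posix.Process (ProcessStatus(..))

data CommandResult = CommandResult
  { cmdOutput     :: IO String
  , getExitStatus :: IO ProcessStatus
  }

-- Simplified class: the FD bookkeeping argument is left out here.
class CommandLike a where
  invoke :: a -> String -> IO CommandResult

-- A pipe from a source command to a destination command.
data PipeCommand src dest = PipeCommand src dest

-- Convenience operator for building pipes.
(-|-) :: (CommandLike a, CommandLike b) => a -> b -> PipeCommand a b
(-|-) = PipeCommand

instance (CommandLike a, CommandLike b) => CommandLike (PipeCommand a b) where
  invoke (PipeCommand src dest) input = do
    res1    <- invoke src input          -- start the first command
    output1 <- cmdOutput res1            -- its output feeds the second
    res2    <- invoke dest output1
    return (CommandResult (cmdOutput res2) (bothStatuses res1 res2))

-- Wait for both commands; report the first failure, if any.
bothStatuses :: CommandResult -> CommandResult -> IO ProcessStatus
bothStatuses r1 r2 = do
  s1 <- getExitStatus r1
  s2 <- getExitStatus r2
  return (case s1 of
            Exited ExitSuccess -> s2
            failure            -> failure)

-- Hypothetical pure-function command, used to try the pipe out.
newtype PureCommand = PureCommand (String -> String)

instance CommandLike PureCommand where
  invoke (PureCommand f) input =
    return (CommandResult (return (f input)) (return (Exited ExitSuccess)))
```

Because a PipeCommand is itself CommandLike, pipes nest: a pipe can appear on either side of -|-.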

We finish by defining runIO. This function
establishes the list of FDs that must be closed in the child, starts
the command, displays its output, and checks its exit status.

Better Piping

Our previous example solved the basic need of letting us set up
shell-like pipes. There are some other features that it would be
nice to have, though:

Supporting more shell-like
syntax

Letting people pipe data into external programs or
regular Haskell functions, freely mixing and matching the
two

Returning the final output and exit code in a way
that Haskell programs can readily use

Fortunately, we already have most of the pieces to support this in
place. We need only add a few more instances of
CommandLike to support this, and a few more
functions similar to runIO. Here is a revised
example that implements all of these features:

The revision makes the following changes:

A new CommandLike instance for
String that uses the shell to evaluate and invoke the string.

New CommandLike instances for
String -> IO String and various other types
that are implemented in terms of this one. These process Haskell
functions as commands.

A new RunResult typeclass that
defines a function run that returns
information about the command in many different ways. See the
comments in the source for more information.
runIO is now just an alias for one particular
RunResult instance.

Final Words on Pipes

We have developed a sophisticated system here. We warned you earlier
that POSIX can be complex. One other thing we need to highlight: you
must always make sure to evaluate the String returned by these
functions before you attempt to evaluate the exit code of the child
process. The child process will often not exit until it can write all of
its data, and if you do this in the wrong order, your program will
hang.

In this chapter, we have developed, from the ground up, a simplified
version of HSH. If you wish to use these shell-like capabilities in
your own programs, we recommend HSH instead of the example developed
here due to optimizations present in HSH. HSH also comes with a
larger set of utility functions and more capabilities, but the source
code behind the library is much larger and more complex. Some of the
utility functions presented here, in fact, were copied verbatim from
HSH. HSH is available from http://software.complete.org/hsh.

[43] There is also a function
system that takes only a single string and
passes it through the shell to parse. We recommend using
rawSystem instead, because the shell attaches
special meaning to certain characters, which could lead to security
issues or unexpected behavior.

[44] Some will note that UTC defines leap seconds
at irregular intervals. The POSIX standard, which Haskell
follows, states that every day is exactly 86,400 seconds in
length in its representation, so you need not be concerned about
leap seconds when performing routine calculations. The exact
manner of handling leap seconds is system-dependent and complex,
though usually it can be explained as having a "long second".
This nuance is generally only of interest when performing precise
subsecond calculations.

[47] This extension is well-supported in the
Haskell community; Hugs users can access the same thing with
hugs -98 +o.

[48] The
Haskell library HSH provides a similar API to that presented
here, but uses a more efficient (and much more complex) mechanism
of connecting pipes directly between external processes without
the data needing to pass through Haskell. This is the same
approach that the shell takes, and reduces the CPU load of
handling piping.