-- Output of "du -sb" -- which is our input -- consists of many lines,

+

−

-- each of which describes single directory

+

−

parseInput =

+

−

do dirs <- many dirAndSize

+

−

eof

+

−

return dirs

+

−

+

−

-- Information about single direcory is a size (number), some spaces,

+

−

-- then directory name, which extends till newline

+

−

data Dir = Dir Int String deriving Show

+

−

dirAndSize =

+

−

do size <- many1 digit

+

−

spaces

+

−

dir_name <- anyChar `manyTill` newline

+

−

return $ Dir (read size) dir_name

+

−

+

−

main = do input <- getContents

+

−

putStrLn ("DEBUG: got input " ++ input)

+

−

let dirs = case parse parseInput "stdin" input of

+

−

Left err -> error $ "Input:\n" ++ show input ++

+

−

"\nError:\n" ++ show err

+

−

Right result -> result

+

−

putStrLn "DEBUG: parsed:"; print dirs

+

−

-- compute solution and print it

+

−

</haskell>

+


<haskell>
-- Taken from 'cd-fit-3-1.hs'
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show
</haskell>

----
'''Exercise:''' examine types of "Dir", "dir_size" and "dir_name"
----


----
'''Exercises:'''
* Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Have you had to define some new instances of some classes? How did you do that?
* <tt>[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ]</tt> is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what this sample does: <tt>let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]</tt>? Could you write (with the help of a decent tutorial) a de-sugared version of it? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)
* Notice that in order to code the quite complex implementation of <tt>precomputeDisksFor</tt> we split it up into several smaller pieces and put them as '''local bindings''' inside a '''let''' clause.
* Notice that we use '''pattern matching''' both to define <tt>bestKnap</tt> on a case-by-case basis and to "peer into" ('''de-construct''') <tt>DirPack</tt> in the <tt>let (DirPack s ds)=precomp!!(limit - dir_size d)</tt> line

----

Before we move any further, let's do a small cosmetic change to our code. Right now our solution uses 'Int' to store directory size. In Haskell, 'Int' is a platform-dependent integer, which imposes certain limitations on the values of this type. An attempt to compute a value of type 'Int' that exceeds the bounds will result in an overflow error. Standard Haskell libraries have a special typeclass <hask>Bounded</hask>, which allows us to define and examine such bounds:

 Prelude> :i Bounded
 class Bounded a where
   minBound :: a
   maxBound :: a
   -- skip --
 instance Bounded Int -- Imported from GHC.Enum

We see that 'Int' is indeed bounded. Let's examine the bounds:

 Prelude> minBound :: Int
 -2147483648
 Prelude> maxBound :: Int
 2147483647
 Prelude>

Those of you who are C-literate will spot at once that in this case the 'Int' is a so-called "signed 32-bit integer", which means that we would run into errors trying to operate on directories/directory packs bigger than 2 Gb.

Type errors went away, but a careful reader will spot at once that when the expression <hask>(limit - dir_size d)</hask> exceeds the bounds of <hask>Int</hask>, overflow will occur, and we will not access the correct list element. Don't worry, we will deal with this in a short while.

Now, let's code the QuickCheck test for this function along the lines of the test for <tt>greedy_pack</tt>:

<haskell>
-- Taken from 'cd-fit-4-2.hs'
prop_dynamic_pack_is_fixpoint ds =
  let pack = dynamic_pack ds
</haskell>

Now, let's try to run it (DON'T PANIC and save all your work in other applications first!):

 *Main> quickCheck prop_dynamic_pack_is_fixpoint

Now, you took my advice seriously, didn't you? And you did have your '''Ctrl-C''' handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by the <tt>ghci</tt> process, which you hopefully interrupted soon enough by pressing '''Ctrl-C'''.

What happened? Who ate all the memory? How do we debug this problem? GHC comes with profiling abilities, but we can't use them here - they produce a report after the program terminates, and ours doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.

Let's see. Since we called <tt>dynamic_pack</tt> and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore its behavior.

Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, <tt>greedy_pack</tt> munches them without significant memory consumption), the size of the input most probably is not the issue. However, <tt>dynamic_pack_is_fixpoint</tt> is building quite a huge list internally (via <tt>precomputeDisksFor</tt>). Could this be a problem?

Let's turn the timing/memory stats on (":set +s" at the ghci prompt) and try to peek into various elements of the list returned by <tt>precomputeDisksFor</tt>:

1 Preface: DON'T PANIC!

Recent experiences from a few of my fellow C++/Java programmers
indicate that they read various Haskell tutorials with "exponential
speedup" (think about how TCP/IP session starts up). They start slow
and cautious, but when they see that the first 3-5 pages do not
contain "anything interesting" in terms of code and examples, they
begin skipping paragraphs, then chapters, then whole pages, only to
slow down - often to a complete halt - somewhere on page 50, finding
themselves in the thick of concepts like "type classes", "type
constructors", "monadic IO", at which point they usually panic, think
of a perfectly rational excuse not to read further anymore, and
happily forget this sad and scary encounter with Haskell (as human
beings usually tend to forget sad and scary things).

This text intends to introduce the reader to the practical aspects of Haskell
from the very beginning (plans for the first chapters include: I/O, darcs,
Parsec, QuickCheck, profiling and debugging, to mention a few). The reader
is expected to know (where to find) at least the basics of Haskell: how to run
"hugs" or "ghci", that layout is 2-dimensional, etc. Other than that, we do
not plan to take radical leaps, and will go one step at a time in order not to
lose the reader along the way. So DON'T PANIC, take your towel with you and
read along.

In case you've skipped over the previous paragraph, I would like
to stress once again that Haskell is sensitive to indentation and
spacing, so pay attention to that during cut-n-pastes or manual
alignment of code in a text editor with proportional fonts.

Oh, almost forgot: the author is very interested in ANY feedback. Drop him
a line or a word (see Adept for contact info), submit patches to the
tutorial via darcs (the repository is at
"http://adept.linux.kiev.ua:8080/repos/hhgtth"), or submit them directly
to this Wiki.

2 Chapter 1: Ubiquitous "Hello world!" and other ways to do IO in Haskell

Each chapter will be dedicated to one small real-life task which we will
complete from the ground up.

So here is the task for this chapter: in order to free up space on
your hard drive for all the Haskell code you are going to write in the
nearest future, you are going to archive some of the old and dusty
information on CDs and DVDs. While CD (or DVD) burning itself is easy
these days, it usually takes some (or quite a lot of) time to decide
how to put several GB of digital photos on CD-Rs, when directories
with images range from 10 to 300 Mb's in size, and you don't want to
burn half-full (or half-empty) CD-Rs.

So, the task is to write a program which will help us put a given
collection of directories on the minimum possible amount of media,
while packing the media as tightly as possible. Let's name this program
"cd-fit".

Oh. Wait. Let's do the usual "hello world" thing, before we forget about it,
and then move on to more interesting things:

-- Taken from 'hello.hs'
-- From now on, a comment at the beginning of the code snippet
-- will specify the file which contains the full program from
-- which the snippet is taken. You can get the code from the darcs
-- repository "http://adept.linux.kiev.ua:8080/repos/hhgtth" by issuing
-- command "darcs get http://adept.linux.kiev.ua:8080/repos/hhgtth"
module Main where

main = putStrLn "Hello world!"

Run it:

$ runhaskell ./hello.hs
Hello world!

OK, we've done it. Move along now, nothing interesting here :)

Any serious development must be done with the help of a version control
system, and we will not make an exception. We will use the modern
distributed version control system "darcs". "Modern" means that it is
written in Haskell, "distributed" means that each working copy is
a repository in itself.

First, let's create an empty directory for all our code, and invoke
"darcs init" there, which will create subdirectory "_darcs" to store
all version-control-related stuff there.

Fire up your favorite editor and create a new file called "cd-fit.hs"
in our working directory. Now let's think for a moment about how our
program will operate and express it in pseudocode:

main = Read list of directories and their sizes.
       Decide how to fit them on CD-Rs.
       Print solution.

Sounds reasonable? I thought so.

Let's simplify our life a little and assume for now that we will
compute directory sizes somewhere outside our program (for example,
with "du -sb *") and read this information from stdin.
Now let me convert all this to Haskell:

-- Taken from 'cd-fit-1-1.hs'
module Main where

main = do input <- getContents
          putStrLn ("DEBUG: got input " ++ input)
          -- compute solution and print it

Not really working, but pretty close to plain English, eh? Let's stop
for a moment and look more closely at what's written here, line by line.

Let's begin from the top:

-- Taken from 'cd-fit-1-1.hs'
input <- getContents

This is an example of the Haskell syntax for doing IO (namely, input). This
line is an instruction to read all the information available from the stdin,
return it as a single string, and bind it to the symbol "input", so we can
process this string any way we want.

How did I know that? Did I memorize all the functions by heart? Of course not!
Each function has a type which, along with the function's name, usually tells
a lot about what the function will do.

Let's fire up an interactive Haskell environment and examine this function
up close:

Prelude> :t getContents
getContents :: IO String

We see that "getContents" is a function without arguments that will return
an "IO String". The prefix "IO" means that this is an IO action. It will
return a String when evaluated. The action will be evaluated as soon as we
use "<-" to bind its result to some symbol.

Note that "<-" is not a fancy way to assign value to variable. It is a way to
evaluate (execute) IO actions, in other words - to actually do some I/O and
return its result (if any).

We can choose not to evaluate the action obtained from "getContents", but rather carry it around a bit and evaluate later:

let x = getContents
-- 300 lines of code here
input <- x

So, as you see, IO actions can act like ordinary values. Suppose that we
have built a list of IO actions and have found a way to execute them one by
one. This would be a way to simulate imperative programming with its notion
of "order of execution".

Haskell allows you to do better than that.

The standard language library (named "Prelude", by the way) provides
us with lots of functions that return useful primitive IO actions. In
order to combine them to produce even more complex actions, we use "do":
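The original combined action is not shown in this extract; here is a minimal sketch of what such a "do" combination looks like. The actions "someAction" and "someOtherAction" are hypothetical placeholder names, defined here only so the sketch is self-contained:

```haskell
module Main where

-- Placeholder actions (hypothetical names, defined only to make this
-- sketch self-contained)
someAction :: IO ()
someAction = putStrLn "doing something"

someOtherAction :: IO String
someOtherAction = return "some value"

-- "do" glues primitive actions into one bigger IO action, bound to "c"
c :: IO ()
c = do someAction
       x <- someOtherAction
       putStrLn ("got: " ++ x)

main :: IO ()
main = c
```

Note that "c" is itself an ordinary IO action, so it can be combined further, exactly as the text below describes.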

When will all this actually be executed? Answer: as soon as we evaluate "c"
using the "<-" (if it returns result, as "getContents" does) or just
by using it as a function name (if it does not return a result, as "print"
does):

process = do putStrLn "Will do some processing"
             c
             putStrLn "Done"

Notice that we took a bunch of functions ("someAction", "someOtherAction",
"print", "putStrLn") and, using "do", created from them a new function, which
we bound to the symbol "c". Now we could use "c" as a building block to produce
an even more complex function, "process", and we could carry this on and on.
Eventually, some of the functions will be mentioned in the code of the function
"main", which is bound to the ultimate, topmost IO action of any Haskell program.

When will the "main" be executed/evaluated/forced? As soon as we run the
program. Read this twice and try to comprehend:

The execution of a Haskell program is an evaluation of the symbol "main" to
which we have bound an IO action. Via evaluation we obtain the result of that
action.

Readers familiar with advanced C++ or Java programming and that arcane body of
knowledge named "OOP Design Patterns" might note that "build actions from
actions" and "evaluate actions to get result" is essentially a "Command
pattern" and "Composition pattern" combined. Good news: in Haskell you get them
for all your IO, and get them for free :)

Notice how we carefully indent lines so that source looks neat?
Actually, Haskell code has to be aligned this way, or it will not
compile. If you use tabulation to indent your sources, take into
account that Haskell compilers assume that tabstop is 8 characters
wide.

Often people complain that it is very difficult to write Haskell
because it requires them to align code. Actually, this is not true. If
you align your code, compiler will guess the beginnings and endings of
syntactic blocks. However, if you don't want to indent your code, you
could explicitly specify end of each and every expression and use
arbitrary layout as in this example:
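The example itself is missing from this extract; a small sketch of what explicit layout looks like (the program is the familiar "hello world", the braces and semicolons are the point):

```haskell
module Main where

-- Explicit layout: braces and semicolons replace significant
-- indentation, so the code may be laid out arbitrarily
main :: IO ()
main = do { putStrLn "Hello";
  putStrLn "world!" }
```

Once the braces and semicolons are spelled out, the compiler no longer infers block boundaries from indentation.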

We see that "main" is indeed an IO action which will return nothing
when evaluated. When combining actions with "do", the type of the
result will be the type of the last action, and "putStrLn something" has type
"IO ()":

Oh, by the way: have you noticed that we actually compiled our first
Haskell program in order to examine "main"? :)

Let's celebrate that by putting it under version control: execute
"darcs add cd-fit.hs" and "darcs record", answer "y" to all questions
and provide a commit comment "Skeleton of cd-fit.hs".

Let's try to run it:

$ echo "foo" | runhaskell cd-fit.hs
DEBUG: got input foo

Exercises:

Try to write a program that takes your name from the stdin and greets you (keywords: getLine, putStrLn);

Try to write a program that asks for your name, reads it, greets you, asks for your favorite color, and prints it back (keywords: getLine, putStrLn).
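One possible solution to the first exercise (among many) could look like this:

```haskell
module Main where

-- Read a name from stdin and greet the user
main :: IO ()
main = do name <- getLine
          putStrLn ("Hello, " ++ name ++ "!")
```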

3 Chapter 2: Parsing the input

OK, now that we have proper understanding of the powers of Haskell IO
(and are awed by them, I hope), let's forget about IO and actually do
some useful work.

As you remember, we set forth to pack some CD-Rs as tightly as
possible with data scattered in several input directories. We assume
that "du -sb" will compute the sizes of input directories and output
something like this (the names and sizes here are purely illustrative):

65572      /home/user/directory1
1048576    /home/user/directory2

Our next task is to parse that input into some suitable internal
representation.

For that we will use a powerful library of parsing combinators named
"Parsec", which ships with most Haskell implementations.

Much like the IO facilities we have seen in the first chapter, this
library provides a set of basic parsers and means to combine them into
more complex parsing constructs.

Unlike other tools in this area (lex/yacc or JavaCC, to name a few),
Parsec parsers do not require a separate preprocessing stage. Since in
Haskell we can return a function as a result of another function and thus
construct functions "out of thin air", there is no need for a separate
syntax for parser description. But enough advertisements, let's actually
do some parsing:
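Here is the parsing code this chapter walks through (the parseInput and dirAndSize parsers and the Dir datatype), assembled into a complete module so it can be loaded on its own. The import is the classic Parsec module; the "main" smoke test at the bottom is an addition of mine, not part of the tutorial's program:

```haskell
module Main where

import Text.ParserCombinators.Parsec

-- Output of "du -sb" -- which is our input -- consists of many lines,
-- each of which describes single directory
parseInput =
  do dirs <- many dirAndSize
     eof
     return dirs

-- Information about single directory is a size (number), some spaces,
-- then directory name, which extends till newline
data Dir = Dir Int String deriving Show

dirAndSize =
  do size <- many1 digit
     spaces
     dir_name <- anyChar `manyTill` newline
     return $ Dir (read size) dir_name

-- Smoke test: parse two fake entries and print the result
main :: IO ()
main = print (parse parseInput "test" "100 dir1\n200 dir2\n")
```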

Just add those lines to "cd-fit.hs", between the declaration of
the Main module and the definition of main.

Here we see quite a lot of new
things, and several that we know already.
First of all, note the familiar "do" construct, which, as we know, is
used to combine IO actions to produce new IO actions. Here we use it
to combine "parsing" actions into new "parsing" actions. Does this
mean that "parsing" implies "doing IO"? Not at all. Thing is, I must
admit that I lied to you - "do" is used not only to combine IO
actions. "Do" is used to combine any kind of so-called monadic
actions or monadic values together.

Think about a monad as a "design pattern" in the functional world.
A monad is a way to hide from the user (programmer) all the machinery
required for complex functionality to operate.

As you might have heard, Haskell has no notion of "assignment",
"mutable state" or "variables", and is a "pure functional language",
which means that every function called with the same input parameters
will return exactly the same result. Meanwhile, "doing IO" requires
hauling around file handles and their states and dealing with IO
errors. "Parsing" requires tracking the position in the input and
dealing with parsing errors.

In both cases the Wise Men Who Wrote Libraries cared for our needs and
hid all the underlying complexities from us, exposing the API of their
libraries (IO and parsing) in the form of "monadic actions" which we
are free to combine as we see fit.

Think of programming with monads as doing remodelling with the
help of a professional remodelling crew. You describe a sequence of
actions on a piece of paper (that's us writing in "do" notation),
and then, when required, that sequence will be evaluated by the
remodelling crew ("in the monad"), which will provide you with the end
result, hiding all the underlying complexity (how to prepare the
paint, which nails to choose, etc.) from you.

Let's use the interactive Haskell environment to decipher all the
instructions we've written for the parsing library. As usual, we'll
go top-down:

Assuming (well, take my word for it) that "GenParser Char st" is our
parsing monad, we can see that "parseInput", when evaluated, will
produce a list of "Dir", and "dirAndSize", when evaluated, will
produce a "Dir". Assuming that "Dir" somehow represents information
about a single directory, that is pretty much what we wanted, isn't it?

Let's see what a "Dir" means. We defined datatype Dir as a record
which holds an Int and a String:

data Dir = Dir Int String deriving Show

We could have written "data Dir = D Int String" instead, which would
define datatype "Dir" with data constructor "D". However, traditionally
the name of the datatype and its constructor are chosen to be the same.

Clause "deriving Show" instructs the compiler to make enough code "behind
the curtains" to make this datatype conform to the interface of
the type class Show. We will explain type classes later, for
now let's just say that this will allow us to "print" instances of
"Dir".

Exercises:

Examine the types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from simple ones.

Compare the types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just more syntactic sugar. Note also that when a function is supplied with fewer arguments than it actually needs, we get not a value, but a new function; this is called partial application.

OK. So, we combined a lot of primitive parsing actions to get ourselves a
parser for the output of "du -sb". How can we actually parse something? The Parsec library supplies us with the function "parse":

At first the type might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise type.

Stop and consider this for a moment. The compiler figured out type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.

OK, back to the code. We can observe that "parse" is a function which,
given a parser, a name of the source file or channel (e.g. "stdin"), and
source data (a String, which is a list of "Char"s, written "[Char]"),
will either produce a parse error or parse us a list of "Dir".

Datatype "Either" is an example of datatype whose constructor has name, different
from the name of the datatype. In fact, "Either" has two constructors:

dataEither a b = Left a | Right b

To understand better what this means, consider the following
example:

*Main> :t Left 'a'
Left 'a' :: Either Char b
*Main> :t Right "aaa"
Right "aaa" :: Either a [Char]
*Main>

You see that "Either" is a union (much like the C/C++ "union") which could
hold value of one of the two distinct types. However, unlike C/C++ "union",
when presented with value of type "Either Int Char" we could immediately see
whether its an Int or a Char - by looking at the constructor which was used to
produce the value. Such datatypes are called "tagged unions", and they are
another power tool in the Haskell toolset.
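Pattern matching on the constructors is how such a tagged value is consumed. A tiny sketch (the helper "describe" is mine, not the tutorial's):

```haskell
module Main where

-- Consume an "Either Int Char" by matching on its tag: the constructor
-- tells us which branch of the tagged union we are holding
describe :: Either Int Char -> String
describe (Left n)  = "an Int: " ++ show n
describe (Right c) = "a Char: " ++ [c]

main :: IO ()
main = do putStrLn (describe (Left 42))
          putStrLn (describe (Right 'x'))
```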

Did you also notice that we provide "parse" with a parser, which is a monadic
value, but receive not a new monadic value, but a parsing result? That is
because "parse" is an evaluator for the "Parser" monad, much like the GHC or Hugs runtime is an evaluator for the IO monad. The function "parse" implements all the monadic machinery: it tracks errors and positions in the input, implements backtracking and lookahead, etc.

Let's extend our "main" function to use "parse", actually parse the input,
and show us the parsed data structures:

If you followed advice to put your code under version control, you
could now use "darcs whatsnew" or "darcs diff -u" to examine your
changes to the previous version. Use "darcs record" to commit them. As
an exercise, first record the changes "outside" of function "main" and
then record the changes in "main". Do "darcs changes" to examine a
list of changes you've recorded so far.

From now on, we can use "dir_size d" to get the size of a directory, and
"dir_name d" to get its name, provided that "d" is of type "Dir".

The Greedy algorithm sorts directories from the biggest down, and tries to put
them on CD one by one, until there is no room for more. We will need to track
which directories we added to CD, so let's add another datatype, and code this
simple packing algorithm:

-- Taken from 'cd-fit-3-1.hs'
import Data.List (sortBy)

-- DirPack holds a set of directories which are to be stored on single CD.
-- 'pack_size' could be calculated, but we will store it separately to reduce
-- amount of calculation
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show

-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now
media_size = 700*1024*1024

-- Greedy packer tries to add directories one by one to initially empty 'DirPack'
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs
  where
  cmpSize d1 d2 = compare (dir_size d1) (dir_size d2)

-- Helper function, which only adds directory "d" to the pack "p" when new
-- total size does not exceed media_size
maybe_add_dir p d =
  let new_size = pack_size p + dir_size d
      new_dirs = d:(dirs p)
  in if new_size > media_size then p else DirPack new_size new_dirs

I'll highlight the areas which you could explore on your own (using other nice
tutorials out there, of which I especially recommend "Yet Another Haskell Tutorial" by Hal Daume):

We choose to import a single function "sortBy" from the module Data.List, not the whole thing.

Instead of coding a case-by-case recursive definition of "greedy_pack", we go with a higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful functions from the same category are "map", "foldr", "scanl" and "scanr". Look them up!

To sort a list of "Dir" by size only, we use a custom sort function and a parametrized sort - "sortBy". This sort of setup, where the user may provide a custom "modifier" for a generic library function, is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".

To code the quite complex function "maybe_add_dir", we introduced several local definitions in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.

Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"

In order to actually use our greedy packer we must call it from our "main"
function, so let's add a few lines:

Now it is time to test our creation. We could do it by actually running it in
the wild like this:

$ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs

This will prove that our code seems to be working. At least, this once. How
about establishing with a reasonable degree of certainty that our code, in parts
and as a whole, works properly, and doing so in a re-usable manner? In other
words, how about writing some tests?

Java programmers used to JUnit have probably already imagined screens of boilerplate
code and hand-coded method invocations. Never fear, we will not do anything
that silly :)

QuickCheck is a tool to do automated testing of your functions using
(semi)random input data. In the spirit of "100b of code examples is worth 1kb of
praise" let's show the code for testing the following property: an attempt to pack the directories returned by "greedy_pack" should produce a "DirPack" of exactly the same size:

We've just seen our "greedy_pack" run on 100 completely (well, almost
completely) random lists of "Dir"s, and it seems that the property indeed holds.

Let's dissect the code. The most intriguing part is "instance Arbitrary Dir
where", which declares that "Dir" is an instance of the typeclass
"Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a
bit.

What is a typeclass? A typeclass is the Haskell way of dealing with the
following situation: suppose that you are writing a library of useful
functions and you don't know in advance how exactly they will be used, so you
want to make them generic. On one hand, you don't want to restrict your
users to a certain type (e.g. String). On the other hand, you want to enforce
the convention that arguments to your functions must satisfy a certain set of
constraints. That is where typeclasses come in handy.

Think of typeclass as a contract (or "interface", in Java terms) that
your type must fulfill in order to be admitted as an argument to certain
functions.
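As a tiny illustration of the "contract" idea, here is a made-up typeclass with one instance (all names below are invented for the example):

```haskell
-- A hypothetical "contract": to be Describable, a type must supply `describe`.
class Describable a where
  describe :: a -> String

data Color = Red | Green

instance Describable Color where
  describe Red   = "red"
  describe Green = "green"

-- This function works for *any* type fulfilling the contract:
announce :: Describable a => a -> String
announce x = "This is " ++ describe x

main :: IO ()
main = putStrLn (announce Red)
```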

It could be read this way: "Any type (let's name it 'a') can become a member of the class Arbitrary as soon as we define two functions for it, "arbitrary" and "coarbitrary", with the signatures shown. For the types Dir, Bool, Double, Float, Int and Integer such definitions are provided, so all those types are instances of the class Arbitrary".

Now, if you write a function which operates on its arguments solely by means
of "arbitrary" and "coarbitrary", you can be sure that this function will work
on any type which is an instance of "Arbitrary"!

Let's say it again. Someone (maybe even you) writes code (an API or library)
which requires that input values implement a certain interface, described
in terms of functions. Once you show how your type implements this
interface, you are free to use the API or library.

Consider the function "sort" from standard library:

*Main> :t Data.List.sort
Data.List.sort :: (Ord a) => [a] -> [a]

We see that it sorts lists of any values which are instance of typeclass
"Ord". Let's examine that class:

We see a couple of interesting things: first, there is an additional
requirement listed: in order to be an instance of "Ord", a type must first be an
instance of the typeclass "Eq". Then, we see that there is an awful lot of
functions to define in order to be an instance of "Ord". Wait a second, isn't
it silly to define both (<) and (>=) when one could be expressed via the other?

Right you are! Usually a typeclass contains several "default" implementations
for its functions, when it is possible to express them through each other (as
it is with "Ord"). In this case it is possible to supply only a minimal
definition (which in the case of "Ord" consists of just "compare" or "(<=)") and
the others will be derived automatically. If you supply fewer functions than
the minimal implementation requires, the compiler/interpreter will say so and
explain which functions you still have to define.
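For instance, supplying just "compare" is a complete minimal definition of "Ord"; the type and values below are invented for the demonstration:

```haskell
import Data.List (sort)

data Size = Small | Medium | Large deriving (Show, Eq)

-- Supplying only `compare` (the minimal definition) is enough;
-- (<), (>=), max, min and friends are derived from it automatically.
instance Ord Size where
  compare Small  Small  = EQ
  compare Small  _      = LT
  compare _      Small  = GT
  compare Medium Medium = EQ
  compare Medium Large  = LT
  compare Large  Medium = GT
  compare Large  Large  = EQ

main :: IO ()
main = do print (sort [Large, Small, Medium])
          print (max Small Large)
```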

Once again, we see that a lot of types are already instances of typeclass Ord, and thus we are able to sort them.

See that "deriving" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bound, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of
that type!

Side note for Java programmers: just imagine java compiler which derives code
for "implements Storable" for you...

Side note for C++ programmers: just imagine that deep copy constructors are
being written for you by compiler....

Exercises:

Examine typeclasses Eq and Show

Examine types of (==) and "print"

Try to make "Dir" an instance of "Eq"

OK, back to our tests. So, what did we have to do in order to make "Dir" an
instance of "Arbitrary"? The minimal definition consists of "arbitrary". Let's
examine it up close:

*Main> :t arbitrary
arbitrary :: (Arbitrary a) => Gen a

See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser
a" which we've seen already. This is yet another example of action-returning
function, which could be used inside "do"-notation. (You might ask yourself,
wouldn't it be useful to generalize that convenient concept of actions and
"do"? Of course! It is already done, the concept is called "Monad" and we will talk about it in Chapter 400 :) )

Since 'a' here is a type variable which is an instance of "Arbitrary", we can substitute "Dir" for it. So, how can we make and return an action of type "Gen Dir"?

We have used the library-provided functions "choose" and "elements" to build up
"gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my
word on that; find a way to check the types of "gen_name" and "gen_size"). Since
"Int" and "String" are the components of "Dir", we surely must be able to use "Gen
Int" and "Gen String" to build "Gen Dir". But where is the "do" block for
that? There is none, and there is only a single call to "liftM2".
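Note that "liftM2" is not specific to "Gen": it works in any monad. With the list monad standing in for "Gen" (the sizes and names below are made up), you can see how it combines two "generators" of components into a generator of "Dir"s, with no "do" block in sight:

```haskell
import Control.Monad (liftM2)

data Dir = Dir Int String deriving (Show, Eq)

sizes :: [Int]
sizes = [10, 100, 1400]

names :: [String]
names = ["f", "bar"]

-- liftM2 Dir draws one value from each "generator" and applies the Dir
-- constructor; in the list monad that means every combination is produced:
all_dirs :: [Dir]
all_dirs = liftM2 Dir sizes names

main :: IO ()
main = print (length all_dirs)   -- 3 sizes * 2 names = 6 dirs
```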

Hopefully, this will all make sense after you read it for the third
time ;)

Oh, by the way - don't forget to "darcs record" your changes!

5 Chapter 4: REALLY packing the knapsack this time

In this chapter we are going to write another not-so-trivial packing
method, compare packing methods efficiency, and learn something new
about debugging and profiling of the Haskell programs along the way.

It might not be immediately obvious whether our packing algorithm is
effective, and if yes, in which particular way. Is it runtime, memory
consumption, or quality of results? Are there any alternative algorithms,
and how do they compare to each other?

Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.

This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:

<haskell>
-- Taken from 'cd-fit-4-1.hs'

----------------------------------------------------------------------------------
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem
--
-- Let the `bestDisk x' be the "most tightly packed" disk of total
-- size no more than `x'.
precomputeDisksFor :: [Dir] -> [DirPack]
precomputeDisksFor dirs =
  -- By calculating `bestDisk' for all possible disk sizes, we could
  -- obtain a solution for particular case by simple lookup in our list of
  -- solutions :)
  let precomp = map bestDisk [0..]

      -- How to calculate `bestDisk'? Lets opt for a recursive definition:
      -- Recursion base: best packed disk of size 0 is empty
      bestDisk 0 = DirPack 0 []
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is
      -- computed as follows:
      bestDisk limit =
        -- 1. Take all non-empty dirs that could possibly fit to that disk by itself.
        -- Consider them one by one. Let the size of particular dir be `dir_size d'.
        -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus
        -- producing the disk of size <= limit. Lets do that for all "candidate"
        -- dirs that are not yet on our disk:
        case [ DirPack (dir_size d + s) (d:ds)
             | d <- filter ( (inRange (1,limit)) . dir_size ) dirs
             , dir_size d > 0
             , let (DirPack s ds) = precomp !! (limit - dir_size d)
             , d `notElem` ds
             ] of
          -- We either fail to add any dirs (probably, because all of them too big).
          -- Well, just report that disk must be left empty:
          [] -> DirPack 0 []
          -- Or we produce some alternative packings. Let's choose the best of them all:
          packs -> maximumBy cmpSize packs

      cmpSize a b = compare (pack_size a) (pack_size b)

      in precomp

-- When we precomputed disks of all possible sizes for the given set of dirs, solution to
-- particular problem is simple: just take the solution for the required 'media_size' and
-- that's it!
dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size
</haskell>

Notice that it took almost the same amount of text to describe the algorithm and to implement it. Nice, eh?

Exercises:

Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Have you had to define some new instances of some classes? How did you do that?

[ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ] is called a "list comprehension". This is another example of "syntactic sugar", which can lead to nicely readable code but, when abused, can lead to syntactic caries :) Do you understand what this sample does: let solve x = [ y | x <- [0..], y <- [0..], y == x * x ]? Could you write (with the help of a decent tutorial) a de-sugared version of it? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it.)
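For the first comprehension above, one possible de-sugaring uses concatMap; the concrete list and functions below are placeholders invented for the example:

```haskell
-- hypothetical concrete pieces for the comprehension from the text:
some_list :: [Int]
some_list = [-2, 1, 3]

some_function, other_function :: Int -> Int
some_function  = (* 10)
other_function = (+ 1)

sugared :: [Int]
sugared = [ other_function local_binding
          | x <- some_list, x > 0, let local_binding = some_function x ]

-- one possible de-sugared equivalent: a guard becomes an if,
-- a `let` stays a let, and each kept element becomes a singleton list
desugared :: [Int]
desugared = concatMap (\x -> if x > 0
                             then let local_binding = some_function x
                                  in [other_function local_binding]
                             else [])
                      some_list

main :: IO ()
main = print (sugared == desugared)
```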

Notice that in order to code the quite complex implementation of precomputeDisksFor we split it up into several smaller pieces and put them as local bindings inside the let clause.

Notice that we use pattern matching both to define bestDisk on a case-by-case basis and to "peer into" (de-construct) DirPack in the line let (DirPack s ds) = precomp !! (limit - dir_size d).

Notice how we use function composition to build the complex condition that filters the list of dirs.

Before we move any further, let's make a small cosmetic change to our
code. Right now our solution uses 'Int' to store directory sizes. In
Haskell, 'Int' is a platform-dependent integer, which imposes certain
limitations on the values of this type. An attempt to compute a value
of type 'Int' that exceeds these bounds will result in overflow.

Those of you who are C-literate will spot at once that in this case
'Int' is a so-called "signed 32-bit integer", which means that we
would run into errors trying to operate on directories or directory packs
bigger than 2 GB.

Luckily for us, Haskell has integers of arbitrary precision (limited
only by the amount of available memory). The appropriate type is
called 'Integer':
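The change itself is a one-word edit to the Dir declaration; a sketch:

```haskell
-- The cosmetic change: store directory sizes as arbitrary-precision Integer
data Dir = Dir { dir_size :: Integer, dir_name :: String } deriving Show

-- Sizes beyond the 32-bit range are now handled exactly:
huge :: Dir
huge = Dir (3 * 1024 ^ 3) "/big"   -- a 3 GB directory

main :: IO ()
main = print (dir_size huge)
```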

Now, let's try to run it (DON'T PANIC, and save all your work in other applications first!):

*Main> quickCheck prop_dynamic_pack_is_fixpoint

Now, you took my advice seriously, didn't you? And you did have your Ctrl-C handy, didn't you? Most probably the attempt to run the test resulted in all your memory being consumed by the ghci process, which you hopefully interrupted soon enough by pressing Ctrl-C.

What happened? Who ate all the memory? How do we debug this problem? GHC comes with profiling abilities, but we cannot use them here: they produce a report after the program terminates, and ours doesn't seem to terminate without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.

Let's see. Since we called dynamic_pack and it ate all the memory, let's not do that again. Instead, let's look at what this function does and tweak it a bit to explore its behavior.

Since we already know that the random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, greedy_pack munches them without significant memory consumption), the size of the input is most probably not the issue. However, dynamic_pack_is_fixpoint builds quite a huge list internally (via precomputeDisksFor). Could this be the problem?

Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by precomputeDisksFor:

Examine the column "individual %alloc". As we thought, all memory was
allocated within precomputeDisksFor. However, the amount of
memory allocated (more than 700 MB, according to the "total
alloc" line) seems to be a little too much for our simple task. We will dig
deeper and find out where we are wasting it.

Let's examine memory consumption a little closer via so-called "heap
profiles". Run ./cd-fit +RTS -hb. This produces "biographical
heap profile", which tells us how various parts of the memory were
used during the program run time. Heap profile was saved to
"cd-fit.hp". It is next to impossible to read and comprehend it as is,
so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which
is worth a thousand words. View it with "gv" or "ghostview" or "full
Adobe Acrobat (not Reader)". (This and subsequent pictures are
not attached here).

Notice that most of the graph is taken up by the region marked "VOID".
This means that the memory allocated was never used. Notice that there are
no areas marked "USE", "LAG" or "DRAG". It seems our
program hardly uses any of the allocated memory at all. Wait a
minute! How could that be? Surely it must use something when it packs
those randomly generated directories, which are 10 to 1400 Mb in size,
onto the imaginary disks of 50000 bytes.... Oops. Severe size
mismatch. We should have spotted it earlier, when we were timing
precomputeDisksFor. Scroll back and observe how each run
returned the very same result: an empty directory set.

Our random directories are too big, but the code nevertheless spends time
and memory trying to "pack" them. Obviously,
precomputeDisksFor (which is responsible for 90% of the total
memory consumption and run time) is flawed in some way.

Let's take a closer look at what takes up so much memory. Run
./cd-fit +RTS -h -hbvoid and produce a PostScript picture for
this memory profile. This will give us a detailed breakdown of all
memory whose "biography" shows that it has been "VOID" (unused). My
picture (and I presume yours as well) shows that the VOID memory
consists of "thunks" labeled "precomputeDisksFor/pre...". We can
safely assume that the second word would be "precomp" (You wonder why?
Look again at the code and try to find a function named "pre.*" which is
called from inside precomputeDisksFor).

This means that the memory has been taken by the list generated inside
"precomp". Rumor has it that memory leaks in Haskell are caused by
either too little laziness or too much laziness. It seems that we have
too little laziness here: we evaluate more elements of the list than
we actually need and keep them from being garbage-collected.

Obviously, the whole list generated by "precomp" must be kept in
memory for such lookups, since we can't be sure that some element
could be garbage collected and will not be needed again.

Let's rewrite the code to eliminate the list (incidentally, this will also deal with the possible Int overflow while accessing the "precomp" via (!!) operator):

<haskell>
-- Taken from 'cd-fit-4-4.hs'

-- Let the `bestDisk x' be the "most tightly packed" disk of total
-- size no more than `x'.
-- How to calculate `bestDisk'? Lets opt for a recursive definition:
-- Recursion base: best packed disk of size 0 is empty and best-packed
-- disk for empty list of directories on it is also empty.
bestDisk 0 _  = DirPack 0 []
bestDisk _ [] = DirPack 0 []
-- Recursion step: for size `limit`, bigger than 0, best packed disk is
-- computed as follows:
bestDisk limit dirs =
  -- Take all non-empty dirs that could possibly fit to that disk by itself.
  -- Consider them one by one. Let the size of particular dir be `dir_size d'.
  -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus
  -- producing the disk of size <= limit. Lets do that for all "candidate"
  -- dirs that are not yet on our disk:
  case [ DirPack (dir_size d + s) (d:ds)
       | d <- filter ( (inRange (1,limit)) . dir_size ) dirs
       , dir_size d > 0
       , let (DirPack s ds) = bestDisk (limit - dir_size d) dirs
       , d `notElem` ds
       ] of
    -- We either fail to add any dirs (probably, because all of them too big).
    -- Well, just report that disk must be left empty:
    [] -> DirPack 0 []
    -- Or we produce some alternative packings. Let's choose the best of them all:
    packs -> maximumBy cmpSize packs

cmpSize a b = compare (pack_size a) (pack_size b)

dynamic_pack limit dirs = bestDisk limit dirs
</haskell>

Compile the profiling version of this code and obtain the overall
execution profile (with "+RTS -p"). You'll get something like this:

We achieved a major improvement: memory consumption is reduced by a factor
of 700! Now we can test the code on the "real task": change the
code to run the test for packing the full-sized disk:

main = quickCheck prop_dynamic_pack_is_fixpoint

Compile with profiling and run (with "+RTS -p"). If you are not lucky
and a considerably big test set is randomly generated for your
run, you'll have to wait. And wait even more. And more.

Go make some tea. Drink it. Read some Tolstoy (do you have "War and
Peace" handy?). Chances are that by the time you are done with
Tolstoy, the program will still be running (just take my word for it, don't
check).

If you are lucky, your program will finish fast enough and leave you
with a profile. According to the profile, the program spends 99% of its time
inside bestDisk. Could we speed bestDisk up somehow?

Note that bestDisk performs several simple calculations for
which it must call itself. However, this is done rather inefficiently:
each time we pass to bestDisk exactly the same set of
directories it was originally called with, even if we have already "packed"
some of them. Let's amend this:
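The amended code is not reproduced here; below is a plausible self-contained reconstruction in which the recursive call receives only the directories we have not picked yet (so the "notElem" check becomes unnecessary; the "picks" helper is invented for this sketch):

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

data Dir     = Dir     { dir_size :: Int, dir_name :: String } deriving Show
data DirPack = DirPack { pack_size :: Int, dirs :: [Dir] }     deriving Show

-- Hypothetical amendment: recurse on the *remaining* directories only,
-- so already-packed dirs are never reconsidered.
bestDisk :: Int -> [Dir] -> DirPack
bestDisk 0 _  = DirPack 0 []
bestDisk _ [] = DirPack 0 []
bestDisk limit ds =
  case [ DirPack (dir_size d + s) (d:packed)
       | (d, rest) <- picks ds
       , dir_size d > 0, dir_size d <= limit
       , let DirPack s packed = bestDisk (limit - dir_size d) rest
       ] of
    []    -> DirPack 0 []
    packs -> maximumBy (comparing pack_size) packs
  where
    -- all ways to pick one element, returning it paired with the rest
    picks []     = []
    picks (x:xs) = (x, xs) : [ (y, x:ys) | (y, ys) <- picks xs ]

main :: IO ()
main = print (bestDisk 100 [Dir 60 "a", Dir 50 "b", Dir 40 "c"])
```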

Verify that it is indeed so by running quickCheck for this
test several times. I feel that this concludes our knapsacking
exercises.

Adventurous readers could continue further by implementing so-called
"scaling" for dynamic_pack where we divide all directory
sizes and medium size by the size of the smallest directory to proceed
with smaller numbers (which promises faster runtimes).

We have already mentioned monads quite a few times. They are described in
numerous articles and tutorials (see Chapter 400). It's hard to read a
daily dose of any Haskell mailing list without coming across the word
"monad" a dozen times.

Since we have already made quite a bit of progress with Haskell, it's time we
revisited monads once again. I will let other sources teach you the
theory behind monads, the overall usefulness of the concept, etc.
Instead, I will focus on providing you with examples.

Let's take a part of the real world program which involves XML
processing. We will work with XML tag attributes, which are
essentially named values:

<haskell>
-- Taken from 'chapter5-1.hs'
type Attribute = (Name, AttValue)
</haskell>

'Name' is a plain string, and a value can be either a string or
references (also strings) to other attributes which hold the actual
value (now, this is not a valid XML thing, but let's accept it for
the sake of a nice example). The word "either" suggests
that we use the 'Either' datatype:

Our task is to write a function that looks up the value of an
attribute by its name in the given list of attributes. When an
attribute contains reference(s), we resolve them (looking for the
referenced attributes in the same list) and concatenate their values,
separated by colons. Thus, lookup of the attribute "xmlns" from both
sample sets of attributes should return the same value.

Following the example set by "Data.List.lookup" from the standard libraries, we will call our function "lookupAttr", and it will return "Maybe Value", allowing for lookup errors:

<haskell>
-- Taken from 'chapter5-1.hs'
lookupAttr :: Name -> [Attribute] -> Maybe Value

-- Since we don't have code for 'lookupAttr', but want
-- to compile the code already, we use the function 'undefined' to
-- provide a default, "always-fail-with-runtime-error" function body.
lookupAttr = undefined
</haskell>

Let's try to code "lookupAttr" using "lookup" in a very straightforward way:

<haskell>
-- Taken from 'chapter5-1.hs'
import Data.List

lookupAttr :: Name -> [Attribute] -> Maybe Value
lookupAttr nm attrs =
  -- First, we lookup 'Maybe AttValue' by name and
  -- check whether we are successful:
  case (lookup nm attrs) of
    -- Pass the lookup error through.
    Nothing -> Nothing
    -- If the given name exists, see if it is a value or a reference:
    Just attv ->
      case attv of
        -- It's a value. Return it!
        Left val -> Just val
        -- It's a list of references :(
        -- We have to look them up, accounting for
        -- possible failures.
        -- First, we will perform lookup of all references ...
        Right refs ->
          let vals = [ lookupAttr ref attrs | ref <- refs ]
              -- .. then, we will exclude lookup failures
              wo_failures = filter (/=Nothing) vals
              -- ... find a way to remove annoying 'Just' wrapper
              stripJust (Just v) = v
              -- ... use it to extract all lookup results as strings
              strings = map stripJust wo_failures
              in
              -- ... finally, combine them into single String.
              -- If all lookups failed, we should pass failure to caller.
              case null strings of
                True  -> Nothing
                False -> Just (concat (intersperse ":" strings))
</haskell>

It works, but ... It seems strange that such a boatload of code is
required for quite a simple task. If you examine the code closely,
you'll see that the code bloat is caused by:

the fact that after each step we check whether an error occurred

unwrapping Strings from the "Maybe" and "Either" data constructors and wrapping them back.

At this point C++/Java programmers would say that since we just pass
errors upstream, all those "case"s could be replaced by a single "try
... catch ..." block, and they would be right. Does this mean that
Haskell programmers are reduced to using "case"s, which became
obsolete 10 years ago?

Monads to the rescue! As you can read elsewhere (see section 400),
monads are used in advanced ways to construct computations from other
computations. Just what we need: we want to combine several simple
steps (look up a value, look up a reference, ...) into the function
"lookupAttr" in a way that takes possible failures into account.

Let's start from the code and dissect it afterwards:

<haskell>
-- Taken from 'chapter5-2.hs'
import Control.Monad
import Data.List (intersperse)

lookupAttr' nm attrs =
  do -- First, we lookup 'AttValue' by name
     attv <- lookup nm attrs
     -- See if it is a value or a reference:
     case attv of
       -- It's a value. Return it!
       Left val -> Just val
       -- It's a list of references :(
       -- We have to look them up, accounting for
       -- possible failures.
       -- First, we will perform lookup of all references ...
       Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs
                        -- ... since all failures are already excluded by "monad magic",
                        -- ... and all 'Just's have been removed likewise,
                        -- ... we just combine values into a single String,
                        -- ... and return failure if it is empty.
                        guard (not (null vals))
                        return (concat (intersperse ":" vals))
</haskell>

Exercise: compile the code, and test that "lookupAttr" and "lookupAttr'" really behave in the same way. Try to write a QuickCheck test for that, defining the "instance Arbitrary Name" such that arbitrary names will be taken from the names available in "simple_attrs".

Well, back to the story. Noticed the drastic reduction in code size?
If you drop the comments, the code occupies a mere 7 lines instead of 13:
an almost two-fold reduction. How did we achieve this?

First, notice that we never ever check whether some computation
returns "Nothing" anymore. Yet, try to look up some non-existing
attribute name, and "lookupAttr'" will return "Nothing". How does this
happen? The secret lies in the fact that the type constructor "Maybe"
is a "monad". We use the keyword "do" to indicate that the following
block of code is a sequence of monadic actions, where monadic magic
has to happen when we use '<-', 'return' or move from one action to
another.
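A tiny standalone demonstration of that Maybe "magic" (safeDiv and calc are invented for the example):

```haskell
safeDiv :: Int -> Int -> Maybe Int
safeDiv _ 0 = Nothing
safeDiv a b = Just (a `div` b)

-- Each '<-' extracts a value from a Just; if any step yields Nothing,
-- the rest of the block is skipped and the whole result is Nothing:
calc :: Int -> Int -> Int -> Maybe Int
calc a b c = do x <- safeDiv a b
                y <- safeDiv x c
                return (y + 1)

main :: IO ()
main = do print (calc 100 5 2)   -- Just 11
          print (calc 1 0 2)     -- Nothing
```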

Different monads have different magic. Library code says that the type constructor "Maybe" is such a monad that we can use "<-" to "extract" values from the wrapper "Just" and use "return" to put them back in the form of "Just some_value". When we move from one action in the "do" block to another, a check happens. If the action returned