Pure functional, mutation-free, efficient doubly-linked lists

It is always an interesting challenge to write a
pure functional and efficient implementation of an imperative
algorithm destructively operating on a data
structure. The functional implementation has the significant benefits
of equational reasoning and modularity: we can comprehend
the algorithm without keeping the implicit global state in mind. The
mutation-free, functional realization also has practical benefits:
the ease of adding checkpointing, undo and redo. The absence of
mutations makes the code multi-threading-safe and helps in porting
to distributed or non-shared-memory parallel
architectures. On the other hand, an imperative implementation has the
advantage of optimality: mutating a component in a complex data
structure is a constant-time operation, at least on conventional
architectures. Imperative code makes sharing explicit, and so permits
efficient implementation of cyclic data structures.

We show a simple example of achieving all the benefits of
an imperative data structure -- including sharing and the efficiency of updates
-- in a pure functional program. Our data structure is a doubly-linked,
possibly cyclic list, with the standard operations of adding, deleting and
updating elements; traversing the list in both directions;
iterating over the list, with cycle detection. The code:

uniformly handles both cyclic and terminated lists;

does not rebuild the whole list on updates;

updates the value in the current node in time bound by a small constant;

does not use or mention any monads;

does not use any IORef, STRef, TVars, or any other destructive updates;

permits logging, undoing and redoing of updates, checkpointing;

easily generalizes to two-dimensional meshes.

The algorithm is essentially imperative -- hence supporting the
identity check and the in-place `update' -- but implemented purely
functionally. Although the code relies on many local, type-safe
`heaps', there is emphatically no global heap and no global state.
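As a rough illustration of the flavor of such a representation (a minimal
sketch, not the article's actual code): nodes live in a local, purely
functional `heap' keyed by integers, and prev/next are keys rather than
pointers.

import qualified Data.IntMap as IM

-- a node stores its payload and the keys of its two neighbours
data Node a = Node { val :: a, prev :: Int, next :: Int }

type DList a = IM.IntMap (Node a)      -- a local heap of nodes

-- update the value at a given key without rebuilding the list
update :: Int -> a -> DList a -> DList a
update k x = IM.adjust (\node -> node { val = x }) k

-- a three-node cyclic list: 0 <-> 1 <-> 2 <-> 0
demo :: DList Char
demo = IM.fromList [(0, Node 'a' 2 1), (1, Node 'b' 0 2), (2, Node 'c' 1 0)]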

It is not for nothing that Haskell has been called the best
imperative language. Imperative algorithms are implementable as
they are -- yet genuinely functionally, without resorting to the monadic
sub-language but taking full advantage of clausal definitions,
pattern guards, and laziness.

Total stream processors and their applications to all infinite streams

In the article on seemingly impossible functional programs,
Martín Escardó wrote about decidable checking of the
satisfaction of a total computable predicate on Cantor numerals. The
latter represent infinite bit strings, or all real numbers
within [0,1]. Martín Escardó's technique can tell, in finite
time, if a given total computable predicate is satisfied
over all possible infinite bit strings. Furthermore, for
so-called sparse predicates, Martín Escardó's technique is very
fast.
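For orientation, here is a rendering along the lines of Escardó's classic
formulation (not the article's reformulation), using lazy lists of Booleans
for Cantor numerals:

type Cantor = [Bool]          -- an infinite stream of bits

-- find p returns a stream on which the total predicate p holds,
-- if any such stream exists; forsome and forevery are then decidable
find :: (Cantor -> Bool) -> Cantor
find p = if forsome (\a -> p (False : a))
            then False : find (\a -> p (False : a))
            else True  : find (\a -> p (True  : a))

forsome, forevery :: (Cantor -> Bool) -> Bool
forsome p  = p (find p)
forevery p = not (forsome (not . p))

-- for instance, forsome (\bs -> bs !! 0 && not (bs !! 1)) evaluates to True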

We re-formulate the problem in terms of streams and
depth-limited depth-first search, and thus cast off the mystery of
deciding the satisfiability of a total computable predicate over the
set of all Cantor numerals, which are uncountable.

As an additional contribution, we show how to write
functions over Cantor numerals in a `natural' monadic style so that
those functions become self-partially evaluating. The instantiation of
the functions in an appropriate pure monad gives
us transparent memoization, without any changes to the functions
themselves. The monad in question is pure and involves no reference
cells.

On `dense' functions on numerals (i.e., those that need to examine
most of the bits of their argument, up to a limit), our technique
performs about 9 times faster than the most sophisticated
one by Martín Escardó.

Left fold vs. right fold: theory vs. practice

The difference between theory and practice is the butt of many jokes
and, hence, an indication of a profound difficulty. How can we apply
a theoretical insight -- say, that the left fold can be written in
terms of the right fold -- to the practical question of designing data
structure traversal libraries? Should we take the stand, as one sometimes
comes across, that the left fold, being theoretically redundant, has no place
in a well-designed library? Applying theoretical insights (from moral
and political theory) to messy reality has also worried Kant, among
other people. In his 1793 essay on theory and practice, Kant
emphasized that theoretical rules ``are thought of as principles
possessing a certain generality and, consequently, as being abstracted
from a multitude of conditions that nonetheless necessarily influence
their application.'' Judgment is therefore required to discern when
the abstraction is justified and hence the rule is applicable, and
when it is not. This article will expound this point, reminding us
that the theory that makes the left fold redundant abstracts away
memory consumption and (often significant) constant factors in the
running time. Running out of memory and k-times slowdowns do make us
worry. Therefore, in practice the left fold is not redundant
and should be provided (in a form with early termination) alongside the
right fold in a data traversal library. And yet, the abstract theory
is not at all practically futile: it guides program transformations
and reasoning. These transformations, implemented in GHC,
do make Prelude.foldr subsume the left fold even in practice.

The right fold, foldr,
is one of the most common and fundamental patterns of processing a
data structure. In fact, a data structure can be faithfully
represented by -- is isomorphic to -- its right fold. For example, the
entire list-processing library, from head and tail to drop, take and even zipWith, can be written entirely in terms of foldr.
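For reference, the textbook definitions assumed in what follows
(foldl_via_foldr is spelled out as it can be reconstructed from the thunks
quoted later in the text):

import Prelude hiding (foldr, foldl)

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f z []     = z
foldr f z (x:xs) = f x (foldr f z xs)

foldl :: (b -> a -> b) -> b -> [a] -> b
foldl f z []     = z
foldl f z (x:xs) = foldl f (f z x) xs

-- the left fold written in terms of the right fold
foldl_via_foldr :: (b -> a -> b) -> b -> [a] -> b
foldl_via_foldr f z l = foldr (\e a z -> a (f z e)) id l z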

Therefore,
providing both folds within the same library seems redundant.
Let us, however, take a second look at foldl_via_foldr: the foldr in that expression is a so-called `second-order fold',
building a closure which is then applied to z. To appreciate
the costs, let us consider the evaluation of a simple example foldl_via_foldr f z [e1,e2,e3] for some f, z and the three list
elements, in call-by-name/need and call-by-value.

In call-by-name (call-by-need is the same), our example evaluates
as follows:
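(writing g for the folding function \e a z -> a (f z e); a reconstructed outline)

foldl_via_foldr f z [e1,e2,e3]
 = foldr g id [e1,e2,e3] z
 = foldr g id [e2,e3] (f z e1)
 = foldr g id [e3]    (f (f z e1) e2)
 = foldr g id []      (f (f (f z e1) e2) e3)
 = id (f (f (f z e1) e2) e3)
 = f (f (f z e1) e2) e3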

The result is a thunk whose structure mirrors the original list [e1,e2,e3]. We have
essentially duplicated the list -- worse than duplicated since a thunk
(closure) typically takes more space than a cons cell. We are not done
yet: we have to reduce the thunk to the weak-head-normal form. If f is
strict in its left argument (such as addition), we have to reduce (f z e1), then (f (f z e1) e2), etc. -- effectively traversing the
original list once again, this time to compute the result of f
reductions.

For comparison, we evaluate the same example with the native left fold -- actually, with its strict version foldl':
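For reference, the strict left fold and a sketch of its reduction on our example:

foldl' :: (b -> a -> b) -> b -> [a] -> b
foldl' f z []     = z
foldl' f z (x:xs) = let z' = f z x in z' `seq` foldl' f z' xs

foldl' f z [e1,e2,e3]
 = foldl' f v1 [e2,e3]         -- v1 = f z e1, forced before the call
 = foldl' f v2 [e3]            -- v2 = f v1 e2
 = foldl' f v3 []              -- v3 = f v2 e3
 = v3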

No closures or thunks are constructed; we first compute f z e1 as v1, then f v1 e2, etc. -- reducing the list element-by-element in
constant space. In contrast, foldl_via_foldr had built two sets of
thunks (whose size is proportional to the length of the list) and
effectively traversed the list twice. One can express foldl'
via foldr by adding a strictness annotation to foldl_via_foldr.
The applications like f z e1 are then reduced immediately; the thunks foldr (\e a z -> a (f z e)) id [e2,e3], etc. still have to be
built. The benchmarks in the accompanying code bear the conclusions
out. The strict left fold foldl' is the fastest and the most
space-economical way to reduce a list with a strict function: when
interpreting the Haskell code and when compiling with all
optimizations. In the latter case, foldl' reductions are more than
an order of magnitude faster than foldr and foldl_via_foldr
reductions, and three times faster than foldl'_via_foldr. Furthermore, summing an integer list with foldl'
runs truly in constant space, with no allocations. In contrast,
emulating foldl or foldl' via foldr looks, from the amount of
allocated memory, like duplicating the original list. (See however
below for Prelude.foldr.) The case for providing foldl' as
a primitive in a data traversal library is compelling.

The case for the left fold is even more compelling in strict
languages. Our example of reducing a three-element list evaluates under
call-by-value as follows:
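(again with g abbreviating \e a z -> a (f z e); a reconstructed outline)

foldl_via_foldr f z [e1,e2,e3]
 = foldr g id [e1,e2,e3] z
 = g e1 (foldr g id [e2,e3]) z           -- the argument of g e1 must be
 = g e1 (g e2 (foldr g id [e3])) z       --   evaluated before the call,
 = g e1 (g e2 (g e3 (foldr g id []))) z  --   pushing a frame per element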

At this point the stack looks quite like the original list itself,
with list elements in closures rather than cons cells. We have
in effect copied the whole list onto stack. For long lists, stack overflow
is imminent. Now we have to unwind the stack:
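(foldr g id [] has returned id; the pending applications now fire)

   g e1 (g e2 (g e3 id)) z
 = g e1 (g e2 f3) z                 -- f3  = \z -> id (f z e3)
 = g e1 f23 z                       -- f23 = \z -> f3 (f z e2)
 = (\a z -> a (f z e1)) f23 z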

Unwinding the stack has built the closures f3 and f23. The list got
copied again, from the stack to the heap, as closures. The application (\a z -> a (f z e1)) can proceed, and the list is traversed once more,
this time computing the f reductions:
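(continuing the reconstruction)

   (\a z -> a (f z e1)) f23 z
 = f23 (f z e1)                     -- f z e1 is evaluated to v1
 = f3 (f v1 e2)                     -- evaluated to v2
 = id (f v2 e3)                     -- evaluated to v3
 = v3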

The total cost: copying the list onto the stack and then to the heap,
requiring O(N) stack and heap space, and effectively three
traversals. On the other hand, the primitive left fold accomplishes
the job in constant space, with a single pass through the list. Thus
emulating left fold via right fold is a bad idea in practice.

An interesting variation is folding with a non-strict function: for example,
indexing of a list.
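Presumably along these lines (a sketch; the article's index and index_foldr
may differ in details):

index :: [a] -> Int -> Maybe a
index []     _ = Nothing
index (x:_)  0 = Just x
index (_:xs) n = index xs (n-1)

-- the same via foldr: the folding function is non-strict in the
-- rest-of-the-fold argument r, so the traversal can stop early
index_foldr :: [a] -> Int -> Maybe a
index_foldr xs n = foldr (\x r k -> if k == 0 then Just x else r (k-1))
                         (const Nothing) xs n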

An argument similar to the one for the left fold may convince us that
writing index via foldr is not good in practice, due to the cost
of closure construction and extra traversal. Writing index in terms
of foldl is not a good idea either, since the left fold always
traverses the list to the very end. Query functions like index
should stop scanning the data structure as soon as the desired element
is found. Therefore, we need something like foldl with early
termination -- which is what iteratees are. In strict languages, whose
folds are generally effectful, that is perhaps the only choice.
However, in Haskell there is another, surprising choice.

We have seen that foldl_via_foldr is not practically viable, and
hence foldl' had better be provided by a data-processing library.
And yet the theory that neglects memory consumption is not
practically futile. In GHC, compiling foldl_via_foldr using
specifically Prelude's foldr and full optimizations produces the
same code as the hand-written foldl with the explicit recursion.
GHC, implementing theoretically justified transformations, can
afford to drop foldl and foldl' from the standard library without
any loss of performance.

However, we have to re-write foldr as follows, which is how
it is programmed in the standard Prelude:
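Namely (a reconstruction of what the article calls foldr_Prelude, in the
style of GHC's actual Prelude code):

foldr_Prelude :: (a -> b -> b) -> b -> [a] -> b
foldr_Prelude f z = go
  where
  go []     = z
  go (x:xs) = f x (go xs)

The arguments f and z are no longer passed around in the recursive calls:
they occur free in the local worker go.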

We have applied the equivalence-preserving transformation called
`lambda-dropping', the opposite of `lambda-lifting'. The correctness
of lambda-dropping is shown theoretically. Lambda-lifting is pervasive
in compiling higher-order languages; lambda-dropping is of limited
applicability and so is not done automatically. GHC authors had to
write foldr_Prelude by hand.

The next steps are two beta-reductions, the second of which
substitutes go xs. Generally, substituting an expression
is only sound in call-by-name/need, or when we can prove the expression is
effect-free and terminating. In this particular case, the substitution is
clearly always valid. (It is not clear how smart a call-by-value
compiler has to be to see that.)
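The result, reconstructed here from the inlining and beta-reduction steps
just described (Folds.hs has the detailed derivation),

foldl_via_foldr f z l = go l z
  where
  go []     acc = acc
  go (x:xs) acc = go xs (f acc x)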

is literally the left fold. A look at the Core code produced by
GHC shows that foldl_via_foldr and foldl compile to the same
code. After inlining and code substitutions GHC has derived the left
fold. Likewise, GHC derives the optimal, hand-written index from index_foldr.

Applying theory to practice is indeed a messy business. One has to
know the assumptions of the theory and use one's judgment. The
theoretical expressivity of the left fold in terms of the right fold
does not usually apply to practice, where we care about running out of
memory and k-times slowdowns. Therefore, the left fold, albeit
theoretically redundant, should be provided in a data structure
library along with the right fold, especially in call-by-value
languages. In the latter case, the left fold with an early termination
makes for a better primitive.

We have also seen that a theory that abstracts away memory consumption
and constant factors is not at all practically useless. The
abstraction uncovers the essentials of foldr and helps derive very
powerful optimizations, which do make, in some cases, the left fold
redundant after all, even in practice. It seems befitting to end with
two quotes from Kant: ``[T]he general rule must be supplemented by an
act of judgment whereby the practitioner distinguishes instances
where the rule applies from those where it does not. ... No-one can
pretend to be practically versed in a branch of knowledge and yet
treat theory with scorn, without exposing the fact that he is an
ignoramus in his subject.''

Version

The current version is May 2014.

References

Folds.hs [9K]
Complete code for the article, with detailed derivations and a few benchmarks

Immanuel Kant: Theory and Practice
(The full title, literally from German: ``On the Proverb: That May be True in Theory but Is of No Practical Use'')
A few salient quotes

An aggregation of rules, even of practical rules, is called a theory,
as long as these rules are thought of as principles possessing a
certain generality and, consequently, as being abstracted from a
multitude of conditions that nonetheless necessarily influence their
application. Conversely, not every undertaking [Hantierung] is a
practice [Praxis]; rather, only such ends as are thought of as being
brought about in consequence of certain generally conceived
[vorgestellten] principles of procedure [Verfahrens] are designated
practices.

For to the concept of the understanding that contains the rule must be
added an act of judgment by means of which the practitioner decides
whether or not something is an instance of the rule. And since
further rules cannot always be added to guide judgment in its
subsumptions (for that could go on infinitely), there can be
theoreticians who, lacking judgment, can never be practical in their
lives.

Reflection without Remorse: Revealing a hidden sequence to speed up monadic reflection

[Abstract]
A series of list appends or monadic binds for many monads performs
algorithmically worse when left-associated. Continuation-passing style
(CPS) is well-known to cure this severe dependence of performance on
the association pattern. The advantage of CPS dwindles or disappears
if we have to examine or modify the intermediate result of that series
of appends or binds, before continuing the series. Such examination is
frequently needed, for example, to control search in non-determinism
monads.

We present an alternative approach that is just as general as CPS but
more robust: it makes series of binds and other such operations
efficient regardless of the association pattern -- and also provides
efficient access to intermediate results. The key is to represent such
a conceptual sequence as an efficient sequence data structure.
Efficient sequence data structures from the literature are homogeneous
and cannot be applied as they are in a type-safe way to series of
monadic binds. We generalize them to type aligned sequences and
show how to construct their (assuredly order-preserving)
implementations. We demonstrate that our solution solves previously
undocumented, severe performance problems in iteratees, LogicT
transformers, free monads and extensible effects.

Joint work with Atze van der Ploeg.

References

zseq.pdf [277K]
Paper published in Proceedings of the Haskell Symposium 2014. September 4-5, 2014. Gothenburg, Sweden.

Generating interesting lambda-terms

It is straightforward to generate lambda-terms with the help of the
simplest non-deterministic operations for choice and failure, which are
available in many language systems. With a little
more care one can ensure that any given lambda-term occurs in the generated
stream sooner or later. It takes remarkably more care to make
interesting terms come sooner rather than much, much later. This article
gives a few tips on how to relatively quickly
generate interesting terms such as the S and Y
combinators, the divergent omega-term, a Church numeral and its successor.

We limit attention to closed terms since all interesting terms
mentioned earlier are closed. De Bruijn indices let us avoid wasting
time on building alpha-equivalent terms. (The terms will be
pretty-printed in the conventional, named-variable notation, for
clarity.) Thus we will be generating terms described by the
following grammar (data type declaration), with the side condition
that the terms are closed.

data Exp = V Int | A Exp Exp | L Exp

The generator is a truly straightforward application of the simplest
non-deterministic operations in the MonadPlus interface.

Absent constants, a closed lambda-term must start with a lambda-binder.
The rest could be either a variable, an application of two non-deterministically
chosen terms, or an abstraction with a random body.
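A minimal sketch of such a generator (not necessarily the article's exact
code; Exp is as declared above, and the order of the alternatives is one
possible choice):

import Control.Monad

-- gen n non-deterministically builds a term whose free de Bruijn
-- indices are all below n; a closed term starts with a binder.
a_term :: MonadPlus m => m Exp
a_term = liftM L (gen 1)
  where
  gen n = msum [ return (V i) | i <- [0 .. n-1] ]     -- a variable in scope
          `mplus` liftM L (gen (n+1))                 -- an abstraction
          `mplus` liftM2 A (gen n) (gen n)            -- an application

-- e.g. take 5 (a_term :: [Exp]) runs the generator in the List monad;
-- with this ordering, the abstraction alternative is explored forever
-- before any application is tried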

The equally straightforward implementation of MonadPlus, the List monad,
fails to generate anything interesting within the first 10 000 terms. It is
not difficult to see why by looking at the first eight generated terms.

The List monad realizes an incomplete search strategy: there
is no guarantee that any given term will ever come. No applications are
indeed forthcoming with the List monad. A better implementation of
non-determinism is needed, with
a complete search strategy: for example, iterative deepening or FBackTrack.
With iterative deepening (which in our implementation produces
the same sequence as breadth-first search but without taking Gbytes of
memory), the first 10 generated terms

look quite hopeful. Within the first 10 000 generated terms,
the self-application Lx.x x, or delta, comes 4th; the third Church
numeral comes 716th and omega comes 3344th. Alas, there is no
successor or the S combinator, let alone the Y combinator. With
the FBackTrack implementation, the first few terms may look even more hopeful

The term delta comes second; the third Church numeral comes
695th. Alas, within the first 99 977 terms nothing else interesting
comes. Generating interesting terms is not at all as straightforward
as it seems. Even sophisticated MonadPlus implementations did not
help.

It is crucial to recognize that the straightforward search for lambda-terms
is biased, and that bias is not in favor of interesting terms. Generally,
the fewer
non-deterministic choices it takes to produce a term, the sooner the term
comes. The straightforward generator clearly
favors selectors (terms of the form Lx.Ly...Lu.x) and simple
applications such as Lx. (x x) (x x). We should put the brakes on
generating abstractions: each new L adds a new variable and hence
dramatically more possible terms. We should encourage
the generator to play more with the variables it already has.
To prevent long strings of applications of a variable to itself,
we should produce terms in the general order of their size. Again, the goal
is to explore the search space more uniformly. It is tempting to generate
only normal forms. Alas, the Y combinator and omega do not have normal
forms. Therefore, we do generate redices, but only of interesting kinds
(whose argument is an abstraction), and only occasionally. The code
(given in full in the accompanying EnumFix.hs) incorporates these ideas.

The auxiliary iota i produces an integer greater than or equal to i; range i j chooses an integer between i and j, inclusive. The
operation yield

-- Lowers the priority of m, so choices of m will be tried less often
yield :: MonadPlus m => m a -> m a
yield m = mzero `mplus` m

and its n-th iterate penalty n produce a string of failures before trying
the argument computation m. When a complete search strategy sees
many failures it tends to turn away and to pay more attention to other parts
of the search space.
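That iterate can be written directly (a sketch consistent with the
description above):

penalty :: MonadPlus m => Int -> m a -> m a
penalty n m = iterate yield m !! n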

The term delta comes fourth, omega comes 54th, the Y combinator 303rd,
the third Church numeral 393rd, the S combinator 1763rd, and the successor
4865th.

The conclusion, although obvious in hindsight, is still
thought-provoking: interesting lambda-terms are really hard to
encounter by accident. They are exquisitely rare in the space of
possible lambda-terms and distributed non-uniformly. A monkey banging
on even a sophisticated lambda-typewriter may eventually print the Y
combinator, but would be unlikely to print even the addition combinator
within its lifetime.

References

EnumFix.hs [12K]
The complete code for the simple and sophisticated generators, as part of the procedure to search for fixpoint combinators

Efficient integer logarithm of large numbers in any base

The integer logarithm of the number n in base b is the integer l
such that b^l <= n < b^(l+1). The number one greater than the integer
logarithm base b is the size of n, that is, the number of digits
in its base-b representation. We present
Haskell98 code that is just as efficient as the internal GHC.Integer.Logarithms.integerLogBase# function, but uses no unboxed
data, no optimizations, and is not even compiled.

Naively, the integer logarithm is computed by repeated divisions of n by b, until the result reaches 1. This procedure requires l
divisions, where l is the logarithm. We present two efficient algorithms:
the first uses only multiplications, no more than 2*log_2 l of them, and (log_2 l) * sizeof(n) extra memory for the powers of b. The second
algorithm does log_2 l multiplications and no more than log_2 l
integer divisions (and the same amount of extra memory) to compute l.
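As a rough illustration of the multiplication-only idea (a simplified
sketch, not the article's major_bit/other_bits code): square b repeatedly to
bracket n, then assemble l greedily from its most significant bit down.

-- ilog b n computes l such that b^l <= n < b^(l+1), for b >= 2, n >= 1
ilog :: Integer -> Integer -> Int
ilog b n = go 1 0 (reverse squarings)
  where
  -- powers b^(2^j) not exceeding n, paired with their exponents 2^j
  squarings = takeWhile ((<= n) . fst)
                        (iterate (\(p,e) -> (p*p, 2*e)) (b,1))
  -- maintain p == b^l <= n while trying to add each exponent, largest first
  go _ l []            = l
  go p l ((pw,e):rest)
      | p*pw <= n      = go (p*pw) (l+e) rest
      | otherwise      = go p      l     rest

-- e.g. ilog 10 17425169 is 7; the size of n in base b is ilog b n + 1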

The multiplication-only procedure returns l+1, where l
is the integer logarithm of n in base b. It is a composition
of two functions, which together compute the bits of l: major_bit determines
the upper bound and other_bits improves it.

The correctness and the complexity analysis follow
from the invariants of the two auxiliary functions. In any call major_bit bases, the list bases is [BaseP b^k k | j <- [d,d-1..0], k=2^j] for some d and n > b^(2^d). In the first
invocation, d is 0 and progressively increases until such d>0 is
found that b^(2^(d-1)) < n <= b^(2^d). In any call other_bits
(BaseP bi i) bases, the list bases is [BaseP b^k k | j <- [d,d-1..0], k=2^j] for some d, k=2^d and b^i < n <
b^(i+2k). Each invocation of other_bits halves k until it reaches
one. The other algorithm, more efficient as it turns out, modifies other_bits to divide n by the candidate lower bound b^i.
That integer logarithm function computes that the 48th Mersenne prime 2^57885161-1 has 17425170 digits in its decimal representation --
in 1.2 seconds.

Fast computation of Bernoulli numbers

This Haskell98 code quickly computes Bernoulli numbers. The
code avoids explicit recursion, explicit factorials and (most)
computing with rationals, demonstrating stream-wise processing and CAF
memoization. For example, the following snippet defines a 2D table of
pre-computed powers r^n for all r>=2 and n>=1.
Thanks to lazy evaluation, the table is automatically sized as needed.
There is no need to guess the maximal size of the table in order to allocate it.

powers = [2..] : map (zipWith (*) (head powers)) powers
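For example (rows and columns indexed from zero), powers !! 2 !! 1 evaluates
to 3^3 = 27; only as much of each row is ever computed as is actually
demanded.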

The rest of the algorithm exhibits similar stream-wise
processing and computations of tables in terms of themselves.

Iterated zipping puzzles

The following is the refined code for the challenge
originally posted by Joe English on a Haskell-Cafe thread about
homework-like puzzles. The challenge was to figure out what the code
does without first loading it up in a Haskell interpreter.