Latest revision as of 11:45, 7 February 2013

In a language like Haskell, where Lists are defined as

Nil | Cons a (List a)

, creating data structures like cyclic or doubly linked lists seems impossible. However, this is not the case: laziness allows for such definitions, and the procedure of doing so is called tying the knot. The simplest example:

cyclic =let x =0 : y
y =1 : x
in x

This creates the cyclic list consisting of 0 and 1. It is important to stress that this procedure allocates only two numbers - 0 and 1 - in memory, making this a truly cyclic list.

The knot analogy stems from the fact that we produce two open-ended objects, and then link their ends together. Evaluation of the above therefore looks something like

Contents

This example illustrates different ways to define recursive data structures.
To demonstrate the different techniques we show how to solve the same problem---writing an interpreter for a simple programming language---in three different ways. This is a nice example because, (i) it is interesting, (ii) the abstract syntax of the language contains mutually recursive structures, and (iii) the interpreter illustrates how to work with the recursive structures.

At first, the lack of pointer-updating operations in Haskell makes it seem that building cyclic structures (circular lists, graphs, etc) is impossible. This is not the case, thanks to the ability to define data using recursive equations. Here is a little trick called `tying the knot'.

Remember that Haskell is a lazy language. A consequence of this is that while you are building the node, you can set the children to the final values straight away, even though you don't know them yet! It twists your brain a bit the first few times you do it, but it works fine.

Here's an example. Say you want to build a circular, doubly-linked list, given a standard Haskell list as input. The back pointers are easy, but what about the forward ones?

(takeF and takeR are simply to let you look at the results of mkDList: they take a specified number of elements, either forward or backward).

The trickery takes place in `go'. `go' builds a segment of the list, given a pointer to the node off to the left of the segment and off to the right. Look at the second case of `go'. We build the first node of the segment, using the given prev pointer for the left link, and the node pointer we are *about* to compute in the next step for the right link.

This goes on right the way through the segment. But how do we manage to create a *circular* list this way? How can we know right at the beginning what the pointer to the end of the list will be?

Take a look at mkDList. Here, we simply take the (first,last) pointers we get from `go', and *pass them back in* as the next and prev pointers respectively, thus tying the knot. This all works because of lazy evaluation.

Somehow the following seems more straightforward to me, though perhaps I'm missing the point here:

The problem with this is it won't make a truly cyclic data structure, rather it will constantly be generating the rest of the list. To see this use trace (in Debug.Trace for GHC) in mkDList (e.g. mkDList xs = trace "mkDList" $ ...) and then takeF 10 (mkDList "a"). Add a trace to mkDList or go or wherever you like in the other version and note the difference.

-- DerekElkins

Yeah, thanks, I see what you mean.

This is so amazing that everybody should have seen it, so here's the trace. I put trace "\n--go{1/2}--" $ directly after the two go definitions:

The above works for simple cases, but sometimes you need construct some very complex data structures, where the pattern of recursion is not known at compile time. If this is the case, may need to use an auxiliary dictionary data structure to help you tie your knots.

Consider, for example, how you would implement deterministic finite automata (DFAs). One possibility is:

That is, a DFA is a set of states, one of which is distinguished as being the "start state". Each state has a number of transitions which lead to other states, as well as a flag which specifies or not the state is final.

This representation is fine for manipulation, but it's not as suitable for actually executing the DFA as it could be because we need to "look up" a state every time we make a transition. There are relatively cheap ways to implement this indirection, of course, but ideally we shouldn't have to pay much for it.

(Note: We're only optimising state lookup here, not deciding which transition to take. As an exercise, consider how you might optimise transitions. You may wish to use RunTimeCompilation.)

Turning the indirect recursion into direct recursion requires tying knots, and it's not immediately obvious how to do this by holding onto lazy pointers, because any state can potentially point to any other state (or, indeed, every other state).

What we can do is introduce a dictionary data structure to hold the (lazily evaluated) new states, then introducing a recursive reference can be done with a a simple dictionary lookup. In principle, you could use any dictionary data structure (e.g. Map). However, in this case, the state numbers are dense integers, so it's probably easiest to use an array:

Note how similar this is to the technique of MemoisingCafs. In fact what we've done here is "memoised" the data structure, using something like HashConsing.

As noted previously, pretty much any dictionary data structure will do. Often you can even use structures with slow lookup (e.g. association lists). This is because fully lazy evaluation ensures that you only pay for each lookup the first time you use it; subsequent uses of the same part of the data structure are effectively free.

-- AndrewBromage

Transformations of cyclic graphs and the Credit Card Transform

Cycles certainly make it difficult to transform graphs in a pure non-strict language. Cycles in a source graph require us to devise a way to mark traversed nodes -- however we cannot mutate nodes and cannot even compare nodes with a generic ('derived') equality operator. Cycles in a destination graph require us to keep track of the already constructed nodes so we can complete a cycle. An obvious solution is to use a state monad and IORefs. There is also a monad-less solution, which is less obvious: seemingly we cannot add a node to the dictionary of already constructed nodes until we have built the node. This fact means that we cannot use the updated dictionary when building the descendants of the node -- which need the updated dictionary to link back. The problem can be overcome however with a credit card transform (a.k.a. "buy now, pay later" transform). To avoid hitting the bottom, we just have to "pay" by the "due date".

For illustration, we will consider the problem of printing out a non-deterministic finite automaton (NFA) and transforming it into a deterministic finite automaton (DFA). Both NFA and DFA are represented as cyclic graphs. The problem has been discussed on the Haskell/Haskell-Cafe mailing lists. The automata in question were to recognize strings over a binary alphabet.

whose fields have the obvious meaning. Label is used for printing out and comparing states. The flag acceptQ tells if the state is final. Since an FaState can generally represent a non-deterministic automaton, transitions are the lists of states.

An automaton is then a list of starting states.

type FinAu l =[FaState l]

For example, an automaton equivalent to the regular expression "0*(0(0+1)*)*" could be defined as:

Printing a FaState however poses a slight problem. For example, the state labeled '1' in the automaton dom18 refers to itself. If we blindly 'follow the links', we will loop forever. Therefore, we must keep track of the already printed states. We need a data structure for such an occurrence check, with the following obvious operations:

In this article, we realize such a data structure as a list. In the future, we can pull in something fancier from the Edison collection:

instance OCC []where
empty =[]
seenp =elem
put =(:)

We are now ready to print an automaton. To be more precise, we traverse the corresponding graph depth-first, pre-order, and keep track of the already printed states. A 'states_seen' datum accumulates the shown states, so we can be sure we print each state only once and thus avoid the looping.

The acceptance function for our automata can be written as follows. The function takes the list of starting states and the string over the boolean alphabet. The function returns True if the string is accepted.

We are now ready to write the NFA->DFA conversion, a determinization of an NFA. We implement the textbook algorithm of tracing set of NFA states. A state in the resulting DFA corresponds to a list of the NFA states. A DFA is generally a cyclic graph, often with cycles of length 1 (self-referenced nodes). To be able to "link back" as we build DFA states, we have to remember the already constructed states. We need a data structure, a dictionary of states:

At the heart of the credit card transform is the phrase from the above code:

converted_states1 =(dfa_label,dfa_state) `putd` converted_states

The phrase expresses the addition to the dictionary of the 'converted_states' of a 'dfa_state' that we haven't built yet. The computation of the 'dfa_state' is written 4 lines below the phrase in question. Because (,) is non-strict in its arguments and locate is non-strict in its result, we can get away with a mere promise to "pay". Note that the computation of the dfa_state needs t0' and t1', which in turn rely on 'converted_states1'. This fact shows that we can tie the knot by making a promise to compute a state, add this promise to the dictionary of the built states, and use the updated dictionary to build the descendants. Because Haskell is a non-strict language, we don't need to do anything special to make the promise. Every computation is Haskell is by default a promise.

For a long time, I've had an issue with Oleg's reply to Hal Daume III, the "forward links" example. The problem is that it doesn't really exploit laziness or circular values. It's solution would work even in a strict language. It's simply a functional version of the standard approach: build the result with markers and patch it up afterwards. It is a fairly clever way of doing purely something that is typically done with references and mutable update, but it doesn't really address what Hal Daume III was after. Fixing Hal Daume's example so that it won't loop is relatively trivial; simply change the case to a let or equivalently use a lazy pattern match in the case. However, if that's all there was to it, I would've written this a long time ago.

The problem is that it no longer gives you control of the error message or anyway to recover from it. With GHC's extensions to exception handling you could do it, but you'd have to put readDecisionTree in the IO monad to recover from it, and if you wanted better messages you'd have to put most of the parsing in the IO monad so that you could catch the error earlier and provide more information then rethrow. What's kept me is that I couldn't figure out a way to tie the knot when the environment had a type like, Either String [(String,DecisionTree)]. This is because it's impossible for this case; we decide whether to return Left "could not find subtree" or Right someValue and therefore whether the environment is Left or Right based on whether we could find the subtree in the environment. In effect, we need to lookup a value in an environment we may return to know whether to return it. Obviously this is a truly circular dependency. This made me think that Oleg's solution was as good as any other and better than some (actually, ironically Oleg's solution also uses a let instead of a case, however, there's nothing stopping it from being a case, but it still would provide no way to recover from it without effectively doing what is mentioned below). Recently, I've thought about this again and the solution is obvious and follows directly from the original definition modified to use let. It doesn't loop because only particular values in the lookup table fail, in fact, you might never know there was a variable lookup error if you didn't touch all of the tree. This translates directly into the environment having type [(String,Either String DecisionTree)]. There are several benefits to this approach compared to Oleg's: 1) it solves my original problem, you are now able to specify the error messages (Oleg's can do this), 2) it goes beyond that (and beyond Hal Daume's original "specification") and also allows you to recover from an error without resorting to the IO monad and/or extensions (Oleg's can't do this), 3) it does implicitly what Oleg's version does explicitly, 4) because of (3) it shares properly while Oleg's does not*, 5) both the environment and the returned value are made up of showable values, not opaque functions, 6) it requires less changes to the original code and is more localized than Oleg's solution; only the variable lookup and top-level function will need to change.

To recover, all one needs to do is make sure all the values in the lookup table are Right values. If they aren't, there are various ways you could collect the information; there are also variations on how to combine error information and what to provide. Even without a correctness check, you can still provide better error messages for the erroneous thunks. A possible variation that loses some of the benefits, is to change the DecisionTree type (or have a different version, IndirectComposite comes to mind here) that has Either ErrorInfo ErrorDecisionTree subnodes, which will allow you to recover at any time (though, if you want to make a normal DecisionTree out of it you will lose sharing). Also, the circular dependency only comes up if you need to use the environment to decide on an error. For example, a plain old syntactic parse error can cyclicly use an Either ErrorInfo [(String,DecisionTree)] perfectly fine (pass in fromRight env where fromRight ~(Right x) = x). It will also work even with the above approach giving the environment the type Either [(String,Either ErrorInfo DecisionTree)]. Below is code for a simplified scenario that does most of these things,

checkedFixup demonstrates how you could check and recover, but since the environment is the return value neither fixup or checkedFixup quite illustrate having potentially erroneous thunks in the actual return value. Some example input,

Consider a tree Y that contains the subtree X twice. With Oleg's version, when we resolve the "X" variable we look up a (manually) delayed tree and then build X. Each subtree of Y will build it's own version of X. With the truly circular version each subtree of Y will be the same, possibly erroneous, thunk that builds X, if the thunk isn't erroneous then when it is updated both of Y's subtrees will point to the same X.