The three modes defined for each predicate allow us to check if a given move or jump is valid for the given player, to enumerate the moves or jumps available to a piece at a given position, and to enumerate all moves and jumps available on the board, respectively. Implementation is straightforward:

Since in checkers, a piece must jump if it can jump, we will later find it useful to check whether a piece can jump from a given position without the overhead of actually updating the board. Checking whether a piece can move will also be useful for determining the end of game, so let’s define both can_move and can_jump:

game module

The game module will house the state of the game, and logic for generating possible successor states based on the rules of the game. The state of the game is given by whose turn it is, whether they are in the middle of a jump or not (in checkers, after jumping a piece, you may immediately jump another with the same piece), the history of moves (to check for draws) and the state of the board:

(The init predicate is redundant but some programmers prefer that style over zero-argument functions so we’ll provide it.)

Rules of the game

The change from one game state to the next is uniquely identified by the starting and ending locations of some piece on the board. We can declare a predicate which updates the game state based on this information, or generates possible new game states, as follows:

The definition of step itself is slightly hairy, as it must follow the rules of checkers. I’ve broken it into two cases, corresponding to when a player begins their move (jumping_state is not_jumping), and when a player is forced to continue a move because a further jump is available (jumping_state is jumping):

(is_draw is defined below.) Phew! That, fortunately, is the single most complex chunk of code in the entire implementation.

Checking the state of the game

It is important to provide predicates to determine whether a game has finished, and if so, who (if anyone) has won. First, we define can_step as a proxy for step which does not actually create a new game state:

These predicates are all fairly straightforward, so I’ll elide most of them here to save space. Of note is choose_move, which presents each game state in the list in turn to the player, asking if the player wishes to choose that move. If the player chooses none, the sequence is repeated. Note that the mode of this argument is given as in(non_empty_list) to ensure that an empty list of moves is not passed, which would result in an infinite loop. Here is the code, which is mostly implemented by choose_move_aux:

Players

This means that a player is any type Player which has defined for it a make_move predicate with the given signature. This allows us to implement various “players”, each with a different choose_move predicate.

By including the IO state and declaring the determinism of make_move as cc_multi, we allow both for players which must interact with IO (e.g., humans), and for players which may choose a move arbitrarily (e.g., search algorithms which find several optimal choices).

Now we can implement various human and AI players which conform to this interface. Since we’re just testing for now, we’ll define a human player like so:

(The dummy type human_player is needed so we can declare it as an instance of the typeclass player.) unsorted_solutions returns an arbitrarily ordered list of all the possible moves from a given position. We assert that this list should not be empty (as the player should not have been asked to make a move when there are no moves available, as this means that the game is over!), and forward the rest of the work to our previously-defined game_io.choose_move.

We can also define a dummy AI that chooses a move arbitrarily like so:

Why typeclasses?

You may rightly ask, “why not simply declare a type player ---> human_player; dummy_player and forget the whole typeclass business?” By using typeclasses, we can provide new player implementations in other modules, rather than stuffing them all into one possibly large module.

Of course, another way of doing this would be using higher-order predicates: we could declare players as a predicate type. The problem with this approach is that we may at some point determine that a player should have more predicates associated with it (such as one defining their name or difficulty), at which point we will need to pass around multiple predicates having messy types and instantiations. Using typeclasses allows us to redefine what a player is more easily.

The main loop

Last but not least, we will define a loop play_game, which takes a game state and two values whose types are instances of the player typeclass to represent the two players. The implementation is straightforward:

Hooray! Of course, now it’s our move again. This could go on for a while. Testing it proves that everything works however!

Conclusion

Today we covered building the game logic in the moves and game module, and I/O and the main loop in the game_io and checkers modules. Next week we’ll finish up by adding an AI based on the minimax algorithm.

Tuesday, November 29, 2011

Logic languages such as Mercury love search problems. Game artifical intelligence is a great application of searches. AIs for many zero-sum two-player board games, such as checkers and chess, can be implemented as a minimax search; a search which assumes that each player makes a move which maximally benefits himself. In today’s post, I’d like to walk through modeling checkers in Mercury.

Architecture

The overall architecture of our checkers game will consist of the following modules:

color

The two colors (players) in the game and their relation (opponents).

piece

The four kinds of pieces in the game and functions over them (namely, kinging).

position

A representation of the position of a piece, and predicates defining useful relationships between positions (namely, moving and jumping).

board

A representation of the checkers board, predicates about positions on it, predicates about pieces on it, and predicates to modify a board in useful ways (namely, moving, removing, and kinging pieces).

moves

Predicates to update a board via the patterns described in position.

game

An encoding of the state of a game, a predicate to modify it (by making a move according to the rules of checkers), and predicates to query the state of a game (namely, whether the game has ended, and the result if so).

util

Some utility functions for dealing with integers and lists; used primarily by minimax.

minimax

The minimax algorithm and related typeclasses.

heuristic

A declaration of the game as a minimax problem, with associated heuristics.

game_io

Predicates for printing various game structures in a human-readable format, and obtaining user input.

checkers

The main module, allowing configuration of the game and defining the overall UI.

This architecture ensures that modules are loosely coupled and small in size (< 120 loc each, average ~ 60 loc), making modifications and debugging easy. A similar overall architecture should prove useful for other game AI implementations.
Let’s go into each module in detail. In today’s post, we’ll focus on color, piece, position, and board.

position module

A position is simply an X, Y tuple:

:- type position ---> position(int, int).

It is useful to talk about pieces changing position by moving and jumping. Let’s declare two predicates, move and jump, which will define the relationship between the starting and ending locations of a piece which moves or jumps, and in the case of a jump, the position which was jumped over.

It will be most useful to know whether a particular change in position describes a move or jump, and given a starting position, what ending positions would result from a move or jump. We will declare two modes, corresponding to these use cases, for each predicate:

Unfortunately any attempt to compact this code results in mode errors. This is because Mercury’s switch detection code is not very robust in the face of complicated switches, and because Mercury seems not to reorder conjunctions to obtain mode-correctness.

board module

First we define the type of a board, which should contain its width, height, a mapping from position to piece, and (for efficiency) a count (using the handy bag module) of how many of each type of piece is on the board:

Of course we could fix the width and height to 8, but it is easy and interesting to allow for other size boards. Since Mercury automatically generates “getter” functions for width and height, we need only declare them in our interface, even if we choose to hide the type of board:

:- func width(board) = int.
:- func height(board) = int.

Another trivial but important state accessor is count_pieces, which counts the number of a given kind of piece on the board:

Note that we call legal_position with a partially instantiated position, position(_, Y), which is also bound to Pos. This means that we need a mode of legal_position which can fill in valid X coordinates of a given row.

All those modes! Beside the modes where the position argument is in and out (corresponding to checking a legal position and generating all legal positions), the other two modes work with partially instantiated positions, where we wish to know the legal positions in a given column or row (respectively). Thankfully, all four modes can be implemented with one body:

Unfortunately the predicate nondet_int_in_range in Mercury’s int module does not provide a semidet mode and is thus unsuitable for bidirectional programming. Instead, we declare a similar predicate, int_in_range, in our util module:

Because we know that these two clauses are logically equivalent, we promise such to Mercury using promise_equivalent_clauses.

Locating pieces

To query which pieces are at which positions, we declare piece_at. piece_at has two modes, one for checking which piece (if any) is at a given position, and one for iterating over all pieces currently on the board:

Our implementation of piece_at should be a thin layer over the positions map, but Mercury’s map module does not provide one predicate which serves both purposes. Again, we can remedy this by using different clauses for each mode:

Note that no_piece_at(…) is not simply not piece_at(…)! By additionally requiring that Position be a legal position, we restrict the domain of no_piece_at and thus allow it to output the finite number of positions where no piece exists (as opposed to the infinite number of positions which either have no piece or are not legal).

Modifying the board

There are three predicates in board to modify the board: move_piece, remove_piece, and king_piece. move_piece and remove_piece have modes which allow for both modifying a given location(s), and for generating all possible such modifications of the board. king_piece has only one mode to modify a given location. All three have similar code utilizing state variablefield update syntax. Here is the implementation of king_piece as an example:

It is my opinion that Mercury would benefit from an extension to state variable update syntax which allows state variable record syntax for data terms. Such a syntax would, in concert with the standard library argument order modifications introduced in Mercury 11.07, allow the above predicate to be written more concisely as:

(Of course, the update functions allowed to be used with such a syntax must be restricted to those which commute with each other.)

Review

That’s all for now, this post is long enough! Today we saw the overall structure of our checkers AI – ten small modules – and the implementations of color, piece, position, and board, the modules defining the “parts” of the game. Next week we’ll see the implementations of the “rules” and “brains” of the game.

Wednesday, November 9, 2011

I know it’s been a while since my last post, but I have something fantastic in the works, I promise. Today though I’d like to talk about the compiler grades.

Common grades

Compiler grades correspond roughly to combinations of different backends (e.g. C or Java) and code models (e.g. parallel or debugging). Those of you new to the language may be confounded by the immense number of these grades, and when initially compiling Mercury it can be difficult to decide which grades to enable. Here’s a list of the ones I use most. In parentheses next to the grade names are the options needed to select the grade, assuming that asm_fast.gc is the default (it usually is).

asm_fast.gc (default)

This is my default go-to grade. It compiles to low-level C code using the Boehm garbage collector and supports concurrency (though not parallelism). It is therefore very fast, can interface seamlessly with C libraries, and can run code that uses threads.

asm_fast.par.gc.stseg (--parallel --stack-segments)

This grade adds parallelism and stack segmentation to the asm_fast.gc grade. Parallelism allows programs to make use of multiple cores (though note that this isn’t necessary if all you need is concurrent execution). Stack segmentation allows pieces of the stack to be allocated on the heap. Beside allowing programs to grow beyond a fixed stack size, this option (only available for the asm_fast grades) greatly reduces the amount of memory allocated by thread spawns, which can otherwise measure in the gigabytes. Both these features come at a small runtime cost.

hlc.gc (-H)

This is the “high-level C” grade. It compiles to C code which is much closer to what a programmer might write than that generated by the asm_fast grade. Curiously this grade is sometimes faster than the low-level grade (likely due to an increased number of optimization opportunities available to GCC), so if you need speed, it is worth testing your application in this grade as well. Unlike asm_fast however, this grade does not support concurrency.

hlc.par.gc (-H --parallel)

This grade adds to hlc.gc support for concurrency and parallelism via native system threads. If your application requires a 1:1 mapping of Mercury threads to system threads, then this grade is a must.

java (--java)

As the name implies, this grade compiles to Java code. It is the only option if you need to interface with Java libraries. Due to the JVM’s speed the code it generates is often as fast as its C counterparts. This grade supports concurrency and parallelism via Java’s built-in threading mechanisms (--parallel is not needed). (Note that unlike in the C grades, a program compiled under the java grade will not wait for all spawned threads to terminate before exiting.) Unfortunately some standard library procedures are not yet implemented.

asm_fast.gc.prof (-p); hlc.gc.prof (-H -p)

These grades allow you to profile your code using mprof, of course at the expense of execution speed. They are handy to have around.

asm_fast.gc.debug (--debug)

This grade allows you to debug your code using mdb. Note that if you’re not debugging the standard library, then instead of compiling this grade, you can instead pass the --trace deep option to the compiler to enable debugging of just your modules.

Compiling extra grades

We all know that the Mercury build process takes forever. This is because it must compile the standard library in each of the required grades. Fortunately there’s a shortcut around this.

The first time you install the system, instead of make install, run make install LIBGRADES=asm_fast.gc. This will install the compiler, but only the default library grade. You can pass LIBGRADES= if you wish instead to install no library grades.

Later, when you wish to install additional grades, run make install_grades LIBGRADES='asm_fast.par.gc hlc.gc' (for example). This will compile only the given library grades. To complete the installation, you should edit /path/to/mercury/lib/mercury/conf/Mercury.config to add the appropriate --libgrade flags to the end of DEFAULT_MCFLAGS (near the end of the file), and /path/to/mercury/lib/mercury/mmake/Mmake.vars to add the grade names to the definition of LIBGRADES. (These steps are only necessary if you plan to build and install libraries.)

Don’t forget also that you can speed up compilation by passing PARALLEL=-j5 (where 5 is one plus the number of parallel cores in your system) to make (or simply -j5 to mmake).

Working with multiple grades

Finally, when testing with multiple grades, you may with to use the --use-grade-subdirs option in conjunction with mmc --make. This will cause mmc --make to use a different subdirectory for each grade in which you build your project, avoiding errors caused by linking incompatible grades, and allowing you not to have to rebuild the entire project each time you switch grades.

The string “if (x > 0) x--; return;” will be parsed once by the first branch, which fails when it finds “return” instead of “else”, and will then be reparsed by the second branch, which will succeed. This sort of prefix-repetition can be eliminated by factoring out the repeated portion:

Pretty innocuous, right? Take a look at each of the productions it invokes. Almost all of them start with a call to kw, which in turn calls token. In fact, all of them start with a call, directly or indirectly, to token. In fact, this is true of every production in the parser. token is not the only culprit but it is the most pervasive.

This problem can’t be solved through a simple restructuring of the grammar. For this we need a more powerful tool: memoization.

Memoization

Memoization is one of my favorite features. Simply put, memozation of a procedure means maintaining a table of its past input and output values, and looking up output values in this table when the procedure is called in the future with the same input values. Of course, this only works if the procedure has no desirable side effects; fortunately this is true of most Mercury code. Mercury provides built-in support for memoization via the memo pragma. Basic usage is so:

:- pragma memo(Name/Arity).

By default, Mercury memoizes inputs by value. This will catch all repeated calls to a procedure, but can be slow for large structures. We can specify that we would like Mercury to memoize inputs by address instead:

:- pragma memo(Name/Arity, [fast_loose]).

Memoizing by address is much faster but may miss calls with identical values at different addresses. However, because our primary data structure is a list of characters which is very large (meaning memoize-by-value will be slow) and which we are only destructing (meaning identical sublists of this list must be physically the same), we will memoize by address.

Profiling

Now, although we could memoize every procedure in our lexer and parser, memoization does have overhead, and so we should not use it on procedures which are not called often or with repeated inputs. We should limit our scope to grammar rules which are invoked by many different rules, or multiple times by one rule, since these are likely to be invoking the rule repeatedly on the same input.

To find such candidates for memoization, we can use runtime code profiling. To generate profiling data, we can simply compile with the -p switch and then run our program with our test from last week (minus the memoization of token we added at the end):

Over 95% of that run was spent just checking for keywords and punctuators! In particular, the biggest culprit, string.to_char_list, is highly suspect. To find out who’s calling it, we can run mprof -c [note: sometimes I have had to run mprof -cd; I can’t reproduce this now so I’m not sure why] and search for the callers of string.to_char_list:

The [21] in the left-most column indicates that this section is dedicated to procedure #21, string.to_char_list, which was called 17,602,295 times. The two lines above this line indicate that both lexer.keyword_rec and lexer.punctuator_rec call string.to_char_list, 7,487,945 and 10,114,350 times respectively. Of course; we use string.to_char_list to convert our keywords and punctuators to character lists for these two procedures to match against the input data. But clearly we don't have 7,487,945 keywords or 10,114,350 punctuators, so string.to_char_list must be getting called with the same arguments repeatedly. Let’s fix that:

Unfortunately we can’t instruct Mercury to memoize procedures in other modules, so we must define to_char_list_memo as above. We can replace the two calls to it in lexer.keyword_rec and lexer.punctuator_rec with the memoized version. Compiling and reprofiling tells us:

Tracking down token

We’ve reduced the percentage of total runtime consumed by to_char_list, but it’s still getting called an awful lot. We know keyword_rec and punctuator_rec don’t call it any more often than they need to, so we need to find out who’s calling them. Let’s follow keyword_rec:

And there’s the culprit. token is called by many other procedures, and as discussed above, almost certainly with repeated inputs. Let’s memoize token.

:- pragma memo(token/3, [fast_loose]).

Now (as we found out last week), parsing our simple test completes in a reasonable amount of time. Let’s give our parser something bigger to chew on.

Factoring out multiple calls from multiplicative_expression

For stress testing, I used a 6kb program from Glen McCluskey’s C99 test suite, tdes_047.c. (It is available in the set of free samples if you wish to try it yourself.) We can use this (or any C99 code) as follows:

$ cpp -P c9_samp/tdes_047.c | ./c_parser_direct

And of course, we are met with no end to the parsing. mprof still tells us that token is responsible for the most CPU usage, but since we’ve already memoized it, let’s work our way up the chain some more. Who is responsible for most of the calls to token?

It seems like postfix_expression_rec is the winner; it is responsible for over half the calls to token. Tracing it back similarly to how we traced keyword_rec, we find that it is called solely by postfix_expression 8,149,775 times, which is called solely by unary_expression a similar number of times, which is called almost exclusively by cast_expression a similar number of times, which is called almost exclusively by multiplicative_expression, which is called… only 2,054,666 times:

In other words, multiplicative_expression is responsible for multiplying the number of calls passing through it to token fourfold. Let’s look at multiplicative_expression (again, with Typedefs elided):

Ad nauseum

And that’s about all there is to it: following this method I found I needed to memoize two more procedures and factor one more, and the parser parsed the test file faster than time could reliably distinguish. Of course, it would behoove us to test on different inputs in order to exercise different grammar productions, but the method is the same.

Summary

To recap:

Use mprof -c to determine which procedure is the largest time hog.

If the procedure is called by primarily one other procedure on a 1:1 basis, or if you have already optimized it, look instead at its most prominent caller (etc.).

If the procedure is called by primarily one other procedure on a greater than 1:1 basis and it is a common prefix of this caller, factor it out.

Otherwise, memoize it.

Recompile and repeat.

This simple method will asymptotically increase the speed of almost any naïve PEG parser, in my experience, from unusable to quite fast, and will avoid the overhead associated with unnecessarily memoizing every grammar production in a parser.

Further reading

The direct encoding of PEGs in Mercury, and the performance implications thereof, were first explored in-depth by Ralph Becket and Zoltan Somogyi in the paper DCGs + Memoing = Packrat Parsing; But is it worth it?; I highly recommend it as further reading for those interested in the subject.

Sunday, October 9, 2011

Continuing last week’s post, I’d like to give an example of implementing a real-life PEG. What better than the C language, specifically, preprocessed C99. (Note: It may be helpful to the reader to open the linked PDF to Annex A, “Language syntax summary”.)

Recall that PEGs differ from more commonly used CFGs in that they are completely deterministic, greedy, and don’t allow left recursion. This can make translating some grammars difficult or impossible if they rely on those features. Fortunately, C99’s grammar is straightforward enough that, with a little thought, we can work around these problems. Let’s see how.

The Lexer

First off, let’s create a submodule for the lexer like so:

:- module lexer.
:- interface.

From the lexer we will export two predicates, one which lexes single tokens (token), and one which lexes a sequence of whitespace (whitespace):

Note: it’s important that we define the mode of the token argument of token as out rather than in, even though they are logically equivalent. This will allow us later to memoize token on one argument instead of two, greatly reducing our memory use at negligible (if any) processing overhead. This trick is possible precisely because PEGs are deterministic: the determinism of token with token as out will still be semidet, rather than nondet (and hence unmemoizable) as it would be with a CFG.

The C99 standard defines 5 kinds of lexical tokens our lexer will need to generate: keyword, identifier, constant, string-literal, and punctuator. A glance through the C99 grammar reveals that merely to parse the grammar, we do not need to know the specific values of constants and string-literals, so let’s define the type of a lexing token like so:

Of course, if we were constructing a parse tree, we would want to hold onto the values of constants and string-literals, but for now since we are just matching the grammar it is OK to leave them out. Onto the implementation!

:- implementation.

Whitespace, or, using built-in character classes

Let’s get whitespace out of the way. The C99 definition of whitespace corresponds to Mercury’s definition of whitespace, so we can import char and define whitespace like so:

However because this is an unordered choice in the spec, but an ordered choice in Mercury, we need to apply a moment’s thought that choosing this ordering will not change the language. Fortunately this is easy to confirm: the latter four each must start with a different character (letter, digit, quote, and non-quote symbol, respectively), and the first (keyword) should be matched in favor of the second (identifier) anyway.

The C99 spec assumes that whitespace is removed in some phase in-between lexing and parsing, and that string literals are concatenated. Because we will be attaching our lexer directly to our parser (such is the magic of PEGs), we will need to perform these two steps in our tokenizer. Finally, we will want to actually produce a lexing token as well (rather than just match it). The final implementation of token is thus:

But the devil is in the details. We must ensure that no keyword is matched against the prefix of another keyword or identifier (e.g. we should not match the “do” in the keyword “double”, or the “for” in the identifier “format”). The latter can be fixed by using a negative lookahead assertion:

In other words, a keyword must not be followed immediately by one of the characters which is valid in a keyword or identifier (defined later). Note that this won’t fix partial keyword matching, since if the assertion fails the parse as a keyword will be rejected entirely. Either we should move the negative lookahead assertion into the bodies of the branch, or simply rearrange the offending keywords from longest to shortest. Since “do” and “double” are the only offenders, it’s easy enough to swap them. (We’ll encounter a better example later.)

(Note: In the completed lexer linked below, I add a little extra machinery to allow listing keywords as strings in a list. There is nothing PEG-specific about this technique, and it saves me only keystrokes.)

This will either match a single identifier_nondigit, or, failing to do so, will get caught in a loop calling itself again having consumed no input. Not what we wanted. We need to split the initial call to identifier_nondigit out, and write a right-recursive rule for the rest of the identifier:

Constants, or, fixing choice using order

Now, the astute reader will notice that something will go wrong here: an integer literal is also a valid floating-point literal in C99, so if we wrote this production directly as a PEG, it would always match the integer part of a floating-point literal as an integer. Again, not what we want.

We could use a negative lookahead assertion as above to fix this, but easier still is to change the order, prioritizing floats about integers:

Easy enough! Since enumeration constants must always start with a letter (which floats and integers do not), and character constants must always start with a single quote, we are all set here.

The rest of the lexer for constants and string literals uses one of these techniques so let’s move on to the last lexing token, punctuators.

Punctuators, or, fixing choice by ordering by length

punctuator, like keyword, is merely a list of strings of symbols which are meaningful in C99. This list presents the same problem that the list of keywords presents. Namely, that a shorter punctuator may match the prefix of what is in fact a longer punctuator. (For example, < may match the symbol <<= in source text.)

For this production, we can use the solution of ordering the punctuators from longest to shortest. This works without much further thought. (Again, I use some extra code to allow me to define this production as a list of strings in my example code.) And that about does it for the lexer.

:- end module lexer.

The Parser

The parser is, largely, trivial, using the same four techniques described above for the lexer (rewriting left-recursion and the three ordering techniques). Since the grammar makes lots of reference to individual kinds of tokens, I defined a few shortcuts:

The grammar also references the enumeration-constant production, which, although not a token directly, is merely an identifier and we can treat it as such:

enumeration_const --> id(_).

Ensuring that all calls to the lexer are funneled through token will make our lives easier when we optimize later.

Now before I sign off, there are two gotchas in the grammar, for both of which C is infamous.

The “dangling else”

C’s if/then/[else] has the problem that in a CFG, it is ambiguous. Does if (1) if (0) puts("hi"); else puts("bye"); print bye or nothing? The C99 standard indicates that an else should match the nearest if it legally can (so the above code would print bye).

Fortunately with PEGs, this ambiguity is trivial to solve. We need simply write the if/then/else case first in the grammar. Any else will be matched by the innermost (and thus nearest) if possible:

Typedefs

For the unfamiliar, a typedef in C is a means of defining an alias for an existing type specification. There’s nothing odd about that, except that types in C are keywords and the set of types determines the parse. In other words, C is not context-free.

Even though PEGs can directly represent some non-context-free grammars, C is not one of them. Fortunately the problem is easily solved by parameterizing the grammar and passing around a set of “typedefed” identifiers:

:- import bool, set.
:- type typedefs == set(string).

Typedefs are created in declarations, if one of the declaration-specifiers is a storage-class-specifier which is the keyword typedef:

(Note the use of state variable syntax, and the factored-out declaration_specifiers_opt, which makes the code passably legible.) Typedef declarations may legally exist in block-items, scoped within compound-statements:

The rest of the implementation of typedefs involves parameterizing most productions to pass the set of typedefs from external_declaration down to typedef_name, and parameterizing the few illegal uses of typedefs (for example, in function signatures) to simply drop the modified set of typedefs. Nothing brain-bending; just a lot of typing.

Summary & Test

In today’s post we saw how to translate a real language grammar, that of C99, into a Mercury program. We learned how to transform left-recursive rules into right-recursive ones, how to ensure proper production choice by using negative lookahead assertions, ordering by precedence, and ordering by length, and how to implement the non-context-free elements of C99.

Let’s add a main function similar to yesterday’s, compile, and try out our spiffy new parser:

$ ./c_parser_directint main(void) {}

…and wait, and wait, and wait, and… what’s going on? Why are we getting no response? Do we have an infinite loop?

Nope. Add this line in the lexer module:

:- pragma memo(token/3, [fast_loose]).

$ ./c_parser_directint main(void) {}

(wait for it…)

Success!

Success indeed, but it’s still slow. More on memoization and how to speed this up even more in part 3.

Sunday, October 2, 2011

First, I’d like to say welcome back! AiM is returning from its hiatus since April. The summer has been busy for me, but I hope to update AiM weekly henceforth. Let’s start!

Parsing Expression Grammars

Parsing expression grammars (PEGs) are a new-ish formal grammar intended to describe computer languages. PEGs differ from the more commonplace context-free grammars most notably in that PEGs offer an ordered choice operator. The second branch of an ordered choice implicitly does not match any expression which is matched by the first branch. Since any text can only match exactly one branch of an ordered choice, this has the practical implication that a PEG cannot be ambiguous and rule precedence declarations are not needed.

Another natural consequence of ordered choice is that so-called repetition operators (* and +) and the option operator (?) are necessarily greedy. (Any time a grammar rule can recur, ordered choice ensures that it does recur, since the option of not recurring will implicitly not match.)

A PEG rule, as described in the Wikipedia entry, may be either:

a terminal (a symbol),

a nonterminal (another rule), or

the empty string.

or may be constructed from other rules by:

sequence: AB

ordered choice: A / B

zero or more: A*

one or more: A+

optional: A?

positive lookahead: &A

negative lookahead: !A

Positive lookahead and negative lookahead test for a rule match (or lack thereof) but do not consume input.

Translating into Mercury

It’s well-known that CFGs can be implemented directly in logic languages such as Mercury as recursive-descent parsers. Prolog and Mercury even have a special syntax known as definite clause grammar (DCG) to facilitate doing so. It’s similarly easy to implement PEGs, and there’s a surprise at the end!. Let’s see how.

PEG rules can be represented using DCG clauses like so:

rule --> body

The type and mode of rules can be inferred by Mercury, but if you need to declare them, you can do so as:

:- pred rule(list(char)::in, list(char)::out) is semidet.

The elements of the body of a PEG are translated as follows:

terminal C

[C]

nonterminal A

A

empty string

[]

sequence AB

A, B

ordered choice A / B

A -> []; B
Note that using -> [] ; (if/else) instead of ; gives us exactly the greedy semantics we require.

optional A?

A -> []; []

positive lookahead &A

not not A

or equivalently, =(S), A, :=(S)

negative lookahead !A

not A

Repetition constructions (A* and A+) are recursive in nature and cannot be represented inline. However we can define higher order rules to represent them as:

Monday, April 4, 2011

This is the first post in a series on dependent typing. For those not familiar with dependent typing, it is the ability to describe data using types which say something about the particular value of the data itself – the type of the data depends on the value of the data. For example, the phrase “even integers” could describe a dependent type which is a subtype of integers. Whether a number belongs to that type depends on whether its value is evenly divisible by 2. (Note that any subtype can be viewed as a simple kind of dependent type.)

The obvious use of dependent typing is to describe invariants of your program, for example, “a non-NULL pointer”, “an array of length 5”, or “a non-empty list”. While full-blown dependent typing requires a full-blown theorem prover (such as Coq or ATS/LF), many mainstream languages support basic dependent types. Java can infer whether a value is not null, OCaml can talk about types which are made up of only certain constructors, Mozilla’s new language Rust can attach arbitrary predicates to types, and I would be remiss not to mention Mercury.

Mercury has limited but useful support for dependent typing via its mode system. The mode system appears complicated, so let’s start with a brief overview.

The mode system

Each predicate (or function) in Mercury does something to its arguments. The most common possibilities are to look at an argument and leave it be (this would be an input argument), or to attempt to bind it to a value (this would be an output argument). Another possibility is that it destroys its argument after looking at it (this happens with the IO state). These possible actions are called modes.

Modes are described simply by the expected state of the argument before the call, and the guaranteed state of the argument after the call using the syntax Before >> After. These states (called instantiatednesses or insts for short) may be one of free (the argument is unbound), ground or bound (the argument is bound to a value), clobbered (the argument is destroyed or overwritten), and unique (the argument is bound to a value which is not referenced anywhere else). The latter two are used mainly for IO state and arrays, so we will focus on free and bound.

The modes we are all familiar with, in and out, are defined in the module builtin as ground >> ground and free >> ground, respectively. In other words, an in argument is expected to be bound to a value before the call and is left that way after, and an out argument may be unbound before the call, but will be bound after.

The magic is that we may tell Mercury that we expect or guarantee an argument to be bound only to certain constructors! To do this, we must first define the instantiatedness in which we are interested, say non-empty lists:

:- inst non_empty_list == bound([ground | ground]).

This defines the non_empty_list inst as a variable bound to a list cons cell, the head and tail of which are both bound. We could similarly define the inst of an empty list as:

:- inst empty_list == bound([]).

In other words, an empty_list is a variable which is bound to the empty list constructor.

Note that there is also a shorthand which allows the above two insts to be written similarly to type definitions:

We could then define the mode of an input argument which is expected to be a non-empty list as:

:- mode in_non_empty_list == non_empty_list >> non_empty_list.

In other words, an argument with this mode would be expected to be bound to a non-empty list, and would remain bound that way afterward. Similarly:

:- mode out_non_empty_list == free >> non_empty_list.

An argument with this mode may be unbound before the call, but is guaranteed to be bound to a non-empty list after. Mercury actually has the mode-functions in() and out() already defined, so we may write in(non_empty_list) or out(non_empty_list) to describe these same modes.

Example

Note: due to bug 191, the following example won’t trigger the proper error messages from the compiler as of this post. Download and apply this patch against the compiler to fix the issue, or work around the bug by replacing the two out modes with out(list(ground)), which won’t change the meaning but will avoid triggering the bug. UPDATE: parts of the standard library won’t compile without the bug, so don’t go applying that patch just yet.

On a recent thread at Hacker News, a question was brought up about proving that [X | Xs] = reverse([Y | Ys]) was total (or in Mercury terms, det). Let’s see if we can do this in Mercury using our new tools. First let’s define reverse2 (named so not to conflict with built-in list.reverse) using a recursive auxiliary function with an accumulator:

A sharp eye will spot the problem before we even compile: the deconstruction [X | Xs] = can fail, which would cause main to be semidet instead of det (that is, a partial function). The compiler spots this as well:

Of course this isn’t true. What the compiler would love to see is a mode declaration stating that if reverse2 is passed a non-empty list, then it will return a non-empty list. Let’s write that:

:- mode reverse2(in(non_empty_list)) = out(non_empty_list).

Now we get a different error:

In clause for `reverse2(in((list.non_empty_list))) =
out((list.non_empty_list))':
mode error: argument 2 did not get sufficiently instantiated.
Final instantiatedness of `HeadVar__2' was `ground',
expected final instantiatedness was `bound(list.'[|]'(ground,
ground))'.

(Note: if you didn’t/can’t apply the above compiler patch, as a workaround, replace out in the original mode declarations with out(list(ground)).)

Here, HeadVar__2 refers to the function result (what would be the second variable in the head were reverse2 a predicate). The compiler expected the function result to be non-empty (bound to list.[|]) but it was bound to an unknown value of type list. Of course this is because reverse2_aux doesn't guarantee a non-empty result on non-empty input, so let’s add a declaration stating so as well:

:- mode reverse2_aux(in(non_empty_list), in) = out(non_empty_list).

(Note that we don’t say anything about the accumulator in this declaration.) Unfortunately recompiling gives us a similar error, this time complaining about the second (the recursive) clause of reverse2_aux. Of course – if our initial list had only one item, the recursive call would pass reverse2_aux an empty list, triggering the first clause, and we haven’t made (and can’t make) any special guarantees about that case. Let's take a closer look at that clause:

reverse2_aux([], Acc) = Acc.

The only way that this clause will return a non-empty list is if the accumulator, Acc, is non-empty. Fortunately, it is called recursively with a non-empty accumulator (since elements are prepended on every recurrence), so this should be enough to convince the compiler. Let’s add a mode declaration stating this fact:

:- mode reverse2_aux(in, in(non_empty_list)) = out(non_empty_list).

Success! Compilation is successful, meaning main has been proved to be det (a total function), and running the program gives the expected output:

{5, [4, 3, 2, 1]}

Using Mercury’s mode system, we were able to prove that reversing a non-empty list results in a non-empty list, by adding descriptive mode declarations in strategic locations.

As an exercise, add an inst describing a list of at least two elements and add modes to reverse2 and reverse2_aux which make guarantees about lists of at least two elements.