On Wed, 2009-10-07 at 11:32 -0400, Joe Fredette wrote:
> So, if I understand this:
>> Parsec is a DSL, I'm going to venture it's a "Deep embedding" -- I
> don't understand the internals, but if I were to build something like
> Parsec, I would probably build up a "Parser" datastructure and then
> apply optimizations to it, then "run" it with another function.
>> Am I on the right track here?
Parsec, like most other parser combinator libraries, is a shallowly
embedded DSL. The "Parser a" type is a Haskell function that does
parsing, i.e. a function of type String -> Maybe (String, a).
(Obviously, the real Parsec library allows more than strings, and has
better error reporting than this type, but this is the basic idea).
You can't analyse it further---you can't transform it into another
grammar to optimise it or print it out---because the information about
what things it accepts has been locked up into a non-analysable Haskell
function. The only thing you can do with it is feed it input and see
what happens.
A deep embedding of a parsing DSL (really a context-sensitive grammar
DSL) would look something like the following. I think I saw something
like this in the Agda2 code somewhere, but I stumbled across it when I
was trying to work out what "free" applicative functors were.
First we define what a production with a semantic action is,
parameterised by the type of non-terminals in our grammar and the result
type:
> data Production nt a
> = Stop a
> | Terminal Char (Production nt a)
> | forall b. NonTerminal (nt b) (Production nt (b -> a))
You can think of a production as a list of either terminals or
non-terminals, terminated by the "value" of that production. The
non-regular nested type argument in NonTerminal means that the final
value can depend on the values that will be returned when parsing the
strings that match other non-terminals.
Productions are functors:
> instance Functor (Production nt) where
> fmap f (Stop a) = Stop (f a)
> fmap f (Terminal c p) = Expect c (fmap f p)
> fmap f (NonTerminal nt p) = NonTerminal nt (fmap (fmap f) p)
They are also applicative functors:
> instance Applicative (Production nt) where
> pure = Stop
> (Stop f) <*> a = fmap f a
> (Terminal c t) <*> a = Terminal c (t <*> a)
> (NonTerminal nt t) <*> a = NonTerminal nt (fmap flip t <*> a)
A rule in one of our grammars is just a list of alternative productions:
> newtype Rule nt a = Rule [Production nt a]
Since lists are (applicative) functors and (applicative) functors
compose, Rule nt is also a Functor and Applicative functor:
> instance Functor (Rule nt) where
> fmap f (Rule l) = Rule (fmap (fmap f) l)
> instance Applicative (Rule nt) where
> pure x = Rule $ pure (pure x)
> (Rule lf) <*> (Rule la) = Rule $ (<*>) <$> lf <*> la
It is also an instance of Alternative, because we composed with lists:
> instance Alternative (Rule nt) where
> empty = Rule []
> (Rule r1) <|> (Rule r2) = Rule $ r1 <|> r2
A grammar is a map from nonterminals to rules, which are lists of
alternative productions, which may themselves refer back to nonterminals
in the grammar:
> type Grammar nt = forall a. nt a -> Rule nt a
Given a value of type "Grammar nt", and a starting nonterminal in "nt a"
for some "a", one can easily write a function that translates it into a
Parsec grammar to do actual parsing, or implement a different parsing
strategy using memoisation or something similar. The translation to a
traditional parser combinator library is actually a
(indexed-)homomorphism of applicative functors + extra operations, which
is pretty cool.
If you also know some extra facts about the "nt" type (e.g. that it is
finite), then it should be possible implement an CYK or Earley parser
using this, or to print out the grammar (for documentation purposes, or
for telling another node in a distributed network what things you
accept, for instance).
Note that these grammars are strictly less powerful than the ones that
can be expressed using Parsec because we only have a fixed range of
possibilities for each rule, rather than allowing previously parsed
input to determine what the parser will accept in the future. This is
the fundamental reason for using the applicative functor interface
rather than the monad interface here.
I'll give an example grammar for parsing expressions modelled by the
following data type:
> data Expr = ENum Int
> | ESum Expr Expr
> | EProduct Expr Expr
> deriving Show
To define a grammar in this formalism, one first has to define the set
of nonterminals that one wants to use:
> data NT a where
> Value :: NT Expr
> Product :: NT Expr
> Sum :: NT Expr
Now, a grammar is simply a function from members of this type to
productions. We use the applicative/alternative functor interface to
build up the productions. Conor's SHE would make this look a lot nicer,
using idiom brackets.
> myGrm :: Grammar NT
> myGrm Value = ENum <$> posInt
> <|> id <$ char '(' <*> nt Sum <* char ')'
>> myGrm Product = EProduct <$> nt Value <* char '*' <*> nt Product
> <|> id <$> nt Value
>> myGrm Sum = ESum <$> nt Product <* char '+' <*> nt Sum
> <|> id <$> nt Product
This needs a couple of simple functions to make things look nice:
> char :: Char -> Rule nt ()
> char c = Rule [Terminal c $ Stop ()]
> nt :: nt a -> Rule nt a
> nt nonterminal = Rule [NonTerminal nonterminal $ Stop id]
And a general definition for parsing single-digit numbers. This works
for any set of non-terminals, so it is a reusable component that works
for any grammar:
> choice :: Alternative f => [f a] -> f a
> choice = foldl (<|>) empty
>> digit :: Rule nt Int
> digit = choice [ x <$ char (intToDigit x) | x <- [0..9] ]
>> posInt :: Rule nt Int
> posInt = fix 1 . reverse <$> some digit
> where fix n [] = 0
> fix n (d:ds) = d*n + fix (n*10) ds
Bob
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.