Chapter 10. Using Parsec

The task of parsing a file, or data of various types, is a common one for
programmers. You already learned about Haskell's support for regular
expressions back in the section called “Regular expressions in Haskell”. Regular expressions
are nice for many tasks. But they can become quite complex and difficult
when dealing with a data format that is complex. For instance, you
probably wouldn't want to use regular expressions to parse C source code.

Haskell has a very useful and unique library called Parsec. Parsec
is a parser combinator library. With Parsec, you will combine small
parsing functions together to build up more complex parsers. Parsec
provides some simple parsing functions, as well as functions to tie them
all together. It should come as no surprise that this parser library for
Haskell is built around the notion of functions.

Those of you that are familiar with parser tools in other languages may
be interested to know that Parsec can take the place of both parser tools
such as Yacc and lexers such as flex. Parsec can also be used to
implement a single-stange parser.

In this chapter, we will use Parsec to build up a parser for CSV files.
We will also use Parsec to build a more complex parser for sectioned
configuration files.

First Steps with Parsec: Simple CSV Parsing

Let's jump right in by writing some code for parsing a CSV file.
CSV files are often used as a plain text representation of spreadsheets
or databases. Each line is a record, and each field in the record is
separated from the next by a comma. There are ways of dealing with
fields that contain commas, but to start with, we won't worry about it.

This first example is much longer than it really needs to be. We will
introduce more Parsec features in a little bit that will shrink the
parser down to only four lines!

import Text.ParserCombinators.Parsec
{- A CSV file contains 0 or more lines, each of which is terminated
by the end-of-line character (eol). -}
csvFile :: GenParser Char st [[String]]
csvFile =
do result <- many line
eof
return result
-- Each line contains 1 or more cells, separated by a comma
line :: GenParser Char st [String]
line =
do result <- cells
eol -- end of line
return result
-- Build up a list of cells. Try to parse the first cell, then figure out
-- what ends the cell.
cells :: GenParser Char st [String]
cells =
do first <- cellContent
next <- remainingCells
return (first : next)
-- The cell either ends with a comma, indicating that 1 or more cells follow,
-- or it doesn't, indicating that we're at the end of the cells for this line
remainingCells :: GenParser Char st [String]
remainingCells =
(char ',' >> cells) -- Found comma? More cells coming
<|> (return []) -- No comma? Return [], no more cells
-- Each cell contains 0 or more characters, which must not be a comma or
-- EOL
cellContent :: GenParser Char st String
cellContent =
many (noneOf ",\n")
-- The end of line character is \n
eol :: GenParser Char st Char
eol = char '\n'
parseCSV :: String -> Either ParseError [[String]]
parseCSV input = parse csvFile "(unknown)" inputimport Text.ParserCombinators.Parsec
{- A CSV file contains 0 or more lines, each of which is terminated
by the end-of-line character (eol). -}
csvFile :: GenParser Char st [[String]]
csvFile =
do result <- many line
eof
return result
-- Each line contains 1 or more cells, separated by a comma
line :: GenParser Char st [String]
line =
do result <- cells
eol -- end of line
return result
-- Build up a list of cells. Try to parse the first cell, then figure out
-- what ends the cell.
cells :: GenParser Char st [String]
cells =
do first <- cellContent
next <- remainingCells
return (first : next)
-- The cell either ends with a comma, indicating that 1 or more cells follow,
-- or it doesn't, indicating that we're at the end of the cells for this line
remainingCells :: GenParser Char st [String]
remainingCells =
(char ',' >> cells) -- Found comma? More cells coming
<|> (return []) -- No comma? Return [], no more cells
-- Each cell contains 0 or more characters, which must not be a comma or
-- EOL
cellContent :: GenParser Char st String
cellContent =
many (noneOf ",\n")
-- The end of line character is \n
eol :: GenParser Char st Char
eol = char '\n'
parseCSV :: String -> Either ParseError [[String]]
parseCSV input = parse csvFile "(unknown)" input

Before looking at all this code in detail,
let's play with it a bit and see what
it does.

Look at that. Recall how we defined that each line must end with the
end-of-line character, and we didn't give it. Parsec's error message
helpfully indicated the line number and column number of the problem,
and even told us what it was expecting! Let's give it an end-of-line
character and continue experimenting.

You can see that parseCSV is doing exactly what we
wanted it to do. It's even handling empty cells and empty lines
properly.

Let's take a look at the code for this example. We didn't use many
shortcuts here, so remember that this will get shorter and simpler!

We've built it from the
top down,
so our first function is csvFile. The type of this
function is GenParser Char st [[String]]. This
means that the type of the input is a sequence of characters, which is
exactly what a Haskell string is, since String is
the same as [Char]. It also means that we will
return a value of type [[String]]: a list of a list
of strings. The st can be ignored for now.

Parsec programmers usually omit type declarations, since we write so
many small functions. Haskell's type inference can figure it out.
We've listed the types for the first example here so you can get a
better idea of what's going on. You can always use
:t in ghci to inspect types as well.

The csvFile uses a do block. Parsec is a monadic
parser combinator, but we are not using the IO monad here.
As we discussed in fixme: insert ref, not all monads
are IO monads.

We start by running many line.
many is a function that takes a function as an
argument. It tries to repeatedly
parse the input using the function passed to it. It gathers up the
results from all that repeated parsing and returns a list of them. So,
here, we are storing the results of parsing all lines in
result. Then we look for the end-of-file indicator,
called eof. Finally, we return the
result. So, a CSV file is made of of many lines,
then the end of file. You can often read out Parsec functions in plain
English just like that.

Now we must answer the question: what is a line? We define the
line function to do just that. Reading the
function, we can see that a line consists of cells followed by the end
of line character.

So what are cells? We defined them in the cells
function. The cells of a line start with the content of the first
cell, then continue with the content of the remaining cells, if any.
The result is simply the first cell and the remaining cells assembled
into a list.

Let's skip over remainingCells for a minute and
look at cellContent. A cell contains any number of
characters, but each character must not be a comma or end of line
character. The noneOf function matches one item, so
long as it isn't in the list of items that we pass. So, saying
many (noneOf ",\n") defines a cell the way we want
it.

Back in remainingCells, we have the first example of
a choice in Parsec. The choice operator is <|>. This operator
behaves like this: it will first try the parser on the left. If it
consumed no input, it will try the parser on the right.

So, in remainingCells, our task is to come up with
all the cells after the first. Recall that
cellContent uses noneOf ",\n".
So it will not consume the comma or end-of-line character from the
input. If we see a comma after parsing a cell, it means that at least
one more cell follows. Otherwise, we're done. So, our first choice in
remainingCells is char ','. This
parser simply matches the passed character in the input. If we found a
comma, we want this function to return the remaining cells on the line.
At this point, the "remaining cells" looks exactly like the start of
the line, so we call cells recursively to parse
them. If we didn't find a comma, we return the empty list, signifying
no remaining cells on the line.

Finally, we must define what the end-of-line indicator is. We set it
to char '\n', which will suit our purposes fine for
now.

At the very end of the program, we define a function
parseCSV that takes a String and parses it as a
CSV file. This function is just a shortcut that calls Parsec's
parse function, filling in a few parameters.
parse returns Either ParseError
[[String]] for the CSV file. If there was an error, the
return vaule will be Left with the error; otherwise, it will be
Right with the result.

The sepBy and endBy Combinators

We promised you earlier that we could simply our CSV parser
significantly by using a few Parsec helper functions. There are two
that will dramatically simplify this code. The first is
endBy, which takes two functions. It applies the
first function, notes its result, then applies the second. It will do
this over and over, and return a list of the results from the first
function.

The second tool for us is sepBy. It's like
endBy, but expects the very last item to not end
with the separator.

So, we can use endBy to parse lines, since every
line must end with the end-of-line character. We can use
sepBy to parse cells, since the last cell will not
end with a comma. Take a look at how much simpler our parser is now:

This program behaves exactly the same as the first one. You can verify
that by using ghci to re-run our examples from the earlier example.
You'll get the same result from every one. Yet the program is much
shorter and more readable. It won't be long before you can translate
Parsec code like this into a file format definition in plain English.
As you read over this code, you can see that:

A CSV file contains 0 or more lines,
each of which is terminated
by the end-of-line character.

A line contains 0 or more cells, separated by a comma.

A cell contains 0 or more characters, which must be
neither the comma nor the end-of-line character.

The end-of-line character is the newline,
\n.

Choices and Errors

Different operating systems use different characters to mark the
end-of-line. Unix/Linux systems, plus Windows in text mode, use
simply "\n".
FIXME: verify this on Windows
DOS and Windows systems use "\n\r", and Macs
traditionally use "\r\n". We could also support
a bare "\r" in case anybody uses that.

We could easily adapt our example to be able to handle all these types
of line endings in a single file. We would need to make two
modifications: adjust eol to recognize the different
endings, and adjust the noneOf pattern in
cell to ignore \r.

This must be done carefully. Recall that our earlier definition of
eol was simply char '\n'. There
is a parser called string that we can use to match
the multi-character patterns. Let's start by thinking of how we would
add support for \n\r.

This isn't quite right. Recall that the <|> operator always
tries the left alternative first. Looking for the single character
\n will match both types of line endings, so it will
look to the system that the following line begins with
\r. Not what we want. Try it in ghci:

It may look like it worked for both endings, but actually looking at it
this way, we can't tell. If it left something un-parsed, we don't
know, because we're not looking for anything else. So let's look for
the end-of-file after our end of line:

This also isn't right. Recall that <|> only attempts the option
on the right of the option on the left consumed no input. But by the
time we are able to see if there is a \r after the
\n, we've already consumed the
\n. This time, we fail on the other case in ghci:

We've stumbled upon the lookahead problem. It turns out that, when
writing parsers, it's often very convenient to be able to "look ahead"
at the data that's coming in. Parsec supports this, but before showing
you how to use it, let's see how you would have to write this to get
along without it. You'd have to manually expand all the options after
the \n like this:

This function first looks for \n. If it is found,
then it will look for \r, consuming it if possible.
Since the return type of char '\r' is a
Char, the alternative action is to simply return a
Char without attempting to parse anything. Parsec
has a function option that can also express this
idiom as option '\n' (char '\r'). Let's test this
with ghci.

This time, we got the right result! But we could have done it easier
with Parsec's lookahead support.

Lookahead

Parsec has a function called try that is used to express
lookaheads. try takes one function, a parser. It applies that
parser. If the parser doesn't succeed, try behaves as if it hadn't
consumed any input at all. So, when you use try on the left side
of <|>, Parsec will try the option on the right even if the
lift side failed after consuming some input. try only has an
effect if it is on the left of a <|>. Keep in mind, though,
that many functions use
<|> interally. Sometimes you may want
to add it to your code just in case it will be used on the left of a
<|>, and that works fine. Here's a way to add
expanded end-of-line support to our CSV parser using try:

Here we put both of the two-character endings first, and run both
tests under try. Both of them occur to the left of a <|>, so
they will do the right thing. We could have put string
"\n" within a try, but it wouldn't have altered any
behavior since they look at only one character anyway. We can load
this up and test the eol function in ghci.

As you can see, this program even supports different line endings
within a single file.

Error Handling

At the beginning of this chapter, you saw how Parsec could generate
error messages that list the location where the error occured as well
as what was expected. As parsers get more complex, the list of what
was expected can become cumbersome. Parsec provides a way for you to
specify custom error messages in the event of parse failures.

We added to the error result, but didn't really help clean up the
output. Parsec has an <?> operator that is designed for just these
situations. It is similar to <|> in that it first tries the
parser on its left. Instead of trying another parser in the event of
a failure, it presents an error message. Here's how we'd use it:

That's pretty helpful! The general rule of thumb is that you put a
human description of what you're looking for to the right of
<?>.

Extended Example: Full CSV Parser

Our earlier CSV examples have had an important flaw: they weren't able
to handle cells that contain a comma. CSV generating programs
typically put quotation marks around such data. But then you have
another problem: what to do if a cell contains a quotation mark and a
comma. In these cases, the embedded quotation marks are doubled up.

Here is a full CSV parser. You can use this from ghci, or if you
compile it to a standalone program, it will parse a CSV file on
standard input and convert it a different format on output.

That's a full-featured CSV parser in just 21 lines of code, plus an
additional 10 lines for the parseCSV and
main utility functions.

Let's look at the changes in this program from the previous versions.
First, a cell may now be either a bare cell or a "quoted" cell. We
give the quotedCell option first, because we want to
follow that path if the first character in a cell is the quote mark.

The quotedCell begins and ends with a quote mark,
and contains zero or more characters. These characters can't be copied
directly, though, because they may contain embedded, doubled-up, quote
marks themselves. So we define a custom quotedChar
to process them.

When we're processing characters inside a quoted cell, we first say
noneOf "\"". This will match and return any single
character as long as it's not the quote mark. Otherwise, if it is the
quote mark, we see if we have two of them in a row. If so, we return a
single quote mark to go on our result string.

Notice that try in quotedChar on the
right side of <|>. Recall that I said that
try only has an effect if it is on the left side of <|>. This
try does occur on the left side of a <|>, but on the left of
one that must be within the implementation of many.

This try is actually quite important. Let's say you are parsing a
quoted cell and are getting towards the end of it. There will be
another cell following. So you will expect to see a quote to end the
current cell, followed by a comma. When you hit
quotedChar, you will fail the
noneOf test and proceed to the test that looks for
two quotes in a row. You'll also fail that one because you'll have a
quote, then a comma. If you hadn't used try, you'd crash with an
error at this point, saying that it was expecting the second quote,
because the first quote was already consumed. Since we use try, this
is properly recognized as not a character that's part of the cell, so
it terminates the many quotedChar expression as
expected. Lookahead has once again proven very useful, and the fact
that it is so easy to add makes it a remarkable tool in Parsec.