Easy text parsing in C# with Sprache

A few days ago, I discovered a little gem: Sprache. The name means "language" in German. It’s a very elegant and easy-to-use library for creating text parsers with parser combinators, a very common technique in functional programming. The theoretical concept may seem a bit scary, but as you’ll see in a minute, Sprache makes it very simple.

Text parsing

Parsing text is a common task, but it can be tedious and error-prone. There are plenty of ways to do it:

manual parsing based on Split, IndexOf, Substring etc.

regular expressions

hand-built parser that scans the string for tokens

full-blown parser generated with ANTLR or a similar tool

and probably many others…

None of these options is very appealing. For simple cases, splitting the string or using a regex can be enough, but it doesn’t scale to more complex grammars. Building a real parser by hand for non-trivial grammars is, well, non-trivial. ANTLR requires Java and some knowledge of the tool, and it relies on code generation, which complicates the build process.

Fortunately, Sprache offers a very nice alternative. It provides many predefined parsers and combinators that you can use to define a grammar. Let’s walk through an example: parsing the challenge in the WWW-Authenticate header of an HTTP response (I recently had to write a parser by hand for this, and I wish I had known Sprache then).

The grammar

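The WWW-Authenticate header is sent by an HTTP server as part of a 401 (Unauthorized) response to indicate how you should authenticate. For example, a server could send one of these:

    WWW-Authenticate: Basic realm="FooCorp"
    WWW-Authenticate: Bearer realm="FooCorp", error=invalid_token, error_description="The access token has expired"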

What we want to parse is the "challenge", i.e. the value of the header. So, we have an authentication scheme (Basic, Bearer), followed by one or more parameters (name-value pairs). This looks simple enough: we could probably just split by ',' then by '=' to get the values… but the double quotes complicate things, since quoted strings could contain the ',' or '=' characters. Also, the double quotes are optional if the parameter value is a single token, so we can’t rely on them being there (or not). If we want to parse this reliably, we’re going to have to look at the specs.

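The WWW-Authenticate header is described in detail in RFC 2617. The grammar looks like this, in what the RFC calls "augmented Backus-Naur Form" (see RFC 2616 §2.1). The first three rules come from RFC 2617 §1.2, the others from RFC 2616 §2.2:

    challenge      = auth-scheme 1*SP 1#auth-param
    auth-scheme    = token
    auth-param     = token "=" ( token | quoted-string )

    token          = 1*<any CHAR except CTLs or separators>
    separators     = "(" | ")" | "<" | ">" | "@"
                   | "," | ";" | ":" | "\" | <">
                   | "/" | "[" | "]" | "?" | "="
                   | "{" | "}" | SP | HT
    quoted-string  = ( <"> *(qdtext | quoted-pair ) <"> )
    qdtext         = <any TEXT except <">>
    quoted-pair    = "\" CHAR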

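Let’s start with the lowest-level rules: the character classes for separators and control characters. Each rule is declared as a Parser<T>; since these rules match single characters, they are of type Parser<char>.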

The Parse class from Sprache exposes parser primitives and combinators.

Parse.Chars matches any character from the specified string; we use it to specify the list of separator characters.

The overload of Parse.Char that we use here takes a predicate that will be called to check if the character matches, and a description of the character class. Here we just use System.Char.IsControl as the predicate to match control characters.
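As a rough sketch of what these two character classes could look like (the names SeparatorChar and ControlChar and the enclosing Grammar class are my assumptions, and the separator list is the one from RFC 2616 §2.2):

```csharp
using Sprache;

public static class Grammar
{
    // separators: ( ) < > @ , ; : \ " / [ ] ? = { } SP HT
    public static readonly Parser<char> SeparatorChar =
        Parse.Chars("()<>@,;:\\\"/[]?={} \t");

    // CTL, approximated by whatever char.IsControl considers a control character
    public static readonly Parser<char> ControlChar =
        Parse.Char(char.IsControl, "Control character");
}
```

The description passed to Parse.Char is only used in error messages, so any readable label works.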

Now, let’s define a TokenChar character class to match characters that can be part of a token. As per the RFC, this can be any character not in the previous classes:
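One possible way to write this, assuming the two character classes are named SeparatorChar and ControlChar (they are repeated here so the snippet compiles on its own). Note that Parse.AnyChar is a simplification: the RFC restricts CHAR to US-ASCII, which this sketch doesn’t enforce.

```csharp
using Sprache;

public static class Grammar
{
    private static readonly Parser<char> SeparatorChar =
        Parse.Chars("()<>@,;:\\\"/[]?={} \t");

    private static readonly Parser<char> ControlChar =
        Parse.Char(char.IsControl, "Control character");

    // Any character that is neither a separator nor a control character
    public static readonly Parser<char> TokenChar =
        Parse.AnyChar
            .Except(SeparatorChar)
            .Except(ControlChar);

    // token = 1*<any CHAR except CTLs or separators>
    public static readonly Parser<string> Token =
        TokenChar.AtLeastOnce().Text();
}
```

AtLeastOnce() requires at least one token character, which matches the "1*" in the grammar, and Text() turns the matched characters into a string.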

The QdText rule doesn’t require much explanation, but QuotedPair is more interesting… It is written like a LINQ query: this is Sprache’s way of specifying a sequence. This particular query means: match a backslash (named _ because we ignore it) followed by any character named c, and return just c. Quoted pairs are not escape sequences in the same sense as in C, Java or C#, so "\n" isn’t interpreted as "new line" but just as "n".
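Here is a minimal sketch of these two rules. The helper names DoubleQuote and Backslash are assumptions on my part, as is the use of Parse.AnyChar for TEXT and CHAR.

```csharp
using Sprache;

public static class Grammar
{
    private static readonly Parser<char> DoubleQuote = Parse.Char('"');
    private static readonly Parser<char> Backslash = Parse.Char('\\');

    // qdtext = <any TEXT except <">>
    public static readonly Parser<char> QdText =
        Parse.AnyChar.Except(DoubleQuote);

    // quoted-pair = "\" CHAR : the backslash is matched, then discarded
    public static readonly Parser<char> QuotedPair =
        from _ in Backslash
        from c in Parse.AnyChar
        select c;
}
```

The query syntax works because Sprache defines Select and SelectMany for Parser<T>, so the C# compiler can translate the from/select clauses into parser combinations.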

We can now write the rule for a quoted string:

private static readonly Parser<string> QuotedString =
    from open in DoubleQuote
    from text in QuotedPair.Or(QdText).Many().Text()
    from close in DoubleQuote
    select text;

The Or method indicates a choice between two parsers. QuotedPair.Or(QdText) will try to match a quoted pair, and if that fails, it will try to match a QdText instead.

Many() indicates any number of repetitions (including zero).

Text() combines the characters into a string.
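To see the rule in action, here is a small self-contained program. The QuotedStringDemo class and the sample input are made up for illustration, and the helper rules are written the way I assume they were defined earlier.

```csharp
using System;
using Sprache;

public static class QuotedStringDemo
{
    private static readonly Parser<char> DoubleQuote = Parse.Char('"');
    private static readonly Parser<char> QdText = Parse.AnyChar.Except(DoubleQuote);
    private static readonly Parser<char> QuotedPair =
        from _ in Parse.Char('\\')
        from c in Parse.AnyChar
        select c;

    public static readonly Parser<string> QuotedString =
        from open in DoubleQuote
        from text in QuotedPair.Or(QdText).Many().Text()
        from close in DoubleQuote
        select text;

    public static void Main()
    {
        // The raw input is: "The \"access\" token, expired" (outer quotes included)
        string input = "\"The \\\"access\\\" token, expired\"";

        // The surrounding quotes are dropped and each \" becomes a plain "
        Console.WriteLine(QuotedString.Parse(input)); // The "access" token, expired
    }
}
```

The order in QuotedPair.Or(QdText) matters: QdText would also accept a backslash, so the quoted pair has to be tried first.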

We now have all the basic building blocks, so we can move on to higher level rules.

Parsing challenge parameters

A challenge is made of an auth scheme followed by one or more parameters. The auth scheme is trivial (it’s just a token), so let’s start by parsing the parameters.

Although there isn’t a named rule for it in the grammar, let’s define a rule for parameter values. The value can be either a token or a quoted string.

The rule for a parameter then matches a token (the parameter name), followed by the '=' sign, followed by a parameter value, and combines the name and value into a Parameter instance.
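These two rules could be sketched as follows. The shape of the Parameter class (a constructor taking the name and the value) and the ParameterValue name are assumptions, and the Token and QuotedString rules are condensed and repeated so the snippet stands alone.

```csharp
using Sprache;

// Hypothetical holder for a name/value pair
public sealed class Parameter
{
    public Parameter(string name, string value) { Name = name; Value = value; }
    public string Name { get; }
    public string Value { get; }
}

public static class Grammar
{
    private static readonly Parser<string> Token =
        Parse.AnyChar
            .Except(Parse.Chars("()<>@,;:\\\"/[]?={} \t"))
            .Except(Parse.Char(char.IsControl, "Control character"))
            .AtLeastOnce().Text();

    private static readonly Parser<char> DoubleQuote = Parse.Char('"');

    private static readonly Parser<string> QuotedString =
        from open in DoubleQuote
        from text in Parse.Char('\\').Then(_ => Parse.AnyChar)
            .Or(Parse.AnyChar.Except(DoubleQuote)).Many().Text()
        from close in DoubleQuote
        select text;

    // ( token | quoted-string )
    public static readonly Parser<string> ParameterValue =
        Token.Or(QuotedString);

    // auth-param = token "=" ( token | quoted-string )
    public static readonly Parser<Parameter> Parameter =
        from name in Token
        from _ in Parse.Char('=')
        from value in ParameterValue
        select new Parameter(name, value);
}
```

Since a token can never start with a double quote (it’s a separator), Token.Or(QuotedString) falls through to the quoted string whenever the value is quoted.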

Now let’s parse a sequence of one or more parameters. Parameters are separated by commas (','), with optional leading and trailing whitespace (look for "#rule" in RFC 2616 §2.1). The grammar for lists allows several commas without items in between, e.g. "item1 ,, item2,item3, ,item4", so the rule for the delimiter only has to match the first comma; the rest can be any number of commas or whitespace characters. The rule returns the comma because it has to return something, but we won’t actually use it.
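A sketch of such a delimiter rule, using the name ListDelimiter that appears further down (the Comma helper is my own addition):

```csharp
using Sprache;

public static class Grammar
{
    private static readonly Parser<char> Comma = Parse.Char(',');

    // Optional leading whitespace, one mandatory comma, then any mix of
    // further commas and whitespace; only the first comma is returned
    public static readonly Parser<char> ListDelimiter =
        from leading in Parse.WhiteSpace.Many()
        from c in Comma
        from trailing in Parse.WhiteSpace.Or(Comma).Many()
        select c;
}
```

Whitespace alone is not a delimiter here: without at least one comma, the rule fails.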

We could now match the sequence of parameters like this:

private static readonly Parser<Parameter[]> Parameters =
    from first in Parameter.Once()
    from others in (
        from _ in ListDelimiter
        from p in Parameter
        select p).Many()
    select first.Concat(others).ToArray();

But it’s not very straightforward… fortunately Sprache provides an easier option with the DelimitedBy method:
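To keep the example short, the sketch below applies DelimitedBy to a stand-in Item rule (a run of letters or digits, my own simplification) rather than to the real Parameter rule; with the actual grammar the call would presumably be Parameter.DelimitedBy(ListDelimiter).

```csharp
using System;
using System.Linq;
using Sprache;

public static class DelimitedByDemo
{
    private static readonly Parser<char> Comma = Parse.Char(',');

    private static readonly Parser<char> ListDelimiter =
        from leading in Parse.WhiteSpace.Many()
        from c in Comma
        from trailing in Parse.WhiteSpace.Or(Comma).Many()
        select c;

    // Stand-in for the Parameter rule: one or more letters or digits
    private static readonly Parser<string> Item =
        Parse.LetterOrDigit.AtLeastOnce().Text();

    // One or more items separated by the list delimiter
    public static readonly Parser<string[]> Items =
        from items in Item.DelimitedBy(ListDelimiter)
        select items.ToArray();

    public static void Main()
    {
        string[] items = Items.Parse("item1 ,, item2,item3, ,item4");
        Console.WriteLine(string.Join("|", items)); // item1|item2|item3|item4
    }
}
```

DelimitedBy discards whatever the delimiter parser returns, which is why it doesn’t matter that ListDelimiter yields a comma.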

If there’s a syntax error in the input text, the Parse method throws a ParseException with a message describing where and why the parsing failed. For instance, if I remove the space between "Bearer" and "realm", parsing fails at the '=' sign, where whitespace was expected after the scheme.
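Putting everything together, a complete grammar with a top-level challenge rule might look like the sketch below. The Challenge and Parameter classes, the Program class and the sample inputs are assumptions for illustration, and Parse.WhiteSpace is a looser stand-in for the "1*SP" of the RFC.

```csharp
using System;
using System.Linq;
using Sprache;

// Hypothetical result types
public sealed class Parameter
{
    public Parameter(string name, string value) { Name = name; Value = value; }
    public string Name { get; }
    public string Value { get; }
}

public sealed class Challenge
{
    public Challenge(string scheme, Parameter[] parameters) { Scheme = scheme; Parameters = parameters; }
    public string Scheme { get; }
    public Parameter[] Parameters { get; }
}

public static class Grammar
{
    private static readonly Parser<string> Token =
        Parse.AnyChar
            .Except(Parse.Chars("()<>@,;:\\\"/[]?={} \t"))
            .Except(Parse.Char(char.IsControl, "Control character"))
            .AtLeastOnce().Text();

    private static readonly Parser<char> DoubleQuote = Parse.Char('"');
    private static readonly Parser<char> QdText = Parse.AnyChar.Except(DoubleQuote);
    private static readonly Parser<char> QuotedPair =
        from _ in Parse.Char('\\')
        from c in Parse.AnyChar
        select c;

    private static readonly Parser<string> QuotedString =
        from open in DoubleQuote
        from text in QuotedPair.Or(QdText).Many().Text()
        from close in DoubleQuote
        select text;

    private static readonly Parser<Parameter> Parameter =
        from name in Token
        from _ in Parse.Char('=')
        from value in Token.Or(QuotedString)
        select new Parameter(name, value);

    private static readonly Parser<char> Comma = Parse.Char(',');
    private static readonly Parser<char> ListDelimiter =
        from leading in Parse.WhiteSpace.Many()
        from c in Comma
        from trailing in Parse.WhiteSpace.Or(Comma).Many()
        select c;

    private static readonly Parser<Parameter[]> Parameters =
        from p in Parameter.DelimitedBy(ListDelimiter)
        select p.ToArray();

    // challenge = auth-scheme 1*SP 1#auth-param
    public static readonly Parser<Challenge> Challenge =
        from scheme in Token
        from _ in Parse.WhiteSpace.AtLeastOnce()
        from parameters in Parameters
        select new Challenge(scheme, parameters);
}

public static class Program
{
    public static void Main()
    {
        Print("Bearer realm=\"FooCorp\", error=invalid_token");
        Print("Bearerrealm=\"FooCorp\""); // missing space: Parse throws
    }

    private static void Print(string input)
    {
        try
        {
            Challenge challenge = Grammar.Challenge.Parse(input);
            Console.WriteLine($"Scheme: {challenge.Scheme}");
            foreach (Parameter p in challenge.Parameters)
                Console.WriteLine($"- {p.Name} = {p.Value}");
        }
        catch (ParseException ex)
        {
            // The message says where parsing stopped and what was expected
            Console.WriteLine(ex.Message);
        }
    }
}
```

If you’d rather not deal with exceptions, TryParse returns a result object whose WasSuccessful property tells you whether the input matched.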

Conclusion

As you can see, Sprache makes it very simple to parse complex text. The code isn’t particularly short, but it’s completely declarative; there are no loops, no conditionals, no temporary variables, no state… This makes it very easy to understand, and it can easily be compared with the actual grammar definition to check its correctness. It also provides pretty good feedback in case of error, which is hard to accomplish with a hand-built parser.