Long time, no see

Well, it's been longer than I would have liked since the last post. As usual in this industry, things always take longer than you expect. I'll not bore you with the details. So let's get on to the cool stuff.

EBNF

I've been writing about parsing for the last couple of posts. This always involves building up a grammar of some sort from smaller grammars and then passing the full grammar to a function to produce a parser. The point I've been trying to make, and will drive home in this post, is that the grammar itself is just a data structure. Deriving a parser from it is just one way to interpret it. Here are a couple of others.

But first, a digression.

It's convenient to have a way to write out a grammar in a concise, readable way that clearly specifies what strings are valid. EBNF is a very useful tool to have. Click on the link to read up on it.

One thing you can do with a grammar in Toccata is generate the EBNF for it. This can be done with this library. You can take any grammar, pass it to the produce-ebnf function, and it will spit one out. Using this library, the EBNF for the JSON grammar we looked at last time is

I'd originally intended to explain how the EBNF generator works, but as I reviewed it, I realized I would have to write about 5 posts and I have other things I want to get to. Maybe down the road, I'll come back to that. But the important thing I want to communicate is that a grammar in Toccata is just a data structure that can be used in many different ways. For instance, here is the EBNF for the Toccata language itself. (I just realized I should put this in a documentation directory.) This document is not the specification of the parser; that's here. But it's generated automatically from the specification, so it's always accurate as of the time it was generated. How many other languages do you know that have an EBNF you can examine to see if an expression is valid? Most of the time, whatever the parser accepts is what's considered valid, and there are almost always some weird corner cases.
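To make that idea concrete, here's a small sketch in Python (not Toccata) of a grammar as plain data, plus one interpreter that renders it as EBNF. The mini-format and names here are invented for illustration; Toccata's actual grammar values and its produce-ebnf function differ.

```python
# Sketch: a grammar as nested tuples, interpreted by walking the data.
# This is an invented mini-format, not Toccata's representation.

def to_ebnf(g):
    """Render a tiny grammar data structure as an EBNF fragment."""
    kind = g[0]
    if kind == "lit":                      # literal terminal
        return '"%s"' % g[1]
    if kind == "seq":                      # concatenation
        return ", ".join(to_ebnf(x) for x in g[1:])
    if kind == "alt":                      # alternatives
        return "(" + " | ".join(to_ebnf(x) for x in g[1:]) + ")"
    if kind == "star":                     # zero or more repetitions
        return "{" + to_ebnf(g[1]) + "}"
    raise ValueError("unknown grammar node: %s" % kind)

digits = ("star", ("alt", *[("lit", str(d)) for d in range(10)]))
integer = ("seq", ("alt", ("lit", "-"), ("lit", "")), digits)

print(to_ebnf(integer))
```

The same tuple could just as easily be handed to a function that builds a parser, which is the whole point: the data structure doesn't care how it's interpreted.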

A picture is worth ...

I forget when I discovered it, but this article was written in 2007 and deals with parsing, regular expressions and Finite State Automata. I cannot recommend it, and the three that follow it, highly enough. For present purposes, I want to draw attention to the diagrams scattered throughout. These show the different states a parser can go through as it reads each character of the input string. These diagrams can be very illuminating in a way the textual representation of the EBNF isn't. This is similar to the Syntax Diagrams on the JSON.org home page, except that syntax diagrams and EBNF show the structure of the grammar, while the state machine graphs show the states of the parser. Wouldn't it be nice to be able to generate these graphs automatically?

Using that library, here's the graph for the JSON grammar we've been looking at.

This did require some updates to the json.toc grammar. Under the hood, the 'grammar-graph' library converts the grammar to a Graphviz DOT file, which is then fed to the dot command to produce an SVG or PNG image. Slick!
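The DOT-generation step can be sketched in Python. This is only the flavor of what the 'grammar-graph' library does, not its code: walk a table of rules (an invented format here) and emit one edge for every rule a rule references.

```python
# Sketch: grammar rules -> Graphviz DOT source. The rule table format
# is invented for illustration, not grammar-graph's actual input.

def grammar_to_dot(rules):
    """rules: {name: [referenced rule names]} -> DOT source string."""
    lines = ["digraph grammar {"]
    for name, refs in rules.items():
        for ref in refs:
            lines.append('  "%s" -> "%s";' % (name, ref))
    lines.append("}")
    return "\n".join(lines)

# A hand-written approximation of the JSON rule references:
json_rules = {
    "value":  ["object", "array", "string", "number"],
    "object": ["string", "value"],
    "array":  ["value"],
}

dot = grammar_to_dot(json_rules)
print(dot)
# The output can then be rendered with Graphviz, e.g.:
#   dot -Tsvg grammar.dot -o grammar.svg
```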

The Toccata code is located in the 'examples' directory alongside the 'json.toc' file:

So that's three ways we can interpret a grammar. The only limit on others is your imagination or need. For instance, it would be easy to generate a regex from a grammar.
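Here's a rough sketch of that regex idea in Python, representing the grammar as plain nested tuples (an illustrative format, not Toccata's) and compiling it down to a regular expression string. This only works for grammars without recursion, since regexes can't express nesting.

```python
import re

# Sketch: compile a tuple-encoded grammar (invented format) to a regex.
def to_regex(g):
    kind = g[0]
    if kind == "lit":
        return re.escape(g[1])
    if kind == "seq":
        return "".join(to_regex(x) for x in g[1:])
    if kind == "alt":
        return "(?:" + "|".join(to_regex(x) for x in g[1:]) + ")"
    if kind == "star":
        return "(?:" + to_regex(g[1]) + ")*"
    raise ValueError("unknown grammar node: %s" % kind)

# An optional minus sign followed by digits:
integer = ("seq",
           ("alt", ("lit", "-"), ("lit", "")),
           ("star", ("alt", *[("lit", str(d)) for d in range(10)])))

pattern = re.compile(to_regex(integer) + "$")
print(bool(pattern.match("-42")))   # -> True
```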

An interesting thing happened on the way to the show

I have to say, the code to generate the graphs is not easy to grok. It's pretty deep and there are several requirements that all have to be satisfied simultaneously. A real nightmare. Which is why it's been a week (or more) since the last post. However, as I was writing that code and testing it against the JSON grammar, I noticed something.

Can you tell which state graph is incorrect? By looking at the state graph, I found a bug in my JSON grammar which I subsequently corrected. Having different ways to look at things is really useful.

Custom parsers

One final point. Being able to easily specify a grammar and generate a parser from it allows you to create custom parsers. Say you want to use JSON to serialize some data structures from your code. Instead of using a generic JSON parser library, you can write a custom parser that only accepts an object value and limits the key strings to a small set. This custom parser is probably going to be much faster than a generic parser that parses the JSON into a hash-map you then have to extract the fields from. And, depending on the parser generated from the custom grammar, it could give more useful error messages when it fails.
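To illustrate the shape of the idea in Python: a parser that only accepts an object with a fixed, known set of keys, and fails with a specific error otherwise. (A grammar-derived parser would enforce this during parsing; this sketch validates after a generic parse, and the key names are hypothetical.)

```python
import json

# Hypothetical schema: the only keys our custom parser accepts.
ALLOWED_KEYS = {"host", "port", "debug"}

def parse_config_object(text):
    """Accept only a top-level object whose keys are all in ALLOWED_KEYS."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    bad = set(data) - ALLOWED_KEYS
    if bad:
        raise ValueError("unexpected keys: %s" % sorted(bad))
    return data

print(parse_config_object('{"host": "localhost", "port": 8080}'))
```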

We'll wrap it up there, but there are some exciting things just ahead. I'm writing these posts to show how Toccata can be used for practical purposes, and we're going to delve into another format next time.

The JSON Spec

As promised, we're going to write a JSON parser today, with a couple of important caveats which I'll cover later. Before reading further, open this link in another tab. It's the official json.org site and has the JSON grammar specified as syntax diagrams.

Now for the actual grammar. I mean, how hard could it be to parse JSON?

The JSON grammar

Well actually, it's not that hard. We've already covered almost everything we need and with the addition of one more concept, we'll be golden. This post will mostly be a bunch of code with relatively little exposition.

When reading the JSON syntax diagrams, there's an assumption that is not explicitly stated: the use of whitespace to separate the various pieces. So our first task is to account for that.

The first actual entity we want to describe is numbers. But since Toccata doesn't have support for floats yet, we're going to ignore them. A JSON integer can be either positive or negative, so we'll describe positive ones first, then account for negative ones.
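The shape of that rule, an optional minus sign followed by one or more digits, can be sketched as a tiny hand-rolled parser in Python (the function name and return convention here are invented; Toccata's combinators are different):

```python
# Sketch: parse an optional '-' followed by one or more digits, returning
# (value, remaining input) on success or None on failure.
def parse_int(s):
    neg = s.startswith("-")
    body = s[1:] if neg else s
    i = 0
    while i < len(body) and body[i].isdigit():
        i += 1
    if i == 0:                      # no digits: not an integer
        return None
    value = int(body[:i])
    return (-value if neg else value, body[i:])

print(parse_int("-42,"))   # -> (-42, ',')
```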

Both the 'object' and 'array' structures from the JSON spec can nest inside each other. So arrays can contain values which are arrays or objects, and likewise for objects. But there's no way for a grammar definition to refer to itself by name, because grammars are strictly values and must be fully specified before being assigned to a name. Officially, these data structures are recursive. Not only that, they are mutually recursive. If you read the spec for arrays, you see that they can contain any number of values. And a value can be a string, number, object, array, etc. Therefore, before describing the grammar for array, we have to describe value, but we can't describe value without array. What to do, what to do?

The solution is to 'forward declare' a placeholder for value and then replace it with the real grammar later. Here's how you do that.

(def value (grmr/recurse "value"))

The string passed to grmr/recurse must match the string we'll use later in defining value. It doesn't have to be the same as the symbol we're assigning this to, but keeping them the same reduces confusion. If Toccata had a macro system (and it will eventually) this kind of boilerplate would be handled in some kind of defgrammar macro or something.
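For contrast, here's why a language built on functions doesn't need this trick. In Python, function bodies are resolved at call time, so mutually recursive parsers tie the knot for free; Toccata grammars are plain values, not functions, which is why the explicit grmr/recurse placeholder is needed. (This toy parser handles only digits and nested bracket arrays, with no error handling.)

```python
# Sketch: mutual recursion via late binding. parse_value references
# parse_array before parse_array is defined; that's fine, because the
# name is only looked up when parse_value is actually called.

def parse_value(s):
    if s.startswith("["):
        return parse_array(s)       # forward reference, resolved at call time
    i = 0
    while i < len(s) and s[i].isdigit():
        i += 1
    return (int(s[:i]), s[i:]) if i else None

def parse_array(s):
    s = s[1:]                       # consume '['
    items = []
    while not s.startswith("]"):
        item, s = parse_value(s)    # no error handling in this sketch
        items.append(item)
        s = s[1:] if s.startswith(",") else s
    return (items, s[1:])           # consume ']'

print(parse_value("[1,[2,3]]"))   # -> ([1, [2, 3]], '')
```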

There are a few funky things to account for the peculiarities of the JSON array spec, but it works. I'd suggest putting print-errs in various places and parsing some JSON to gain a better understanding of what's going on.

The first argument to grmr/recursive-rule must match the parameter passed to grmr/recurse above. In fact, they don't even have to be strings. Just two values that can be compared for equality. The second argument is the actual grammar that describes a value. In addition to what we've seen described above, there are 3 anonymous grammars for some JSON literal values.

And there you go, a grammar that describes JSON. Now create a parser from it.

(def parser (rd/parser value))

And you're good to go. I'll leave slurping in a file and parsing it for you to do on your own. Have fun!

EBNF

Syntax diagrams are not the only way to specify grammars. I'm planning on covering a different way next time. Along with something that I hope will be a mind blowing revelation. Or at least a little surprising.

We're rolling now

I don't know about you, but I'm really pleased at how this blog series is developing. Each post seems to be building on the last by adding a couple of new pieces. I would love to say I intended it that way, but I can't.

In this post, we're going to begin looking at file I/O in addition to parsing.

More than command lines

So far, we've seen how to parse the command line arguments passed to our main function. This could be extended to create a full blown arguments parsing library with long and short options, etc. But at some point, we're going to need more than what can be conveniently passed in on the command line. We're going to want to keep a file full of configuration information that will be read at each invocation and its contents used to inform our program's execution.

This turns out to be just a tiny step beyond what we've already seen. I'm of the opinion that config files should be relatively simple. If you need to use JSON or XML for nesting complex data structures, you're getting close to needing a full-blown DSL. And what you're writing is more of an interpreter than just a program with a config file. So for our purposes, we're going to limit the lines of a config file to

empty lines (only containing spaces and tabs)

comment lines beginning with '#'

value lines that have a name (alphabetic characters and '-') and a value (integers or strings delimited by '"')
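The three line shapes above can be sketched as Python regexes. Note the separator between name and value (whitespace here) is an assumption the post doesn't pin down, and the real version is built from grammar combinators, not regexes.

```python
import re

# Sketch: classify one config line into a (possibly empty) dict.
EMPTY_LINE   = re.compile(r"^[ \t]*$")
COMMENT_LINE = re.compile(r"^#")
VALUE_LINE   = re.compile(r'^([a-zA-Z-]+)[ \t]+(\d+|"[^"]*")[ \t]*$')

def classify(line):
    m = VALUE_LINE.match(line)
    if m:
        name, raw = m.groups()
        value = int(raw) if raw.isdigit() else raw.strip('"')
        return {name: value}
    if EMPTY_LINE.match(line) or COMMENT_LINE.match(line):
        return {}                   # ignored lines become empty maps
    return None                     # not a valid config line

print(classify("max-retries 5"))   # -> {'max-retries': 5}
print(classify("# a comment"))     # -> {}
```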

And now, the rubber meets the road. We need to combine these pieces to parse a config file line and create a data structure of the name string and config value. A moment's thought surfaces some requirements.

We don't know which names will be defined in the file

We don't know what order they will be defined in

We want to end up with a HashMap of all names to values in the file

There are multiple ways to do this. My preferred way is to put each name/value pair into a HashMap and then just compose all the maps after parsing all the lines.

Each line gets converted to a HashMap. But we need to ignore the comment and empty lines. So we define a 'higher order' grammar rule. (Really, it's just a function that takes a grammar rule and returns a modified version.) In this case, ignore takes a grammar rule and returns a rule that always returns an empty HashMap upon a successful match. This is different from, and does not replace, grmr/ignore from the grammar library.
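In Python terms, a higher-order rule is just a function wrapping another parser function. This sketch (with a hypothetical comment-line rule to wrap) shows the essence: pass the failure through, but replace any successful result with an empty dict.

```python
# Sketch: wrap a parser so a successful match yields an empty dict.
def ignore(rule):
    def wrapped(s):
        result = rule(s)
        if result is None:          # propagate failure unchanged
            return None
        _, rest = result
        return ({}, rest)           # success, but with an empty map result
    return wrapped

# Hypothetical rule: a comment line consumes the whole string.
def comment(s):
    return (s, "") if s.startswith("#") else None

print(ignore(comment)("# hello"))   # -> ({}, '')
```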

And then, it's straightforward to define our config file grammar. It's just none (since the file may be empty) or more lines. Each line may be either a comment or an empty line, which produces an empty HashMap. Or a configuration name/value pair.

This list of HashMaps is passed to the anonymous function as the parameter config-lines. Then we use the parametric form of comp, the comp* function, to squash all the HashMaps into a single HashMap. This also means that if a config name is defined multiple times in a file, only the last one will appear in the final HashMap.
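The comp* step amounts to a fold over the per-line maps. A Python sketch of the same behavior, where a later definition of a name overwrites an earlier one:

```python
from functools import reduce

# Sketch: merge a list of per-line dicts into one dict; later entries win,
# and empty dicts (from comments and blank lines) merge away harmlessly.
config_lines = [{"host": "localhost"}, {}, {"port": 8080}, {"port": 9090}]

config = reduce(lambda acc, m: {**acc, **m}, config_lines, {})
print(config)   # -> {'host': 'localhost', 'port': 9090}
```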

I'm going to leave the explanation of the main function out. You should be able to read this if you've read the previous posts. I will say that fio/slurp just pulls the entire contents of a file into a string in one shot.

Better than a library

In other languages, there might be a library or package that you would import to handle config files. Open source is great, but when you pull in a package like that, you take on the responsibility for keeping it integrated with your application and that comes at a cost. What I've tried to show here is how easy it is to have custom software to do exactly what you need.

For example, it would be trivial to only allow certain names in the config file. Or add different kinds of values. I think this is far superior to importing a bunch of libraries to do relatively simple tasks.