Human Being, Occasional Smart Person, Just Some Guy

ANTLR3: Inserting Implied Zero-Width Tokens

(ANTLR users searching for an answer, you might skip the prose and head to the listings. They’re meant to be readable outside full context.)

What follows is a toy grammar, a gross oversimplification of one I’ve been working on, used to explore a fairly simple but rather underdocumented possibility in ANTLR: Having a single lexer rule emit multiple tokens (or zero tokens, or an altogether different token than specified by the rule).

Synopsis

This language is called “SomeSx”, and it consists of dumbed-down S-expressions with a special “sigil” modifier, as follows:

A document consists of zero or more elements delimited by whitespace.

An element is an atom, a list, or a sigiled item.

An atom is an unquoted string of letters and digits.

A list is zero or more elements delimited by whitespace, surrounded by ( and ).

A sigiled item is a $that is either

followed directly by an element (with no whitespace between), which is then its topic, or

not followed directly by an element, in which case it has a special empty topic.

One might think of a sigiled item as one that represents either a nullary (if the topic is empty) or unary (otherwise) function. The source

alpha $bravo $ (charlie $() delta $(echo foxtrot) golf $) $

might transliterate, as an example, to the following JavaScript-like expression:

Do note how sigiled is matched: There’s always a Sigil token and then a topic. That topic is either a normal element or some special EmptyMarker to indicate that there is no topic.

This is the problem, of course: In the language described above, the fact that a topic is empty doesn’t correspond to a token but to what regexp fiends would call a zero-width assertion. Before we get to why that’s a problem, I’ll sidestep it for a moment by presenting a lexer for a slightly broken version of the language.

Lexer, makeshift/broken demo version

Imagine that the language described above, instead of determining an empty topic by detecting that whatever follows a sigil is not an element, just has an explicit empty topic token ~. The above input sample would then be written

This works delightfully; alas, it doesn’t lex the language as specified. No, for that to happen, EmptyMarker would have to match not ~ but the empty string (a big no-no for any eventually deterministic lexer), and to only match it in well defined places.

One way to make a lexer rule in ANTLR that matches in exactly this way is to make it a fragment; we could use a syntactic predicate to cause the zero-width token to match only if whatever follows the sigil doesn’t look like the start of an element. It’s actually very easy.

Or not. The rule’s fragment-ness is a double-edged sword in this regard: Matching it does not emit any sort of token that can be used by the parser. Therefore, we’ll have to convince the lexer rule containing the fragment to emit it alongside any token it might have produced anyway.

And it can be done. There’s just some plumbing to do first.

Lexer support for arbitrary numbers of tokens per rule

Little aside. (And again, do skip it if you’re in a hurry. It’s hypocritical; I’m providing unnecessary information into an article I wrote out of frustration with information overload.)

Emitting arbitrary numbers of tokens from a lexer rule is actually a well-known problem in the ANTLR world, if not only because Python is so popular. In that language, the braces one would see in a C-like language are eschewed in favor of indentation changes. While that’s nice for the user (ymmv), it’s tricky for the lexer, which has to be able to spit out all sorts of made-up “indent” and “dedent” tokens resulting from monitoring the indentation state.

After running into much information overload on the subject, I flipped open my dead wood copy of The Definitive ANTLR Reference and found this gem around pages 110-111:

Note that, for efficiency reasons, the CommonTokenStream class does not support multiple token emissions for the same invocation of nextToken(). Read the following from class Lexer when you try to implement multiple token emission:

When I read this, I truly had to fight the urge to throw the tome hard across the room. ANTLR’s finest documentation, lovingly crafted by the creator, is unfortunately (but justifiably) not free information. I have the thing itself in my hands, and it not only tells me that I’m going to have to override some methods to get it to work the way I want, but then has the nerve not to offer a concrete implementation and that I’m just going to have to “try to implement” it.

Additionally, a comment in the code snippet offers this:

Currently does not support multiple emits per nextToken invocation for efficiency reasons.

It might not need to be default, but I’d think this is a common enough thing that it might be simple (and more descriptive and easier to carry across target platforms) to have some settable flag for multiple emit support rather than requiring ugly ad hoc overrides.

Of course, this was at the end of a night of fruitless searching for an answer, and it’s a wonder nobody was killed.

Fortunately for humanity itself, I was finally able to extract the core nugget of knowledge from a wiki page that had quite a bit of unrelated (to me, anyway) surrounding information hiding it from plain sight.

Near the end of the text itself was the treasure: The overrides for emit(Token) and nextToken() that I needed. Sadly, it was for an earlier version of Java and of ANTLR, but I was able to port code from a more recent post by a user of the C# target.

Without further ado, the very simple guts of a multi-emitting lexer, for Java 1.6 and ANTLR 3.4:

Modifying the lexer to produce the marker token on cue

Here, we just follow through with the modification ideas from earlier: Have the Sigil token perform as before if, and only if, it detects that it is immediately followed by an element. If not, have it match a zero-width fragment token, then emit the Sigil followed by the EmptyMarker tokens. The details are provided inline:

// Lexer changes for implicit empty topic
// EmptyMarker is changed to match a zero-length string. In ANTLR (at least), a
// zero-width lexer rule is practically useless (it could match anywhere, and
// it's unclear where it's supposed to) unless it is a fragment. As a fragment,
// it matches exactly where we put it. But as a fragment, it doesn't emit a
// token...without some effort (read on).
fragment EmptyMarker : ;
// This fragment reflects the beginning of any explicit sigil topic: the first
// token in an atom, list, or sigiled. Used in a syntactic predicate, we can
// determine whether a token we're matching is followed by an explicit,
// non-empty topic.
fragment StartOfNonEmptyTopic : Down | Atom | Sigil ;
Sigil : d=Dollar
// The Sigil token is changed to do what it did before as long as the token
// appears immediately before what looks like a non-empty topic.
( (StartOfNonEmptyTopic)=> // already does the right thing; add nothing.
// But if no such token follows, we have to output the sigil (as normal)
// but then also insert a zero-width EmptyMarker afterward.
| z=EmptyMarker // not followed by explicit topic; insert an empty one
{
// The tokens are fairly fluid; several of their properties can be
// changed at will. In this case, the type of a token can be
// changed while retaining its text content, line and column
// numbers, and so forth.
// From what I can tell, using emit(Token) directly will prevent
// any token being emitted by the rule itself. So, if you use it
// once for a fake token, be prepared to use it for the real one as
// well.
$d.setType(Sigil);
emit($d);
// You might have noticed that $z is already a EmptyMarker. Why set
// the type? From looking at the generated lexer code, it appears
// that a matched *fragment* rule creates a token with not the
// named type but an unmatchable dummy token type. However, the
// fragment token type (or any other token type) can be stuck back
// onto it to make it work.
$z.setType(EmptyMarker);
emit($z);
}
)
;

Peripheral caveats

Mostly not specific to this issue, but in general for ANTLR: Mind your generated code. Lots of stuff that might be wrong with your lexer will produce useless, misleading, or even absent diagnostic information. It’s not your fault, and fixing it might only require some changes in convention. For example:

I think I’m going to stick with CamelCase lexer rules, if only because ANTLR itself reserves a few all-caps ones such as DOWN and UP.

The aforementioned wiki page contains the phrase d=DIGITS r='..'; when I tried to match the dollar sign as a quoted string (i.e. d='$') the resulting variable in the generated lexer was an int rather than a Token. If you were curious why Dollar is its own fragment, this is the reason. (Perhaps double quotes would have worked—it wasn’t the first thing I tried, so I didn’t.)

Finally, I have to make sure to meditate and do my best not to be quick to anger toward Terence Parr et al. Language makes for really, really difficult computational problems, and despite the vast collection of warts in JARs, I do think ANTLR really is the best we’ve got, and I hate to consider the needless loss of life were I to attempt much of what I try to do with flex and bison.

The whole thing

Here is how everything fits together. Most comments have been removed to drive home how simple this whole thing is really supposed to be.