Generating Parsers with ANTLR 3

By Dean Wette, OCI Principal Software Engineer

November 2007

Introduction

Someone hands you a file containing information - data or metadata - and instructs you to implement the facility to read it into a program for processing. As a Java developer, what do you do? In the past, using Java 1.1 or 1.2, parsing involved general purpose classes and language facilities like String, StringBuffer, StringTokenizer, array traversal, and so on. Starting with version 1.4, the Java regular expression API added power and flexibility to the process. Java 5 improved the situation further with the Scanner API. The common thread in all of this is that we are constructing a brute-force reader: we consume characters from the input and try to make sense of them within some structural context.

There is a better way, one that is more readily adaptable and tolerant of change. Treat the structure of the input as a language, define a grammar for it, and create a parser based on the grammar. A grammar describes a language, and many kinds of structure can be thought of in terms of a language; once the concepts of language description are understood, defining grammars becomes relatively simple. Writing the parser is still the hard part. Fortunately, tools exist that automate the generation of a parser from the grammar one defines for a language. ANTLR 3 is one such tool.

ANTLR stands for ANother Tool for Language Recognition. It provides a Java-based framework for generating recognizers, parsers, and translators from a grammar. Terence Parr, a professor of computer science at the University of San Francisco, created ANTLR and continues to develop it actively. ANTLR - along with the ANTLRWorks IDE, documentation, contributions, and a wiki - is provided free to users at http://www.antlr.org. Parr is also the author of The Definitive ANTLR Reference, an excellent book on ANTLR (also available as a PDF file).

Domain specific languages

Domain specific languages (DSLs) are programming languages that solve specific problems, unlike general purpose languages such as C++ or Java. DSLs are higher level than general purpose languages and run the gamut from scripting languages to simple configuration file formats. Writing a grammar for a DSL is easier than writing your own brute-force parser, and when the DSL changes, ANTLR makes it very easy to regenerate the code for parsing and translating it.

ANTLR grammars are based on Extended Backus-Naur Form (EBNF), used to describe recursive grammar rule definitions. The basic form of EBNF is

A : B

where A is a symbol replaced by B. If B is a rule, it is replaced by its right side, and so on, until all non-terminal symbols (rule labels, etc.) are replaced by terminal symbols (literals).

The following simple grammar from the ANTLR Quick Starter guide illustrates a simple example of EBNF:

utterance : greeting | exclamation;
greeting : interjection subject;
exclamation : 'Hooray!';
interjection : 'Hello';
subject : 'World!';
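Applying the replacement process to this grammar, the only sentences in the language are 'Hooray!' and 'Hello World!'. For the latter, the derivation expands non-terminals until only literals remain:

utterance
  -> greeting               (choose the greeting alternative)
  -> interjection subject   (expand greeting)
  -> 'Hello' subject        (expand interjection)
  -> 'Hello' 'World!'       (expand subject)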

Notice that rules to the right of the colon - the non-terminal symbols - get replaced by their definitions repeatedly until only the literals (terminals) remain. A somewhat more complicated example comes from the Java Language Specification. While it is not a valid ANTLR grammar, it serves as another good example of BNF.

Statement:
    Block
    ...
    StatementExpression ;
    ...

Block:
    { BlockStatements }

BlockStatements:
    { BlockStatement }

BlockStatement:
    LocalVariableDeclarationStatement
    ClassOrInterfaceDeclaration
    Statement
This illustrates the recursive nature of EBNF: the rule for Statement above is defined in terms of itself via the BlockStatement rule. As with all recursion, Statement must ultimately resolve to a base case that is replaced by a terminal symbol.

An Example

An application I support that analyzes input data requires a metadata file describing what that input contains. The metadata file serves as the bridge that provides meaning to the analysis workflow implemented by the application; the details are unimportant to this discussion. The application began life using XML as the metadata format. In the early days it was fairly straightforward, and the application included a tool to generate the metadata from sample input data. Using that generated metadata, users added some additional detail that couldn't be extracted from the data itself. As the application evolved, the XML metadata grew increasingly complex. Finally the users rebelled, stating that the format had become inefficient and too difficult to handle effectively. To make a long story short - rather than spending time creating something else they might not like - the users were asked what they would like to see. They provided a configuration file format based on a legacy tool they had used in the past, similar to this example.

# this is repeated any number of times
DEFINE-GROUP NAME=BODF TYPE=VMT COMPONENT=BODY SIDE=ALL
VARIABLES
X UNITS-IN=LBS UNITS-OUT=KIPS TEXT="X"
Y UNITS-IN=LBS UNITS-OUT=KIPS TEXT="Y"
Z UNITS-IN=LBS UNITS-OUT=KIPS TEXT="Z"
END-VARIABLES
STATIONS TEXT="Body Station (in)"
1.0 STAT00
1.1 STAT01
1.2 STAT02
1.3 STAT03
END-STATIONS
END-DEFINE-GROUP

The following grammar defines the rules used to recognize the new metadata format. Rule names to the left of the colons are defined by sequences on the right consisting of other rules and tokens.

grammar MetaDef;

metadef : group+ EOF;

group : 'DEFINE-GROUP'
        property* NEWLINE+
        variables
        stations
        'END-DEFINE-GROUP' NEWLINE+;

variables : 'VARIABLES' NEWLINE+
        variable*
        'END-VARIABLES' NEWLINE+;

variable : STRING property* NEWLINE+ ;

stations : 'STATIONS' property* NEWLINE+
        station*
        'END-STATIONS' NEWLINE+;

station : STRING NEWLINE+;

property : STRING EQ STRING;

// lexer rules - must start with uppercase letter
EQ : '=';
DIGITS : '0'..'9' ;
LC : 'a'..'z' ;
UC : 'A'..'Z' ;
NEWLINE : '\n'|'\r'('\n')? ;
STRING : (LC|UC|DIGITS|'_'|'-'|','|'.')+ | ('"' (~'"')* '"');
WS : (' '|'\t')+ { $channel=HIDDEN; } ;

The labels in all uppercase define lexer rules. Grammar recognition involves two steps for our example, and possibly more for grammars with greater complexity. The lexer - the process of lexical analysis - converts sequences of input characters into sequences of tokens. Parsing then analyzes the sequence of tokens to determine grammatical structure.
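To make the tokenization step concrete, the sketch below simulates it in plain Java. This is a simplification - not the lexer ANTLR generates - using regular expressions patterned after the STRING, EQ, NEWLINE, and WS rules; the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LexerSketch {
    // Alternatives modeled loosely on the grammar's lexer rules.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<STRING>[A-Za-z0-9_,.-]+|\"[^\"]*\")"  // bare word or quoted text
        + "|(?<EQ>=)"
        + "|(?<NEWLINE>\r\n|\r|\n)"
        + "|(?<WS>[ \t]+)");                      // whitespace, to be hidden

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group("WS") != null) continue;  // like $channel=HIDDEN
            if (m.group("NEWLINE") != null) { tokens.add("NEWLINE"); continue; }
            if (m.group("EQ") != null) { tokens.add("EQ"); continue; }
            tokens.add("STRING:" + m.group("STRING"));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("NAME=BODF TYPE=VMT"));
        // [STRING:NAME, EQ, STRING:BODF, STRING:TYPE, EQ, STRING:VMT]
    }
}
```

The real generated lexer does much more - it tracks positions, token types, and channels - but the essential job is the same: turn characters into a token stream the parser can analyze.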

However, the example given above is not enough to provide a functional solution. It only performs input recognition. The lexer and parser generated by ANTLR for this grammar automatically emit warnings for input that doesn't conform to the grammar, but they don't translate the input into anything usable beyond verifying correctness. Generating output requires another step, and involves one or more of several possibilities:

- Generate an Abstract Syntax Tree (AST), often needed for situations that require multiple parsing passes.

- Use the StringTemplate utility included with ANTLR 3.

- Embed actions into the grammar that ANTLR includes in the generated parser - the approach taken for our example.

Adding ANTLR actions to the grammar provides the translation and interpretation component to the overall workflow; otherwise, we only get a recognizer for the DSL. Embedding Java statements in the grammar directs the ANTLR generator to embed that code into the generated lexer and parser classes. This is similar in concept to how JSP scriptlet code becomes part of the generated servlet code.

grammar MetaDef2;

@header {
package com.ociweb.dsl;

import java.util.List;
import java.util.ArrayList;
import java.util.Set;
import java.util.LinkedHashSet;

import org.apache.log4j.Logger;
}

@lexer::header {
package com.ociweb.dsl;
}

@members {
private static Logger logger =
    Logger.getLogger(MetaDef2Parser.class);

private List<Group> groups = new ArrayList<Group>();
private List<Property> controlData = new ArrayList<Property>();

public List<Group> getGroups() {
    return groups;
}
} // end of @members
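The @members block and the rule actions refer to model classes such as Group and Property that the article does not list. A minimal hypothetical sketch of two of them follows, purely to make the action code concrete; all names and method bodies here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model classes the grammar actions populate.
class Property {
    private final String name;
    private final String value;
    Property(String name, String value) {
        this.name = name;
        this.value = value;
    }
    public String getName() { return name; }
    public String getValue() { return value; }
}

class Group {
    private final List<Property> properties = new ArrayList<Property>();
    public void addProperty(Property p) { properties.add(p); }
    public List<Property> getProperties() { return properties; }
}

public class ModelDemo {
    public static void main(String[] args) {
        // Mirrors what the group rule's actions do as properties are parsed.
        Group group = new Group();
        group.addProperty(new Property("NAME", "BODF"));
        group.addProperty(new Property("TYPE", "VMT"));
        System.out.println(group.getProperties().size()); // prints 2
    }
}
```

After parsing, client code would retrieve the populated model through the public getGroups() method defined in @members.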

The three sections above illustrate the use of global actions. The @header block defines code that appears in the parser before the class definition. Package and import statements belong here. The @lexer::header block works identically, but for the lexer class generated by ANTLR. The @members block encloses actions for initializing class fields and member methods. Use this section to initialize data used by rule bodies and to define supporting methods, including public methods callable from parser client code.

The rule definitions appear next. Actions found within the rules - any code appearing in blocks enclosed by braces - become local code in the parser methods ANTLR generates to handle each rule. ANTLR inserts the @init action code after its own initialization code, but before the code it generates for the rule body. @after actions (not shown in the example) can also be added to specify code that appears after that generated for the rule body.

// start rule
metadef : (group)+ EOF;

group
@init {
    Group group = new Group();
    groups.add(group);
}
    : 'DEFINE-GROUP'
      { System.out.println("begin group"); }
      (p=property { group.addProperty(p); })* NEWLINE+
      // can also use group.addProperty($property)
      // and not assign to p
      vars=variables
      { group.setVariables(vars); }
      stats=stations { group.setStations(stats); }
      'END-DEFINE-GROUP' NEWLINE+
      { System.out.println("end group"); } ;

ANTLR also supports rule parameters and rule return values for actions. Rule parameters are not presented in the example; when used, they effectively create user-defined attributes available to the rule body. Rule returns also create user-defined attributes. Actions in the calling rule use these attributes via label properties. Notice how the action in the rule above assigns the return value from the variables rule reference to an attribute (vars) used by an action within the rule body. The variables rule itself instantiates the return value within a rule action of its own.
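The article does not reproduce that version of the variables rule, but a hedged sketch of what it might look like follows. The Variables model class, its addVariable method, and the assumption that the variable rule declares a similar return value are all inventions for illustration.

```antlr
variables returns [Variables vars]
@init {
    $vars = new Variables();   // instantiate the return value up front
}
    : 'VARIABLES' NEWLINE+
      (v=variable { $vars.addVariable(v); })*
      'END-VARIABLES' NEWLINE+ ;
```

Inside the defining rule the return attribute is referenced as $vars; a calling rule reads it through the label on the rule reference, as the group rule does with vars=variables.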

Building DSLs with ANTLR

Although the ANTLR tool itself provides the ability to generate the final lexer and parser code for Java, it doesn't include a way to integrate that step easily into a build process. While ANTLRWorks (see below) makes it easy to generate the code, it doesn't help with automated builds since it requires user interaction in a GUI. Fortunately, one of the contributions found at the ANTLR web site includes an Ant task for invoking the ANTLR parser generator, and is featured prominently on the front page of the ANTLR web site.

To get started, add the antlr3-task.jar file to the lib directory in your installed Ant. Add the ANTLR jar files for parser generation to your build classpath. For example:

[project]
|- lib/tools/antlr
   |- antlr-2.7.7.jar
   |- antlr-3.0.1.jar
   |- stringtemplate-3.1b1

It's unnecessary to include these files in the runtime classpath. Deployment requires only the runtime jar file (antlr-runtime-3.0.1) in the execution classpath. I define the following Ant path for the antlr3 task.

<path id="antlr.class.path">
    <fileset dir="${lib.dir}/tools/antlr">
        <include name="**/*.jar"/>
    </fileset>
</path>

The Ant task distribution includes an example project with an Ant build.xml file demonstrating its use. I prefer creating an Ant macrodef with defaults that work for my projects to simplify use (and reuse), as follows:
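The macrodef itself does not survive in this copy of the article, so the sketch below reconstructs one consistent with the gen-antlr target's invocation. The attribute names come from that invocation; the default values, source layout, and nested classpath element are assumptions.

```xml
<macrodef name="antlr3-def">
    <attribute name="antlr.gensrc.dir" default="${gen.src.dir}"/>
    <attribute name="antlr.grammar.name"/>
    <attribute name="antlr.package.dir" default="${gen.antlr.package.dir}"/>
    <sequential>
        <!-- hypothetical layout: grammar files live under src/<package> -->
        <antlr3 target="src/@{antlr.package.dir}/@{antlr.grammar.name}"
                outputdirectory="@{antlr.gensrc.dir}/antlr/@{antlr.package.dir}">
            <classpath refid="antlr.class.path"/>
        </antlr3>
    </sequential>
</macrodef>
```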

Using the Ant macrodef is straightforward. Define the following properties, either in a properties file or a build.xml, and invoke the gen-antlr target.

gen.src.dir=gen-src
gen.antlr.dir=${gen.src.dir}/antlr
gen.antlr.package.dir=com/ociweb/dsl
antlr.dsl.grammar=Datadefinition.g

<target name="gen-antlr" depends="clean.antlr">
    <mkdir dir="${gen.antlr.dir}/${gen.antlr.package.dir}"/>
    <antlr3-def
        antlr.gensrc.dir="${gen.src.dir}"
        antlr.grammar.name="${antlr.dsl.grammar}"
        antlr.package.dir="${gen.antlr.package.dir}"/>
</target>

There is a lot more to explore and learn about ANTLR and parser generation than is within the scope of this discussion. The ANTLR web site and Parr's book are excellent places to start. But something that helped me a lot is ANTLRWorks, the grammar development IDE developed by Jean Bovet. Prior to working with ANTLR, language grammars were not an intellectual focus for me, so I was happy to get help with the learning curve associated with creating my first grammar. ANTLRWorks includes a very nice debugger, as well as tools for generating the lexer and parser code in Java, an interpreter, and a syntax diagrammer. The grammar interpreter and debugger alone are reason enough to use ANTLRWorks, especially for those new to creating grammars for DSLs.