Writing a Lexer in Java 1.7 using Regex Named Capturing Groups

One of my favorite features in Java 1.7, aside from the try-with-resources statement, is named capturing groups in the regular expression API.

Although captured groups can be referenced numerically in the order in which they are declared from left to right, named capturing groups make the code more intuitive, as I will demonstrate by constructing a lexer.

Lexer, a Definition

To describe lexers, we must first describe tokenizers. A tokenizer simply breaks a string up into a sequence of tokens, which are, of course, more strings. A lexer is a tokenizer that attaches context to each token, namely the type of token extracted: (NUMBER 1234), whereas the bare token would be just 1234. Lexers are important for parsing languages; however, parsing is a discussion beyond the scope of this tutorial.

For example, given the input "11 + 22 - 33", we should receive the following tokens from a lexer:

(NUMBER 11)
(BINARYOP +)
(NUMBER 22)
(BINARYOP -)
(NUMBER 33)

Note that BINARYOP stands for binary operator. Binary operators include any operator that accepts two arguments. The archetypal example is addition, which accepts two numbers, one to the left and the other to the right of the operator.

Setting Up the Program

The input is the sentence to be scanned. For this tutorial, we will scan a simple arithmetic grammar that includes addition, subtraction and multiplication. Consequently, we will lex the input "11 + 22 - 33" from the example above.

Enumerations in Java can only have private constructors because the set of instances is fixed when the class is loaded; no new ones are created at run-time. Consequently, their data fields are frequently declared final.
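Putting that into practice, the token types might be declared along these lines. The NUMBER and BINARYOP patterns follow the grammar above; the WHITESPACE type is an assumption of this sketch, used later to skip the spaces between tokens:

```java
// Sketch of the token-type enum. WHITESPACE is an assumption used
// so the lexer can skip the spaces between tokens.
enum TokenType {
    NUMBER("-?\\d+"),       // integer literals, optionally negative
    BINARYOP("[+\\-*/]"),   // the binary operators +, -, * and /
    WHITESPACE("\\s+");     // runs of spaces, tabs or newlines

    // Final, as discussed: enum constants are fixed once the class loads.
    public final String pattern;

    // Enum constructors are implicitly private.
    private TokenType(String pattern) {
        this.pattern = pattern;
    }
}
```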

Finally, we declare a data structure for holding the token data. I will also override the toString() method so that, at the end of this tutorial, each token prints its contextual data in the format mentioned earlier: (<TYPE> <DATA>).
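A minimal sketch of that data structure, with an abbreviated copy of the token-type enum so it stands on its own:

```java
// Abbreviated token-type enum so this sketch is self-contained.
enum TokenType { NUMBER, BINARYOP }

// Holds one token: its type plus the matched text.
class Token {
    public TokenType type;
    public String data;

    public Token(TokenType type, String data) {
        this.type = type;
        this.data = data;
    }

    // Prints in the (<TYPE> <DATA>) format used throughout this tutorial.
    @Override
    public String toString() {
        return String.format("(%s %s)", type.name(), data);
    }
}
```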

Now that we have our input, token types and data structure for tokens, we may begin lexical analysis of the input string into a list of tokens, each with its corresponding token type.

Lexical Analysis with Regular Expressions

We begin by framing our lexical analysis method as lex(), a function that returns a list of Token objects. Additionally, we will need to import ArrayList in order to store the Token objects in the list.
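As a rough skeleton, with minimal stand-ins for the types described above so it compiles on its own (the body is filled in over the next steps):

```java
import java.util.ArrayList;

// Minimal stand-ins so this skeleton is self-contained.
enum TokenType { NUMBER, BINARYOP }
class Token {
    TokenType type; String data;
    Token(TokenType type, String data) { this.type = type; this.data = data; }
}

class Lexer {
    // Skeleton of lex(): scans the input and returns the tokens found.
    public static ArrayList<Token> lex(String input) {
        ArrayList<Token> tokens = new ArrayList<Token>();
        // ... pattern building and matching go here ...
        return tokens;
    }
}
```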

Now, we need to combine the regular expression patterns for all of the token types into a single pattern, as in the algorithm shown below. This is where we use named capturing groups, writing each alternative as (?<TYPE>PATTERN), so that once a pattern is matched, we can retrieve the token by calling its group name, the TYPE. Note that Java allows no whitespace between the group name and its pattern, and group names must be strictly alphanumeric.
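One way to sketch this step, assuming each TokenType constant exposes its pattern in a public field as in the earlier sketch:

```java
import java.util.regex.Pattern;

// Self-contained copy of the token-type enum from earlier.
enum TokenType {
    NUMBER("-?\\d+"), BINARYOP("[+\\-*/]"), WHITESPACE("\\s+");
    public final String pattern;
    private TokenType(String pattern) { this.pattern = pattern; }
}

class PatternDemo {
    public static void main(String[] args) {
        // Build "(?<NUMBER>...)|(?<BINARYOP>...)|(?<WHITESPACE>...)":
        // each alternative is a named group, named after its enum constant.
        StringBuilder buffer = new StringBuilder();
        for (TokenType tokenType : TokenType.values()) {
            buffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
        }
        // substring(1) drops the leading '|' before compiling.
        Pattern tokenPatterns = Pattern.compile(buffer.substring(1));
        System.out.println(tokenPatterns.pattern());
    }
}
```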

Next, we begin tokenizing by creating a Matcher object from the compiled pattern, tokenPatterns, from earlier. Each match the matcher finds corresponds to exactly one of the token-type patterns. Note that we must also import the Matcher class here.

For each match, we will iterate through the list of token types and ask whether that type's group matched. If it did, we add a token with the corresponding token type to our list and continue scanning the input.
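Combining the last two steps, the body of lex() might read as follows. Names here follow the earlier sketches, and the WHITESPACE branch is an assumption used to skip spaces rather than emit tokens:

```java
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

enum TokenType {
    NUMBER("-?\\d+"), BINARYOP("[+\\-*/]"), WHITESPACE("\\s+");
    public final String pattern;
    private TokenType(String pattern) { this.pattern = pattern; }
}

class Token {
    TokenType type;
    String data;
    Token(TokenType type, String data) { this.type = type; this.data = data; }
    @Override public String toString() { return String.format("(%s %s)", type.name(), data); }
}

class Lexer {
    public static ArrayList<Token> lex(String input) {
        ArrayList<Token> tokens = new ArrayList<Token>();

        // Combine every token pattern into one regex of named groups.
        StringBuilder buffer = new StringBuilder();
        for (TokenType tokenType : TokenType.values()) {
            buffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
        }
        Pattern tokenPatterns = Pattern.compile(buffer.substring(1));

        // Walk the input; each find() matches exactly one token type.
        Matcher matcher = tokenPatterns.matcher(input);
        while (matcher.find()) {
            for (TokenType tokenType : TokenType.values()) {
                String match = matcher.group(tokenType.name());
                if (match == null) continue;            // this group did not match
                if (tokenType != TokenType.WHITESPACE)  // skip whitespace (assumption)
                    tokens.add(new Token(tokenType, match));
                break;  // only one named group matches per find()
            }
        }
        return tokens;
    }
}
```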

And the algorithm is complete! The magic of named capturing groups happens as we match each of the token types. Note that instead of referencing groups by numerical index, such as matcher.group(1), we use the actual name, which is far more intuitive and much easier to maintain.

Running the Algorithm

For completeness, when we run the program, we should receive the following output.

(NUMBER 11)
(BINARYOP +)
(NUMBER 22)
(BINARYOP -)
(NUMBER 33)
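For reference, here is the whole sketch gathered into one self-contained listing; running its main should print the five tokens above:

```java
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

enum TokenType {
    NUMBER("-?\\d+"), BINARYOP("[+\\-*/]"), WHITESPACE("\\s+");
    public final String pattern;
    private TokenType(String pattern) { this.pattern = pattern; }
}

class Token {
    TokenType type;
    String data;
    Token(TokenType type, String data) { this.type = type; this.data = data; }
    @Override public String toString() { return String.format("(%s %s)", type.name(), data); }
}

class Lexer {
    public static ArrayList<Token> lex(String input) {
        ArrayList<Token> tokens = new ArrayList<Token>();
        // One regex of named groups, one group per token type.
        StringBuilder buffer = new StringBuilder();
        for (TokenType tokenType : TokenType.values())
            buffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
        Matcher matcher = Pattern.compile(buffer.substring(1)).matcher(input);
        while (matcher.find()) {
            for (TokenType tokenType : TokenType.values()) {
                String match = matcher.group(tokenType.name());
                if (match == null) continue;
                if (tokenType != TokenType.WHITESPACE)  // skip spaces (assumption)
                    tokens.add(new Token(tokenType, match));
                break;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token token : lex("11 + 22 - 33"))
            System.out.println(token);
    }
}
```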

Conclusion

Although lexical analysis is doable without them, named capturing groups in regular expressions certainly make the code more intuitive and easier to maintain. It is also nice that Java is beginning to provide features that act as syntactic sugar.