Building a custom lexer

As a software engineer, I spend all day (hopefully) writing code. I love code, and I love that there are languages that help me solve problems and build solutions. But as an engineer I always want to know more about the tools I work with, so I recently picked up “Language Implementation Patterns” by Terence Parr and decided I was going to learn how to build a language. After reading through most of the book and working on examples for about 5 weeks, I ended up building an interpreted toy general purpose language with features like:

type inference

partial functions

static typing

classes

first class methods

The language I wrote is by no means production level. It’s a toy, and it can do toy things. It can’t do IO, it can’t talk to the OS, and it doesn’t do GC, but it can evaluate expressions, call methods, handle basic exceptions, do some dynamic typing, and let you do simple work.

Now that I’m done (but never finished) with the project, I really appreciate how much goes into making a working language. The whole project increased my respect tenfold for people like Don Syme, Tao Liu, Anders Hejlsberg, Bjarne Stroustrup, Guido van Rossum, and countless others who have made REAL languages. With the project wrapped up, I wanted to share how parts of my language were done.

Tokenizing

The first step to implementing a domain specific language, or even a general purpose language, is to figure out what the string representing the program means. You have to take the raw text of a program and work out what each piece of it represents.
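For example, a program in the language might look something like this (an illustrative fragment built from my grammar’s keywords; the exact syntax isn’t the point here):

```
var foo = 1;

fun add(int x, int y) -> int {
    return x + y;
}

print add(foo, 2);
```

The job of the tokenizer and lexer is to turn that text into a stream of typed tokens, something like Infer, Word, Equals, Number, SemiColon, and so on.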

To do this you first need to break up the program into a series of strings that you know are independent tokens, such as “int” and “foo”. This is called tokenizing. Then you need to go over that list and say “oh hey, I found this token ‘int’ but really it means TokenType.Int“. Converting the strings to strong types is called lexing. Translating the strings into strong types makes them significantly easier to work with later.

While some grammars can be tokenized by breaking up on whitespace, for my language I found it easier to tokenize character by character. This means I don’t have to worry about whitespace, special characters, or any other kind of weird delimiter. Granted, it’s more resource intensive, but for a toy language that’s OK. There are some great libraries out there (such as ANTLR) that will let you define a grammar for your language and will generate lexers, tokenizers, parsers, tree walkers, etc. for you, but it can be fun doing this by hand the first time.

To tokenize my grammar I created a base class called TokenizableStreamBase that takes a function to generate a list of items and gives you some functionality to work on that stream. The base class lets me:

Consume an item. This means I’ve processed whatever is at the current index and should move to the next item.

Snapshot the stream. I can keep track of where I am, and if something goes wrong later I can roll back to this point in the stream.

Commit the stream. If I made a snapshot, just discard it and continue from where I am now.
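Here’s a minimal sketch of what that base class can look like. The member names are my reconstruction of the idea described above, not necessarily the exact original code:

```csharp
using System;
using System.Collections.Generic;

public class TokenizableStreamBase<T>
{
    private readonly List<T> _items;
    private readonly Stack<int> _snapshots = new Stack<int>();
    private int _index;

    // The extractor function generates the list of items to stream over
    public TokenizableStreamBase(Func<List<T>> extractor)
    {
        _items = extractor();
        _index = 0;
    }

    public T Current => End() ? default(T) : _items[_index];

    public bool End() => _index >= _items.Count;

    // We've processed whatever is at the current index; move to the next item
    public void Consume() => _index++;

    // Remember where we are so we can roll back to this point later
    public void TakeSnapshot() => _snapshots.Push(_index);

    // Something didn't pan out; return to the last snapshot
    public void RollbackSnapshot() => _index = _snapshots.Pop();

    // The speculative work succeeded; discard the snapshot and keep going
    public void CommitSnapshot() => _snapshots.Pop();
}
```

A character tokenizer is then just a subclass whose extractor breaks the source string into items, and (as described later) the parser reuses the same base over a list of tokens instead.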

Let me give an example of why you want snapshots. Pretend you are trying to match the word “string”, but the current tokenizer has “strint”. Going character by character you would successfully match “strin” right up until the last character. At that point you can tell it’s not “string”, so you can roll the stream back to the start of the word and try something else.

In a later post I’ll describe how I built my parser, which is responsible for creating abstract syntax trees and logical validation of the code. The parser reuses the same tokenizer stream base, but works over a stream of tokens instead of characters.

Lexing

A tokenizer is pretty useless, though, without a way to create meaningful tokens out of that source. To do that you need a way to match the characters against an expected literal. In my language I’ve created the concept of a matcher that can emit an object of type Token. Each matcher takes the current tokenizer, consumes characters until it finds what it wants, and, if it finds its match, returns the strongly typed token.

So, let’s say I want to match the word “int” to the token Int. I’ll have a matcher that knows it wants the string “int”; it will advance through the tokenizer while characters keep matching, and bail if it encounters a character it didn’t expect. Being able to roll back to a snapshot is called backtracking. Some grammars are unambiguous enough to never need this. For example, if words that start with “i” are ALWAYS supposed to be “int”, then you don’t need to backtrack: if you fail to match, the syntax is invalid.

Each matcher takes a Tokenizer and hands it to whatever subclasses the abstract base for the actual match implementation. The base class makes sure to handle snapshots: if the matcher returns a non-null token it commits the snapshot; otherwise it rolls the stream back and lets the next matcher try.
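In code, that snapshot-handling base can look roughly like this. To keep the sketch self-contained I’ve included minimal stand-ins for the Token, TokenType, Tokenizer, and IMatcher types described in this post; treat them as assumptions, not the original implementations:

```csharp
using System.Collections.Generic;

// Minimal stand-ins for types described elsewhere in this post (assumptions)
public enum TokenType { EOF, Word, WhiteSpace }

public class Token
{
    public TokenType TokenType { get; private set; }
    public string Lexeme { get; private set; }

    public Token(TokenType tokenType, string lexeme = null)
    {
        TokenType = tokenType;
        Lexeme = lexeme;
    }
}

// A character stream built on the same consume/snapshot/commit idea as the stream base
public class Tokenizer
{
    private readonly string _source;
    private readonly Stack<int> _snapshots = new Stack<int>();
    private int _index;

    public Tokenizer(string source) { _source = source; }

    public bool End() => _index >= _source.Length;
    public char Current => _source[_index];
    public void Consume() => _index++;
    public void TakeSnapshot() => _snapshots.Push(_index);
    public void RollbackSnapshot() => _index = _snapshots.Pop();
    public void CommitSnapshot() => _snapshots.Pop();
}

public interface IMatcher
{
    Token IsMatch(Tokenizer tokenizer);
}

public abstract class MatcherBase : IMatcher
{
    public Token IsMatch(Tokenizer tokenizer)
    {
        if (tokenizer.End())
        {
            return new Token(TokenType.EOF);
        }

        // Mark where we are before trying anything speculative
        tokenizer.TakeSnapshot();

        var match = IsMatchImpl(tokenizer);

        if (match == null)
        {
            // No match: rewind so the next matcher gets an untouched stream
            tokenizer.RollbackSnapshot();
        }
        else
        {
            // Match: discard the snapshot and keep the consumed characters
            tokenizer.CommitSnapshot();
        }

        return match;
    }

    // Subclasses implement the actual matching logic here
    protected abstract Token IsMatchImpl(Tokenizer tokenizer);
}
```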

As an example, here is the matcher for whitespace tokens. For whitespace I don’t really care how much whitespace there was, just that there was whitespace. In fact, in the end I discard whitespace tokens completely since they don’t matter. If the matcher finds whitespace it returns a new token of type WhiteSpace; otherwise it returns null.
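A sketch of that whitespace matcher, assuming the MatcherBase, Token, TokenType, and Tokenizer types shown in this post (the details are reconstructed, not verbatim):

```csharp
public class MatchWhiteSpace : MatcherBase
{
    protected override Token IsMatchImpl(Tokenizer tokenizer)
    {
        bool foundWhiteSpace = false;

        // Eat every consecutive whitespace character; we only care that some existed
        while (!tokenizer.End() && char.IsWhiteSpace(tokenizer.Current))
        {
            foundWhiteSpace = true;
            tokenizer.Consume();
        }

        return foundWhiteSpace ? new Token(TokenType.WhiteSpace) : null;
    }
}
```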

The next matcher (below) does the bulk of the work. It finds built-in keywords and special characters by taking an input string and the token type it should represent, determining whether the current stream contains that string, and then emitting the token. The idea is that if you know “int” is a built-in type that should map to TokenType.Int, you can pass that info to a MatchKeyword instance and it will find the token for you if it exists. The Match property contains the raw string you want to match on, and the TokenType property is the strongly typed value paired with that raw string:
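This is a simplified reconstruction of the idea, assuming the MatcherBase, Token, TokenType, and Tokenizer types shown in this post; the delimiter handling in the real class is more involved:

```csharp
using System.Collections.Generic;
using System.Linq;

public class MatchKeyword : MatcherBase
{
    public string Match { get; set; }          // the raw string to match on, e.g. "int"
    public TokenType TokenType { get; set; }   // the strong type paired with it
    public bool AllowAsSubString { get; set; } // may the match sit inside a longer word?
    public List<MatchKeyword> SpecialCharacters { get; set; }

    public MatchKeyword(TokenType type, string match)
    {
        Match = match;
        TokenType = type;
        AllowAsSubString = true;
    }

    protected override Token IsMatchImpl(Tokenizer tokenizer)
    {
        // Walk the expected string character by character, bailing on the first mismatch
        foreach (var character in Match)
        {
            if (tokenizer.End() || tokenizer.Current != character)
            {
                return null;
            }

            tokenizer.Consume();
        }

        // For keywords like "fun" we must not match the start of "function": the next
        // character has to be whitespace, a known special character, or the end of input
        bool delimiterFollows =
            tokenizer.End()
            || char.IsWhiteSpace(tokenizer.Current)
            || (SpecialCharacters != null
                && SpecialCharacters.Any(m => m.Match.StartsWith(tokenizer.Current.ToString())));

        if (!AllowAsSubString && !delimiterFollows)
        {
            return null;
        }

        return new Token(TokenType, Match);
    }
}
```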

It’s a lot, but at the same time it’s not nearly enough! Still, having a single matcher that can match on known tokens means extending the language at this level is quite easy. Here is the construction of my list of matchers. The order here matters since it determines precedence. You can see that adding new keywords or characters to the grammar only requires defining a new enum value and updating the appropriate match list.

```csharp
private List<IMatcher> InitializeMatchList()
{
    // the order here matters because it defines token precedence
    var matchers = new List<IMatcher>(64);

    var keywordmatchers = new List<IMatcher>
    {
        new MatchKeyword(TokenType.Void, "void"),
        new MatchKeyword(TokenType.Int, "int"),
        new MatchKeyword(TokenType.Fun, "fun"),
        new MatchKeyword(TokenType.If, "if"),
        new MatchKeyword(TokenType.Infer, "var"),
        new MatchKeyword(TokenType.Else, "else"),
        new MatchKeyword(TokenType.While, "while"),
        new MatchKeyword(TokenType.For, "for"),
        new MatchKeyword(TokenType.Return, "return"),
        new MatchKeyword(TokenType.Print, "print"),
        new MatchKeyword(TokenType.True, "true"),
        new MatchKeyword(TokenType.False, "false"),
        new MatchKeyword(TokenType.Boolean, "bool"),
        new MatchKeyword(TokenType.String, "string"),
        new MatchKeyword(TokenType.Method, "method"),
        new MatchKeyword(TokenType.Class, "class"),
        new MatchKeyword(TokenType.New, "new"),
        new MatchKeyword(TokenType.Nil, "nil")
    };

    var specialCharacters = new List<IMatcher>
    {
        new MatchKeyword(TokenType.DeRef, "->"),
        new MatchKeyword(TokenType.LBracket, "{"),
        new MatchKeyword(TokenType.RBracket, "}"),
        new MatchKeyword(TokenType.Plus, "+"),
        new MatchKeyword(TokenType.Minus, "-"),
        new MatchKeyword(TokenType.Equals, "="),
        new MatchKeyword(TokenType.HashTag, "#"),
        new MatchKeyword(TokenType.Comma, ","),
        new MatchKeyword(TokenType.OpenParenth, "("),
        new MatchKeyword(TokenType.CloseParenth, ")"),
        new MatchKeyword(TokenType.Asterix, "*"),
        new MatchKeyword(TokenType.Slash, "/"),
        new MatchKeyword(TokenType.Carat, "^"),
        new MatchKeyword(TokenType.Ampersand, "&"),
        new MatchKeyword(TokenType.GreaterThan, ">"),
        new MatchKeyword(TokenType.LessThan, "<"),
        new MatchKeyword(TokenType.Or, "||"),
        new MatchKeyword(TokenType.SemiColon, ";"),
        new MatchKeyword(TokenType.Dot, ".")
    };

    // give each keyword the list of possible delimiters and do not allow them to be
    // substrings of other words, i.e. token fun should not be found in string "function"
    keywordmatchers.ForEach(keyword =>
    {
        var current = (keyword as MatchKeyword);
        current.AllowAsSubString = false;
        current.SpecialCharacters = specialCharacters.Select(i => i as MatchKeyword).ToList();
    });

    matchers.Add(new MatchString(MatchString.QUOTE));
    matchers.Add(new MatchString(MatchString.TIC));
    matchers.AddRange(specialCharacters);
    matchers.AddRange(keywordmatchers);
    matchers.AddRange(new List<IMatcher>
    {
        new MatchWhiteSpace(),
        new MatchNumber(),
        new MatchWord(specialCharacters)
    });

    return matchers;
}
```

Testing

Let’s see it in action in a unit test. Keep in mind the tokenizer and lexer only do the most basic syntax validation; the point is to create a typed token stream representing your code. Later we can use that typed token stream to create meaningful data structures representing the code.
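An NUnit-style test can look like the following. The Lexer class name, its Lex() method, and the Word and Number token types are my reconstruction of how the pieces fit together, not necessarily the exact original API:

```csharp
using System.Linq;
using NUnit.Framework;

[TestFixture]
public class LexerTests
{
    [Test]
    public void CanLexSimpleAssignment()
    {
        // run every matcher over the source until the stream is exhausted
        var tokens = new Lexer("int foo = 1;").Lex().ToList();

        // whitespace tokens are discarded, so five tokens remain
        Assert.AreEqual(5, tokens.Count);
        Assert.AreEqual(TokenType.Int, tokens[0].TokenType);
        Assert.AreEqual(TokenType.Word, tokens[1].TokenType);
        Assert.AreEqual(TokenType.Equals, tokens[2].TokenType);
        Assert.AreEqual(TokenType.Number, tokens[3].TokenType);
        Assert.AreEqual(TokenType.SemiColon, tokens[4].TokenType);
    }
}
```

Note how the keyword matcher’s substring rule matters here: “int” matches as a keyword only because it is followed by whitespace, while “foo” falls through to the word matcher.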
