F Sharp Programming/Lexing and Parsing

Lexing and parsing is a very handy way to convert source code (or other human-readable input which has a well-defined syntax) into an abstract syntax tree (AST) which represents that input. F# comes with two tools, FsLex and FsYacc, which are used to convert input into an AST.

FsLex and FsYacc have more or less the same specification as OCamlLex and OCamlYacc, which in turn are based on the Lex and Yacc family of lexer/parser generators. Virtually all material concerned with OCamlLex/OCamlYacc transfers seamlessly over to FsLex/FsYacc. With that in mind, SooHyoung Oh's OCamlYacc tutorial and companion OCamlLex Tutorial are the best online resources for learning how to use the lexing and parsing tools which come with F# (and OCaml, for that matter!).

A lexer uses regular expressions to convert each syntactical element from the input into a token, essentially mapping the input to a stream of tokens.

A parser reads in a stream of tokens and attempts to match tokens to a set of rules, where the end result maps the token stream to an abstract syntax tree.
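For example, given the tokens and AST we define later in this tutorial, the two stages look roughly like this (the record shape shown here is an assumption, previewed from the Sql module sketch below):

    // Input text:
    //   SELECT x, y FROM t1
    //
    // After lexing: a flat stream of tokens
    //   SELECT; ID "x"; COMMA; ID "y"; FROM; ID "t1"; EOF
    //
    // After parsing: a structured value (the AST)
    //   { Table = "t1"; Columns = ["x"; "y"]; Joins = []; Where = None; OrderBy = [] }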

It is certainly possible to write a lexer which generates the abstract syntax tree directly, but this only works for the most simplistic grammars. If a grammar contains balanced parentheses or other recursive constructs, optional tokens, repeating groups of tokens, operator precedence, or anything which can't be captured by regular expressions, then it is easiest to write a parser in addition to a lexer.

With F#, it is possible to create custom file formats, domain-specific languages, and even full-blown compilers for your new language.

The following code will demonstrate step-by-step how to define a simple lexer/parser for a subset of SQL. If you're using Visual Studio, you should add a reference to FSharp.PowerPack.dll to your project. If you're compiling on the command line, use the -r flag to reference the aforementioned F# PowerPack assembly.
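The parser we build below opens an Sql module and ultimately produces an Sql.sqlStatement record describing a select statement. The exact types are up to us; here is one minimal sketch (all names in it are our own choice, not prescribed by FsLex/FsYacc):

    // Sql.fs: AST types for our SQL subset (a minimal sketch)
    module Sql

    // Literal values that can appear in a where clause
    type value =
        | Int of int
        | Float of float
        | String of string

    // Comparison operators: =, <, <=, >, >=
    type op = Eq | Lt | Le | Gt | Ge

    type order = Asc | Desc

    // Where clauses combine comparisons with AND/OR
    type where =
        | Cond of (value * op * value)
        | And of where * where
        | Or of where * where

    type join = Inner | Left | Right

    // The top-level node the parser returns
    type sqlStatement =
        { Table : string
          Columns : string list
          Joins : (string * join * where option) list
          Where : where option
          OrderBy : (string * order) list }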

We also have non-keyword identifiers composed of strings and numeric literals, which we'll represent using the tokens ID, INT, and FLOAT.

Finally, there is one more token, EOF, which indicates the end of our input stream.

Now we can create a basic parser file for FsYacc; name the file SqlParser.fsp:

    %{
    open Sql
    %}

    %token <string> ID
    %token <int> INT
    %token <float> FLOAT
    %token AND OR
    %token COMMA
    %token EQ LT LE GT GE
    %token JOIN INNER LEFT RIGHT ON
    %token SELECT FROM WHERE ORDER BY
    %token ASC DESC
    %token EOF

    // start
    %start start
    %type <string> start

    %%

    start:
        |    { "Nothing to see here" }

    %%

This is boilerplate code with the section for tokens filled in.

Compile the parser using the following command line: fsyacc SqlParser.fsp

If you're using Visual Studio, you can automatically generate your parser code on each compile. Right-click your project, choose "Properties", navigate to the Build Events tab, and add the following to the 'Pre-build event command line': fsyacc "$(ProjectDir)SqlParser.fsp". Also remember to exclude the .fsp file from the build process: right-click the file, choose "Properties", and set "Build Action" to "None".

If everything works, FsYacc will generate two files, SqlParser.fsi and SqlParser.fs. You'll need to add these files to your project if they don't already exist. If you open the SqlParser.fsi file, you'll notice the tokens you defined in your .fsp file have been converted into a union type.
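The exact layout depends on your FsYacc version, but the generated type will look something like this:

    // Excerpt from the generated SqlParser.fsi (approximate)
    type token =
        | EOF
        | ASC | DESC
        | SELECT | FROM | WHERE | ORDER | BY
        | JOIN | INNER | LEFT | RIGHT | ON
        | EQ | LT | LE | GT | GE
        | COMMA
        | AND | OR
        | FLOAT of float
        | INT of int
        | ID of string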

Next, create the lexer file, SqlLexer.fsl, starting with the skeleton below. This is not "real" F# code, but rather a special language used by FsLex.
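Here is one plausible skeleton to start from (a sketch; the macro names are our own):

    // SqlLexer.fsl: lexer skeleton (a minimal sketch)
    {
    open System
    open SqlParser
    open Lexing
    }

    // Regular expression macros
    let char        = ['a'-'z' 'A'-'Z']
    let digit       = ['0'-'9']
    let int         = '-'?digit+
    let float       = '-'?digit+ '.' digit+
    let whitespace  = [' ' '\t']
    let newline     = "\n\r" | '\n' | '\r'

    rule tokenize = parse
    | whitespace    { tokenize lexbuf }    // skip whitespace
    | newline       { tokenize lexbuf }    // skip line breaks
    | eof           { EOF }                // end of input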

The let bindings at the top of the file define regular expression macros. eof is a special marker used to identify the end of the input buffer.

rule ... = parse ... defines our lexing function, called tokenize above. Our lexing function consists of a series of rules, each of which has two pieces: 1) a regular expression, and 2) an expression to evaluate if the regex matches, such as returning a token. Text is read from the input one character at a time until it matches a regular expression, at which point the corresponding expression is evaluated.

We can fill in the remainder of our lexer by adding more matching expressions:
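One way to flesh it out is the sketch below; the keyword and operator maps in the header are explained in the next paragraph:

    // SqlLexer.fsl: a fuller sketch
    {
    open System
    open SqlParser
    open Lexing

    // Map keyword strings to parser tokens
    let keywords =
        [ "SELECT", SELECT; "FROM", FROM; "WHERE", WHERE;
          "ORDER", ORDER;   "BY", BY;
          "JOIN", JOIN;     "INNER", INNER; "LEFT", LEFT;
          "RIGHT", RIGHT;   "ON", ON;
          "ASC", ASC;       "DESC", DESC;
          "AND", AND;       "OR", OR ]
        |> Map.ofList

    // Map operator strings to parser tokens
    let ops =
        [ "=", EQ; "<", LT; "<=", LE; ">", GT; ">=", GE ]
        |> Map.ofList
    }

    let char        = ['a'-'z' 'A'-'Z']
    let digit       = ['0'-'9']
    let int         = '-'?digit+
    let float       = '-'?digit+ '.' digit+
    let identifier  = char(char|digit|['-' '_' '.'])*
    let whitespace  = [' ' '\t']
    let newline     = "\n\r" | '\n' | '\r'
    let operator    = "=" | "<" | "<=" | ">" | ">="

    rule tokenize = parse
    | whitespace    { tokenize lexbuf }
    | newline       { tokenize lexbuf }
    | int           { INT(Int32.Parse(lexeme lexbuf)) }
    | float         { FLOAT(Double.Parse(lexeme lexbuf)) }
    | operator      { ops.[lexeme lexbuf] }
    | identifier    { match keywords.TryFind(lexeme lexbuf) with
                      | Some(token) -> token
                      | None -> ID(lexeme lexbuf) }
    | ','           { COMMA }
    | eof           { EOF }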

Notice the code between the {'s and }'s consists of plain old F# code. Also notice we are returning the same tokens (INT, FLOAT, COMMA, and EOF) that we defined in SqlParser.fsp. As you can probably infer, the code lexeme lexbuf returns the string our lexer matched. The tokenize function will be converted into a function with a return type of SqlParser.token.
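In other words, the generated SqlLexer.fs exposes approximately this signature (LexBuffer<char> assumes the --unicode flag discussed in the notes at the end of this page):

    val tokenize : Lexing.LexBuffer<char> -> SqlParser.token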

Notice we've created a few maps, one for keywords and one for operators. While we certainly could define these as rules in our lexer, it's generally recommended to keep the number of rules small to avoid a "state explosion".

To compile this lexer, execute the following on the command line: fslex SqlLexer.fsl. (Try adding this command to your project's Build Events as well.) Then add the file SqlLexer.fs to the project. We can now experiment with the lexer on some sample input:
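For example, in F# Interactive (after referencing FSharp.PowerPack.dll and loading the generated files), a loop like the following should print the token stream; the LexBuffer<char>.FromString call assumes the lexer was compiled with --unicode:

    let lexbuf =
        Lexing.LexBuffer<char>.FromString
            "SELECT x, y FROM t1 WHERE x = 50 ORDER BY y DESC"

    // Pull tokens one at a time until we reach EOF
    let rec printTokens () =
        match SqlLexer.tokenize lexbuf with
        | SqlParser.EOF -> printfn "EOF"
        | token ->
            printfn "%A" token
            printTokens ()

    printTokens ()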

Let's examine the start rule, sketched below. In it you can see a list of tokens which gives a rough outline of a select statement, plus F# code contained between {'s and }'s which will be executed when the rule successfully matches; in this case, it returns an instance of the Sql.sqlStatement record.
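A sketch of that rule, using the field names from our hypothetical Sql.sqlStatement record above:

    start: SELECT columnList
           FROM ID
           joinList
           whereClause
           orderByClause
           EOF
           {
               { Table = $4;
                 Columns = $2;
                 Joins = $5;
                 Where = $6;
                 OrderBy = $7 }
           }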

The F# code contains "$1", "$2", "$3", etc., which vaguely resembles regex replacement syntax. Each "$#" corresponds to the position (starting at 1) of the token or rule in our matching rule. The indexes become obvious when they're annotated as follows:
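Annotated with positions, the same sketch reads:

    start: SELECT columnList    // $1 = SELECT, $2 = columnList
           FROM ID              // $3 = FROM,   $4 = ID
           joinList             // $5 = joinList
           whereClause          // $6 = whereClause
           orderByClause        // $7 = orderByClause
           EOF                  // $8 = EOF
           { ... }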

So, the start rule breaks our tokens into a basic shape, which we then use to map to our sqlStatement record. You're probably wondering where columnList, joinList, whereClause, and orderByClause come from: these are not tokens, but rather additional parse rules which we'll have to define. Let's start with the first rule:

    columnList:
        | ID                     { [$1] }
        | ID COMMA columnList    { $1 :: $3 }

columnList matches text in the style of "a, b, c, ..., z" and returns the results as a list. Notice this rule is defined recursively (also notice the order of the rules is not significant). FsYacc's matching algorithm is "greedy", meaning it will try to match as many tokens as possible. When FsYacc receives an ID token, it matches the first rule, but it also matches part of the second rule. FsYacc then performs a one-token lookahead: if the next token is a COMMA, it will attempt to match additional tokens until the full rule can be satisfied.

Note: The definition of columnList above is not tail recursive, so it may throw a stack overflow exception for exceptionally large inputs. A tail recursive version of this rule can be defined as follows:
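In yacc-style parsers, left-recursive rules consume constant stack space, so the usual fix is to accumulate the list in reverse with left recursion and reverse it once at the end. One sketch (columnListRev is our own helper name):

    columnList:
        | columnListRev               { List.rev $1 }

    // Left-recursive accumulator: builds the list back-to-front
    columnListRev:
        | ID                          { [$1] }
        | columnListRev COMMA ID      { $3 :: $1 }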

joinList is defined in terms of several rules. This is because there are repeating groups of tokens (such as multiple tables being joined) and optional tokens (the optional "ON" clause). You've already seen that we handle repeating groups of tokens with recursive rules. To handle optional tokens, we simply break the optional syntax into a separate rule and add an empty production to represent zero tokens, as sketched below.
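A sketch of those rules, assuming the join and where types from our hypothetical Sql module above (conditionList, which parses comparison expressions, is a further rule along the same lines):

    joinList:
        |                                 { [] }           // no joins at all
        | joinClause                      { [$1] }
        | joinClause joinList             { $1 :: $2 }

    joinClause:
        | INNER JOIN ID joinOnClause      { $3, Inner, $4 }
        | LEFT JOIN ID joinOnClause       { $3, Left, $4 }
        | RIGHT JOIN ID joinOnClause      { $3, Right, $4 }

    // The ON clause is optional: the empty production matches zero tokens
    joinOnClause:
        |                                 { None }
        | ON conditionList                { Some($2) }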

Altogether, our minimal SQL lexer/parser is about 150 lines of code (including whitespace). I'll leave it as an exercise for the reader to implement the remainder of the SQL language spec.

2011-03-06: I tried the above instructions with VS2010 and F# 2.0 and PowerPack 2.0. I had to make a few changes:

Add "module SqlLexer" on the 2nd line of SqlLexer.fsl

Change Map.of_list to Map.ofList

Add " --module SqlParser" to the command line of fsyacc

Add FSharp.PowerPack to get Lexing module

2011-07-06: (Sjuul Janssen) These were the steps I had to take in order to make this work.

If you get the message "Expecting a LexBuffer<char> but given a LexBuffer<byte>. The type 'char' does not match the type 'byte'", it means the lexer was generated for byte input: compile the lexer with fslex --unicode SqlLexer.fsl (in Visual Studio, add --unicode to the fslex command line) and construct your lex buffer with Lexing.LexBuffer<char>.FromString.