As you can see, the tokens are defined as a series of characters. Each character in the definition actually represents both the upper case and the lower case version of that character, which makes the keywords case insensitive. Some tokens have multiple definitions, possibly using wildcards. These extra definitions exist for tokens that may be abbreviated to 4 characters, as is common in most xBase languages. The question marks indicate optional characters. We have separated the abbreviated definitions and put a logical condition (predicate) in front of them, so we can enable or disable these abbreviations at a central location.
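A minimal sketch of what such a token definition can look like in ANTLR notation (the rule and predicate names here are illustrative, not the actual X# grammar):

```antlr
// Fragment rules make each letter match both its upper and lower case form:
fragment F: 'F' | 'f';
fragment U: 'U' | 'u';
fragment N: 'N' | 'n';
fragment C: 'C' | 'c';
fragment T: 'T' | 't';
fragment I: 'I' | 'i';
fragment O: 'O' | 'o';

// Full keyword, plus a 4-character abbreviation guarded by a predicate.
// The question marks mark the optional trailing characters, and the
// hypothetical AllowFourLetterAbbreviations flag switches the
// abbreviated form on or off at one central location.
FUNCTION  : F U N C T I O N
          | {AllowFourLetterAbbreviations}? F U N C T? I? O? N?
          ;
```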

The STRING_CONST token shows that we can use wildcards and OR operators. A string constant in this definition is a double quote, followed by any number of characters that are not a double quote, CR or LF, and terminated with a double quote.
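In ANTLR notation, the definition described above looks roughly like this (a sketch, not a verbatim copy of the X# grammar):

```antlr
// A double quote, then any number of characters that are not
// a double quote, CR or LF, then a closing double quote:
STRING_CONST : '"' ( ~( '"' | '\r' | '\n' ) )* '"' ;
```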

Rule definitions for the Parser

The parser rule above defines the function rule. It specifies that a function consists of an optional list (*) of modifiers (STATIC, PUBLIC, INTERNAL etc.), followed by the FUNCTION keyword, an identifier, an optional (?) parameter list, an optional (?) result type, an optional (?) calling convention, and finally an end of statement (newline) followed by a statementBlock, which is a list of statements.
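The rule described above can be sketched in ANTLR notation as follows (the sub-rule names are illustrative assumptions, not the exact names used in the X# grammar):

```antlr
// A function: optional modifiers, the FUNCTION keyword, a name,
// then optional parameter list, return type and calling convention,
// an end-of-statement, and the body as a list of statements.
function : modifier* FUNCTION identifier
           parameterList?
           (AS datatype)?
           callingconvention?
           EOS
           statementBlock
         ;
```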

These definitions are often also shown as diagrams, such as the ones below:

All in all this can become quite complex. The current definition files of X# consist of a token definition of 550 lines and a rule definition of more than 700 lines, and we are not there yet.

From these definitions ANTLR will generate source code. This source code is quite easy to read. For example, the function rule above results in the following source code:

As you can see, TokenStream.La(1) looks one token ahead in the stream of tokens produced by the lexer.
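Conceptually, the generated parsing code uses that one-token lookahead to decide which alternative to take. A simplified pseudocode sketch of the pattern (the helper names are hypothetical, and the real generated code is considerably more involved):

```
// Parse the optional list of modifiers: keep consuming
// as long as the next token is a modifier keyword.
while (IsModifier(TokenStream.La(1)))
{
    ParseModifier();
}
Match(FUNCTION);                  // consume the FUNCTION keyword
ParseIdentifier();                // the function name
if (TokenStream.La(1) == LPAREN)  // optional parameter list?
{
    ParseParameterList();
}
```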

What's next

For the X# compiler the generated parse tree is an intermediate step in the compilation process. This parse tree is inspected using a (hand written) tree walker, which generates a tree structure that the Roslyn system understands. From there we can use the complete Roslyn infrastructure to read metadata, do symbol lookups, bind our tree to the imported types and finally generate IL code.

How this is done exactly is outside of the scope of this article.

I hope this has helped you to form an idea of how our language recognizer works.