LALR Parse Table Generation in C#

Introduction

Table-based parser generation offers the possibility of both fast and flexible parser construction. This article describes an implementation of one particular method of constructing a parse table for an LR (left-to-right, bottom-up) parser called an LALR, or Look-Ahead LR, parser.

Background

An LR parser consists of a parse table and a stack. A parser of this kind recognizes a string of input tokens by examining the input string from left to right and performing one of the following operations:

Shift the token onto the stack

Reduce several of the tokens on the stack, replacing the tokens that match the right-hand side of a grammatical production (which must be present on the stack) with the token on the left-hand side of that production

'Accept' the input string

Produce an error
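The shift/reduce loop above can be sketched as follows. This is a hand-written illustration, not the article's code: the three-state table below is hard-coded for a hypothetical toy grammar S' -> e, e -> i, whereas the article's code generates such tables automatically.

```csharp
using System;
using System.Collections.Generic;

// A minimal sketch of an LR driver loop: a stack of states plus a table
// that chooses shift, reduce, accept, or error at each step.
class LrDriverSketch
{
    static void Main()
    {
        var stack = new Stack<int>();
        stack.Push(0);                          // start state
        string input = "i$";                    // '$' marks end of input
        int pos = 0;
        bool done = false;
        while (!done)
        {
            int state = stack.Peek();
            char tok = input[pos];
            if (state == 0 && tok == 'i')       // Shift: consume token, push state
            {
                stack.Push(1);
                pos++;
            }
            else if (state == 1 && tok == '$')  // Reduce by e -> i
            {
                stack.Pop();                    // pop |right-hand side| = 1 state
                stack.Push(2);                  // Goto on nonterminal e from state 0
            }
            else if (state == 2 && tok == '$')  // Accept
            {
                Console.WriteLine("Accepted");
                done = true;
            }
            else                                // Error
            {
                Console.WriteLine("Error at position " + pos);
                done = true;
            }
        }
    }
}
```

Note that the reduce step does not consume input: the look-ahead token stays put and drives the next action after the goto.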

The LALR algorithm produces a parse table that decides on the possible reductions from a given state using the concept of look-ahead. The algorithm examines the productions that lead into and out of each state, via both transitions and grammatical productions, and determines, for each production represented in the state, which tokens could follow that production.

Using the code

The code I've published creates a parse table and writes a formatted version of it to the console output.

Parser class

The Parser class encapsulates the LALR table construction algorithm, but also exposes several methods and properties useful for grammar analysis and debugging.

Parser - Constructor

Analyses the grammar and produces the parse table, using the methods below along with other supporting methods.

GenerateLR0Items

Generates the LR(0) states of the parser, starting from the item S' -> .X, where X is the start symbol of the grammar. The production S' -> X must be explicitly passed into the constructor above.

The initial set of LR(0) items consists of the item S' -> .X together with the items contributed by the closure operation: for the symbol immediately after the '.' (X in this case), every production with that symbol on its left-hand side is added as an item with the '.' at the start of its right-hand side, repeating until no new items appear. In the grammar above, State 0 consists of the following set of items:

S' -> . e
e -> . e * e
e -> . e / e
e -> . e + e
e -> . e - e
e -> . ( e )
e -> . i

We then compute Goto(X) for each token X of the grammar to find the successor states of State 0. This is done by taking each item with X immediately to the right of the '.', advancing the '.' over X, putting the resulting items into a new state, and then calculating the closure of that new state. On the token e in the grammar presented, the new state, State 1, has the following LR(0) items:

S' -> e .
e -> e . * e
e -> e . / e
e -> e . + e
e -> e . - e

The method repeats this process until there are no new states.
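The closure and Goto operations just described can be sketched with a cut-down (production index, dot position) representation of items. The grammar literals below are illustrative, not the article's actual data structures:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Lr0Sketch
{
    // Productions: left-hand side and right-hand side symbols.
    static readonly (string Lhs, string[] Rhs)[] Prods =
    {
        ("S'", new[] { "e" }),           // 0: S' -> e
        ("e",  new[] { "e", "+", "e" }), // 1: e -> e + e
        ("e",  new[] { "i" }),           // 2: e -> i
    };

    // Closure: repeatedly add items X -> .rhs for every nonterminal X
    // that appears immediately after a dot, until nothing new appears.
    static HashSet<(int Prod, int Dot)> Closure(IEnumerable<(int Prod, int Dot)> seed)
    {
        var items = new HashSet<(int Prod, int Dot)>(seed);
        bool changed = true;
        while (changed)
        {
            changed = false;
            foreach (var (p, d) in items.ToList())
            {
                var rhs = Prods[p].Rhs;
                if (d >= rhs.Length) continue;          // dot at the end
                string next = rhs[d];
                for (int q = 0; q < Prods.Length; q++)
                    if (Prods[q].Lhs == next && items.Add((q, 0)))
                        changed = true;
            }
        }
        return items;
    }

    // Goto: advance the dot over symbol x, then take the closure.
    static HashSet<(int Prod, int Dot)> Goto(HashSet<(int Prod, int Dot)> state, string x)
    {
        var moved = state
            .Where(it => Prods[it.Prod].Rhs.Length > it.Dot
                      && Prods[it.Prod].Rhs[it.Dot] == x)
            .Select(it => (it.Prod, it.Dot + 1));
        return Closure(moved);
    }

    static void Main()
    {
        var state0 = Closure(new[] { (0, 0) }); // closure of { S' -> .e }
        Console.WriteLine("State 0 has " + state0.Count + " items");
        var state1 = Goto(state0, "e");
        Console.WriteLine("Goto(0, e) has " + state1.Count + " items");
    }
}
```

New states are produced by applying Goto to every symbol of every existing state until the set of states stops growing.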

ComputeFirstSets

This method computes, for each token of the grammar, the set of terminals that can appear first in a string derived from that token. The first set of the token e is {i, (}. This construction makes it possible to compute the LALR look-aheads later.
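The usual fixed-point computation of FIRST sets can be sketched as below. The grammar here (S' -> e; e -> e + e; e -> ( e ); e -> i) is illustrative, and the code ignores epsilon productions, which the full algorithm must also handle:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class FirstSetSketch
{
    static void Main()
    {
        var prods = new (string Lhs, string[] Rhs)[]
        {
            ("S'", new[] { "e" }),
            ("e",  new[] { "e", "+", "e" }),
            ("e",  new[] { "(", "e", ")" }),
            ("e",  new[] { "i" }),
        };
        var nonterminals = new HashSet<string>(prods.Select(p => p.Lhs));
        var first = nonterminals.ToDictionary(n => n, _ => new HashSet<string>());

        // Repeat until no FIRST set grows. With no epsilon productions,
        // only the leading symbol of each right-hand side contributes.
        bool changed = true;
        while (changed)
        {
            changed = false;
            foreach (var (lhs, rhs) in prods)
            {
                string head = rhs[0];
                var add = nonterminals.Contains(head)
                    ? first[head]                       // FIRST of a nonterminal
                    : new HashSet<string> { head };     // a terminal is its own FIRST
                foreach (var t in add.ToList())
                    if (first[lhs].Add(t)) changed = true;
            }
        }

        Console.WriteLine("FIRST(e) = { " +
            string.Join(", ", first["e"].OrderBy(t => t)) + " }");
    }
}
```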

CalculateLookAheads

Calculates the look-aheads for each LR(0) item from GenerateLR0Items above. A look-ahead at the end of a production in a state tells the parser that it is safe to perform a reduction. For example, State 2 above, on token ), is able to reduce by rule e -> i, thus replacing an i on the stack with the non-terminal e. The parser knows it can do this because LALR State 2 contains the LALR item e -> i . , ) for production e -> i and look-ahead token ).

GenerateParseTable

This method constructs the actions of the parse table. It does this by combining the Goto transitions from each state with the reductions made possible by the look-aheads generated in the previous method. If at any state there is a conflict between a shift (goto) and a reduction, or between two reductions, the algorithm attempts to resolve it by choosing the rule from the higher-precedence group. If there is no clear winner, the algorithm then checks whether the rule should produce left-most or right-most derivations: a left-most derivation favours the reduction, whereas a right-most derivation favours the shift. If there is still no clear winner, the algorithm reports a Reduce-Reduce or Shift-Reduce conflict error.
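The precedence-based resolution just described can be sketched as a small decision function. The names and numeric precedence levels here are hypothetical, not the article's API:

```csharp
using System;

class ConflictSketch
{
    enum Derivation { LeftMost, RightMost }

    // Resolve a shift/reduce clash between a shift on a token (with the
    // precedence of its group) and a reduction by a production.
    static string Resolve(int shiftPrecedence, int reducePrecedence,
                          Derivation derivation)
    {
        // A strictly higher precedence group wins outright.
        if (reducePrecedence > shiftPrecedence) return "reduce";
        if (shiftPrecedence > reducePrecedence) return "shift";
        // Equal precedence: a left-most derivation (left associativity)
        // favours the reduction; a right-most derivation favours the shift.
        return derivation == Derivation.LeftMost ? "reduce" : "shift";
    }

    static void Main()
    {
        // "e * e" on the stack with "+" ahead: '*' binds tighter, so reduce.
        Console.WriteLine(Resolve(1, 2, Derivation.LeftMost));
        // "e - e" on the stack with "-" ahead: equal precedence,
        // left-most derivation, so reduce (left associativity).
        Console.WriteLine(Resolve(1, 1, Derivation.LeftMost));
    }
}
```

A real implementation would report a conflict error instead of guessing when neither precedence nor associativity is specified for the rules involved.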

Debug class

The Debug class in the sample code contains several methods that might be useful for debugging a grammar or parser.

DumpParseTable

Writes a formatted parse table to the console.

DumpLR0State

Writes an LR(0) state to the console. For example, the following snippet will write State 0 above to the console.

Debug.DumpLR0State(parser, parser.LR0States[0]);

DumpLALRState

Writes an LALR state to the console, including look-aheads. The following snippet will write the LALR items in State 0 of the generated parser to the console.

Debug.DumpLR1State(parser, parser.LALRStates[0]);

References

The project code implements the LALR algorithm described in the dragon book "Compilers: Principles, Techniques, and Tools" (1986) by Aho, Sethi, and Ullman.

Feature backlog

I intend to implement the following features in later updates.

Generate a C# type that will parse an input grammar

Parse an input grammar from a file

ComponentModel/Reflection invocation of types/methods that perform the reduction rules of the grammar


About the Author

I've spent time programming for the security industry, video games, and telephony. I live in New Zealand, and have a Bachelor of Computing and Mathematical Sciences specializing in Computer Technology, from the University of Waikato in Hamilton, New Zealand.

Comments and Discussions

Tired of implementing LALR(1) parser generation over and over in different languages years ago, I found your code! The basics. Just what I needed to write a nice, simple incremental parser for a calculator. Thanks!

Hello Phillip, thanks for your great work on this algorithm. I like your remarkably concise implementation of it. To actually make use of it, I started to implement parsing of tokens, made some refactorings (some suggested by ReSharper), and made more use of structs (to conserve memory). I've added these on GitHub for easy sharing: CodeProject.Syntax.LALR on GitHub

Well, the popularity doesn't seem very surprising to me. I've dabbled with GOLD Parser before, but as that seems rather abandoned and is not open source, I was happy to find your implementation of the LALR algorithm. The next step would be hand-writing a grammar and scanner for a grammar description language and then bootstrapping with that. I mainly want to try out the new (Roslyn) code generation, so that a lexer and grammar description can be compiled into a .NET assembly. Nonetheless, I think your other articles are also quite a good read, with useful practical implementations. From your blog I see you also play around with Prolog. Writing an LALR parser for Prolog is quite challenging, since Prolog allows adding operators and such, so it doesn't have a fixed grammar per se. I'll take a look at your implementation and see if I can figure that out.

1. According to the algorithm specified for the LALR parser in the Compilers book by Ullman, the non-terminal columns of the parsing table should contain only integers such as 1, 2, 3, etc., rather than S1, S2, S3, etc., because the shift operation is not performed for non-terminals. So, please suggest a modification to the code so that the non-terminal columns of the parsing table contain only integers.

2. Will you please tell me the function you are using in the code to combine the states having the same productions?

Hi Ranjith, thanks for your question. 1) Nonterminals don't end up in the FOLLOW set of an LR(1) item because they are never listed in the input sequence. So if you can determine that you are looking at a nonterminal (i.e. it is not a token that will appear in the input string), you don't have to account for the possibility of a reduce action (which is provoked by an input token, and always performed before shifting that token). This means you can store the action for a nonterminal differently from that of a terminal. The authors of the compilers book choose to distinguish between the two, but you don't have to.

2) I think you mean combining the rules that have the same kernels. This is done in algorithm 4.11 'An easy but space-consuming LALR table construction'. I implemented algorithm 4.13 'Efficient computation of the kernels of the LALR(1) collection of sets of items'. But to answer your question you need to compute which production/position pairs belong in the 'kernel' of your state. The relevant method is ConvertLR0ItemsToKernels - bear in mind that the value nLR0Item represents an LR(0) item - with no lookahead, and not an LR(1) item.

Basically, when you're trying to compare the kernels of two states, you want to use only the very base rules, and not the rules derived from the closure of the items. You can trim a state back by discarding any rule/position item in which the read pointer (represented here as '.') is before the first token of the rule. For example, you would discard an item that looked like "Y -> . x y z" but not one that looked like "Y -> x . y z"; the only exception is that the start rule is allowed in a kernel.
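To illustrate, that trimming could be sketched like this (the item encoding is made up for this example; the actual code uses its own types):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class KernelSketch
{
    static void Main()
    {
        // Items as (production index, dot position); production 0 is the start rule.
        var state = new List<(int Prod, int Dot)>
        {
            (0, 0), // S' -> . e      (start rule: kept even with the dot at 0)
            (1, 1), // Y  -> x . y z  (kept: read pointer is past the first token)
            (2, 0), // Y  -> . x y z  (discarded: contributed by the closure)
        };

        // Keep only kernel items: dot past the start, or the start rule itself.
        var kernel = state.Where(it => it.Dot > 0 || it.Prod == 0).ToList();
        Console.WriteLine("Kernel size: " + kernel.Count);
    }
}
```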

I hope that's helpful.

If you're still having trouble, or you're planning to implement this yourself, the concept to get your head around is the closure operation in the LR0Closure and LR1Closure methods, the rest should follow.

On a side note, the reason that you'd want to use the more efficient algorithm is that LR(1) parse tables, while being able to represent more grammars, can have quite a few more states, and can take up a lot of storage.

According to the concept of the LALR parser, states producing the same productions in subsequent steps must be combined and inserted into the parsing table, so that the size of the table is reduced. So please help us with how the states having the same productions should be combined and inserted into the table in the code. Also, how can the intermediate steps be printed to the console using the code given? Please upload the LALR parser if implemented in C/C++.

The algorithm that I've implemented here is presented in the "Compilers: Principles, Techniques, and Tools" book by Aho, Sethi, and Ullman, pages 236 through 244. Basically there are two algorithms presented: one that generates the entire LR(1) table and then merges states (I think that's what you want), and another that generates the LR(0) states and then expands them by computing the look-aheads at each state, which is the algorithm I implemented. Basically you need to detect when the LR(1) items can be merged. The LR(1) items consist of a set of rule/position states augmented with the appropriate following terminals for each rule; two of them can be merged if they consist of the same core/kernel LR(0) states, and I'm given to understand that if the grammar is LALR there will be no Reduce/Reduce conflicts in the resulting table.

As I said, I haven't implemented the algorithm that you're looking for.

In addition to the information in the above reference, you might also be interested in the reference another commenter posted: the "Parsing Techniques" book (second edition), which describes the DeRemer-Penello algorithm in detail. The commenter also posted a link to their code, http://irony.codeplex.com/, which is also in C#.

I hope this helps answer your question, but you will need to do some research to answer it more fully.

Respected sir, I am using Microsoft Visual Studio 10 with Windows 8 as the operating system. I am having trouble running the downloaded LALR project source code, so please help me get rid of the problem. I am getting an error like this: "A project with an Output Type of Class Library cannot be started directly. In order to debug this project, add an executable project to this solution which references the library project. Set the executable project as the startup project." I am waiting for your reply. Please reply soon (within 2 days if possible).

Hi! Yeah, it sounds like you've made a class library, which compiles down to a *.dll file. These don't execute on their own; instead, you're going to want to put the code into a Console Application, which will compile down to a *.exe file.

As far as I recall, the code I shipped was a console project: the entry point is Main in Main.cs

Hi there, the source is in C#, and builds into a Console Application. I have a Mac, so I downloaded the Mono Framework and built it in MonoDevelop - both are free - but it should also work in Visual Studio; I believe that Microsoft makes their Express offering free.

The app I've posted is just a demonstration of parser table generation, and extra steps are required to actually parse an input string.

I am a learner; I tried Irony and GOLD and used them, but never could fully understand the underlying concept. Maintaining respect for both those projects, I would like to see more articles like this that explain the concept simply. My vote of 5.

In fact, in the 90s there was a whole battle of algorithms between several academia folks, arguing whose algorithm was better. Look at the Parsing Techniques book (second edition); they describe the DeRemer-Penello algorithm in detail. It is waaay more efficient than the one in the dragon book. Irony (irony.codeplex.com) implements this, with a few modifications.

hey, yeah I looked at your sample code. I like how the c# code for the BNF uses operator overloading so it looks like a grammar, that's sharp! I also spent a while just looking around your other code. It's very tidy. I think irony is a very cool project.

It already has implemented engines in most of the languages you're likely to use in the near future. Plus, it has a builder tool that lets you specify the input grammar as text, test it, and generate some skeleton code based on a provided template language.

Hey, yeah, thanks for the vote Dave. I noticed that GOLD Parser implements the LALR parser table algorithm, and also creates finite automata for the regular expressions of the lexical analyser's terminals, something that my code above doesn't do. Something special that my code above does do (and so does yacc/bison) is assign a precedence and associativity to a group of productions; this allows the programmer to select the action to perform on a shift/reduce conflict, depending on the desired behaviour, which is what the precedence groups are for. The GOLD parser system generates a reduction every time, which means you need to include extra productions if you want operator precedence.

The Yacc parser generator, which also implements this algorithm, accomplishes this by assigning a precedence and associativity to a particular nonterminal; if that nonterminal is included in a production, the production is given its precedence and associativity. My code above skips this step, and the programmer assigns the precedence and derivation directly to the productions in a given precedence group.

I like the use of EBNF as an input language in GOLD, I haven't decided what to do with the front end of mine, but I'd like to go for something that allows me to specify the rules above.

I wonder though, is the 'shortcut' in defining precedence and conflict resolution (when you otherwise have to simply add a transient production that can be trimmed from the tree before you even see it in 'user code' as per TrimReductions in the engine specification of GOLD Parser) algorithmically worth the trouble? Certainly there is nothing wrong with an academic exercise here, and by all means go for it if that's what you're after...

There has been much debate on the GOLD mailing list over the past couple of years about how it would be noted in the input language to keep things kosher, and it's pretty much been voted down as an ambiguous feature. In the dragon book, the examples use the method of constructing transient productions nested within one another: multexpr, addexpr, etc. I think that's the standard fare for LALRs.

I'm not trying to discourage you, but if that single feature is what you're shooting at, it might be as simple as learning to live with TrimReductions in GOLD engines and a few extra lines in your grammar.