Musings of a software developer

I’ve started to write an Antlr grammar for Bison, with the goal of automatically converting Bison grammars to Antlr, or to another parser generator for that matter. As it turns out, the “central dogma” of parsing (i.e., “you cannot use an LR grammar in an LL parser, and vice versa”) no longer holds for the unlimited-lookahead parsers available nowadays. The major issue will be handling static semantics. Many Bison grammars embed tree construction, as well as static semantic checks, in the grammar itself. All this needs to be ripped out and done in a clean manner.

I have a grammar for Bison that works pretty well on eleven different grammars of varying complexity. I am now looking into a GitHub scraper that searches for and collects all public Bison/Yacc grammars so I can place them in a test suite.

–Ken

Update Feb 24 2020: I have a GitHub crawler that is now downloading 32K Yacc grammars. I plan to test each of them with this Bison parser. Here is the script, more or less (the chunking issues are not yet resolved).

Having now implemented an LSP server and two different clients for Antlr, I can say I am more familiar with LSP. What can I say from these implementations? LSP works well for a single client talking to a single server, but I am wondering how to use it when the Antlr LSP server has to modify a workspace, reading and altering non-Antlr source code.

What I want is to add a “go to visitor method” feature for grammar symbols to my extension. By this I mean finding the C#/C++ method that corresponds to a non-terminal symbol (see this documentation for an explanation of a visitor/listener in the Antlr runtime). To do this, the Antlr LSP server needs to parse the grammar to get the name of the grammar symbol the cursor is currently on. But it needs to parse C# (or C++, etc.) as well. Currently, it assumes the source has been written to disk, and the Antlr server uses the Microsoft.CodeAnalysis APIs directly. The user opens a Solution or Project in VS2019, opens a window on an Antlr grammar source file, positions the cursor on a non-terminal, then asks for “go to visitor method”. The client sends the Antlr server a custom “go to visitor method” message, because the knowledge of Antlr grammars is in the server. The server parses every C# source file (it could equally be C++ or another target language) to find the visitor method corresponding to the grammar non-terminal. If the method exists, the server reports its location back to the client. If it doesn’t exist, the server may create or modify a source file, adding a class or a method to correspond to the grammar non-terminal. Note that the Antlr server has no knowledge of whether the client currently has an LSP server open on the C# source, nor does it know whether the client’s copy is being modified but has not been saved. The server does not currently modify the workspace and report the change back to the client, but it should.
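The lookup step can be illustrated with a small sketch. It assumes the naming convention of Antlr's C# target, where the visitor method for a rule named expr is VisitExpr, and it swaps the Microsoft.CodeAnalysis parse for a plain regular-expression scan so that the example stays self-contained. The function names and the sample source are hypothetical, not taken from AntlrVSIX.

```python
import re
from typing import Optional, Tuple


def visitor_method_name(rule_name: str) -> str:
    """Build the C#-target visitor name: 'Visit' + rule name with first letter upper-cased."""
    return "Visit" + rule_name[0].upper() + rule_name[1:]


def find_visitor_method(source: str, rule_name: str) -> Optional[Tuple[int, int]]:
    """Return the 1-based (line, column) of the visitor method name, or None.

    Assumption: a regex scan stands in for a real C# parse, so it can be
    fooled by comments or strings that happen to contain the method name.
    """
    pattern = re.compile(r"\b" + re.escape(visitor_method_name(rule_name)) + r"\s*\(")
    for line_no, line in enumerate(source.splitlines(), start=1):
        match = pattern.search(line)
        if match:
            return (line_no, match.start() + 1)
    return None


# Hypothetical C# source, as it might sit on disk in the user's project.
SAMPLE_CSHARP = """\
public class MyVisitor : ExprBaseVisitor<int>
{
    public override int VisitExpr(ExprParser.ExprContext context)
    {
        return 0;
    }
}
"""

if __name__ == "__main__":
    print(find_visitor_method(SAMPLE_CSHARP, "expr"))  # location of VisitExpr
    print(find_visitor_method(SAMPLE_CSHARP, "term"))  # None: no such method yet
```

A None result corresponds to the case where the server would have to create the method, which is exactly the point where it starts modifying files the client may also have open.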

The problem with this scenario is that there is no shared LSP server for the C# code: it’s owned by the editor and/or the LSP server that parses the C# source code. However, I need to parse the source code in order to provide this navigational (and possibly source code generation) feature. Is there any coordination of services between servers?

After several months of work, I’ve finally released version 5 of the Visual Studio 2019 extension for Antlr, AntlrVSIX. This version is a re-architecture of the code around a Language Server Protocol implementation. The extension is slimmer because it now focuses only on support for Antlr grammars and removes some of the features that LSP does not support (e.g., “go to listener/visitor” and colorized tagging). However, I have now added a template for creating C++ Antlr programs (in addition to the familiar C# Antlr template). The templates use an updated version of the msbuild rules and tool in Antlr4BuildTasks that works for csproj and vcxproj files (C# and C++ projects, respectively). There is a .NET Core template in Antlr4BuildTasks.Templates as well. –Ken

The next step in the development of my LSP server for Antlr involves support for code completion. There is an API written by Mike Lischke that computes the lookahead sets from the parser tables and input, but it turns out to have a bug in some situations. So, I’ll now describe the algorithms I wrote to compute the lookahead set for Antlr parsers. I’ll start with a few definitions, then go over the three algorithms that do the computation. One implementation is available in C# in the AntlrVSIX extension, where it is used for code completion; another is provided by default in the Antlr .NET Core templates I wrote, where it is used for syntax error messages.
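To make the idea of a lookahead set concrete, here is a deliberately simplified sketch. It is neither one of the three algorithms mentioned above nor Mike Lischke's API. It assumes the grammar has already been flattened into a small transition network with no rule calls, whereas a real Antlr parser also has to track the rule invocation stack. All names and the sample network are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# A transition is (label, target state); a label of None is an epsilon move.
Transitions = Dict[int, List[Tuple[Optional[str], int]]]


def epsilon_closure(states: Set[int], net: Transitions) -> Set[int]:
    """Add every state reachable from `states` through epsilon moves only."""
    closure = set(states)
    work = list(states)
    while work:
        state = work.pop()
        for label, target in net.get(state, []):
            if label is None and target not in closure:
                closure.add(target)
                work.append(target)
    return closure


def lookahead_set(net: Transitions, start: int, tokens: List[str]) -> Set[str]:
    """Consume `tokens` from `start`, then return the token types that may come next."""
    current = epsilon_closure({start}, net)
    for token in tokens:
        moved = {t for s in current for label, t in net.get(s, []) if label == token}
        current = epsilon_closure(moved, net)
    return {label for s in current for label, _ in net.get(s, []) if label is not None}


# Hypothetical network for the rule:  stmt : ID '=' (ID | INT) ';' ;
NET: Transitions = {
    0: [("ID", 1)],
    1: [("=", 2)],
    2: [(None, 3), (None, 4)],
    3: [("ID", 5)],
    4: [("INT", 5)],
    5: [(";", 6)],
    6: [],
}

if __name__ == "__main__":
    # After "x =" the parser can accept either an ID or an INT.
    print(sorted(lookahead_set(NET, 0, ["ID", "="])))
```

For code completion the resulting set gives the candidate token types at the cursor; for syntax errors it gives the "expected one of ..." list. An empty result means the input seen so far cannot be matched by the network at all.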

For those interested in creating an Antlr4 program using C#, I wrote a dotnet package and uploaded it to NuGet. There is similar functionality in the VS2019 extension AntlrVSIX, but I am starting to move towards a Language Server Protocol client/server implementation for Antlr. This package builds on the work I did with Antlr4BuildTasks, which supports MSBuild builds using the Java Antlr tool v4.7.2.

As noted even in the old MS documentation page, many of the client features are enabled, e.g., go to def, find all refs, reformat, hover, and typing completions. What is missing is building and debugging. But it is very usable.

After wasting a bit of time over the last few days, I figured out how to get the GNU Emacs editor to work with the Omnisharp-Roslyn LSP server for C#. Finding the right solution required a lot of trial and error because I work mainly on Windows, and that is completely sacrilegious.

This is an article about my ongoing research regarding the state of Antlr grammars for parsing Java. Some of the tests are taking weeks of computing time, so the results are preliminary.

Antlr is a popular LL(*) parser generator used to build recognizers for C#, Java, and many other programming languages. For Java, there are three grammars available on the Antlr grammar website: Java, Java8, and Java9. If you are a developer who hasn’t followed the maintenance history, it is unclear which grammar to choose. Some of the changes that have been made to one grammar have not been applied to the others. The basis for all of the grammars, however, is The Java Language Specification. Unfortunately, the latest version available now is 13, which makes all of them out of date.

1/2: Adding to #Antlrvsix an analysis tool of #Antlr grammars. This is how it works with cycle detection and useless lexer rules. There are some issues in the responsiveness of the MS LSP client for VS2019.

Adding to #Antlrvsix the refactoring to remove useless parentheses in an #Antlr grammar. This is how it works with the extra parentheses in the arrayAccess_lf_primary rule in Java9.g4 that nobody knew were there. Only just starting to scratch the surface of grammar optimizations.

Implementing #Antlr grammar fold refactorings in #Antlrvsix. Two types: extract a selected sequence of symbols and make a rule (shown first); replace all occurrences in the grammar with a folded rule (shown second). Spacing and comments do not matter.