Incremental Packrat Parsing: The Secret To Fast Language Servers

An adaptation of a 16-year-old parsing algorithm is ideal for providing real-time code diagnostics in Visual Studio Code, Atom, and more.

Eric McCarthy Aug 19, 2018

If you are like me, you probably don’t pay very close attention to the changelogs of your source code editor of choice, so you may have missed the quiet revolution going on under the hood. To provide advanced features like IntelliSense, live syntax validation, jump-to-definition, and parameter info boxes, code editors spin up child processes that extract the semantic information these features need from your code. This approach has tremendous advantages, but there is one problem: many existing parsers are not up to the task of being integrated into a language server.

The Language Server Protocol

The child process for analyzing your code that your editor spawns—if your editor is Visual Studio Code, Atom, Sublime Text 3 with sublime-slp, or IntelliJ with intellij-lsp—communicates with the editor process using a standard called Language Server Protocol (LSP). The wide base of support for LSP means you no longer need to write a language plugin for every editor—or more likely, write just one plugin and tell everyone who uses your language to use that one editor. Instead, you can create a single language server and it will work in any code editor that supports LSP!

When people first hear about language servers, there is sometimes a bit of concern, because it sounds like something that will transmit your keystrokes to a remote server or even a third party. That sounds neither fast nor secure. However, language servers are not meant to be run remotely over a network. Instead, they are meant to be spawned locally by the editor application and to communicate with it via inter-process communication (IPC). It’s a server because it provides a service (code diagnostics and semantic information) to a client (the code editor).

If you want to create a language server for a language (say, a lesser-used domain-specific language or a bespoke programming language used by your company), you will likely first look at integrating the existing parser into the server. This can often produce adequate results, but there are many pitfalls to be aware of.

Integrating Existing Parsers Into A Language Server

There is a decent chance an existing parser is not error-tolerant. In other words, if it hits a syntax error, it reports that single error and stops. To be useful for a code editor, the parser needs to do its best to recover from transient syntax errors and keep going as the user modifies the document.

Existing parsers for domain-specific languages may not focus on providing meaningful error reporting, so they may not do things like track the character offsets of tokens. Since the editor needs to know where in the source file things are, you won’t get very far without making sure your parser records offsets in the parse tree. Also note that LSP expresses a position as a line number plus a character offset counted in UTF-16 code units, so if your parser assumes 8-bit code units (ASCII or UTF-8) or 32-bit code units, you will need to convert its offsets to prevent emojis and other characters outside the Basic Multilingual Plane from making them wrong.
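
To make the offset problem concrete, here is a minimal Python sketch of the conversion. The function names are my own, and it assumes the parser reports either a code point index or a UTF-8 byte offset within a single line, and that the offset falls on a character boundary.

```python
def utf16_column(line_text: str, codepoint_index: int) -> int:
    """Convert a code point index within a line to a UTF-16 code unit offset."""
    prefix = line_text[:codepoint_index]
    # Each UTF-16 code unit is 2 bytes; characters outside the BMP use two units.
    return len(prefix.encode("utf-16-le")) // 2


def utf8_byte_to_utf16_column(line_bytes: bytes, byte_index: int) -> int:
    """Convert a UTF-8 byte offset within a line to a UTF-16 code unit offset."""
    # Assumes byte_index lies on a character boundary, otherwise decode() raises.
    prefix = line_bytes[:byte_index].decode("utf-8")
    return len(prefix.encode("utf-16-le")) // 2


line = 'say("😀") + x'
x_codepoint_index = line.index("x")                   # 11 code points before x
x_byte_index = line.encode("utf-8").index(b"x")       # 14 UTF-8 bytes before x

# Both conversions agree on the column an LSP client expects for x.
x_column = utf16_column(line, x_codepoint_index)
x_column_from_bytes = utf8_byte_to_utf16_column(line.encode("utf-8"), x_byte_index)
```

The emoji occupies one code point, four UTF-8 bytes, and two UTF-16 code units, so all three ways of counting disagree for everything that follows it on the line.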

Parsers for bespoke programming languages are often not built for extracting semantic data. Instead, their focus is on providing a parse tree that an interpreter can walk as it executes. You will probably have to create your own API to get meaningful data out of the resulting parse tree.

Existing parsers are also not always written in a language that you can easily integrate into a language server. The Language Server Protocol makes use of JSON-RPC, which has implementations in many languages. But not every language makes server programming easy, so you may find yourself doing a lot of work just to get a server up and running. Likewise, if you want to use an existing language server implementation as a starting point, and it is not written in the same language as your existing parser, you will need to write a lot of glue code to pass data between the different runtimes.
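
For a sense of how small the transport layer is, here is a rough Python sketch of the message framing: a JSON-RPC body preceded by a Content-Length header. The helper names, the file URI, and the diagnostic text are made up for illustration, and a real server would also have to read from a stream and handle partial messages.

```python
import json


def frame_message(payload: dict) -> bytes:
    """Serialize a JSON-RPC message and prepend the Content-Length header."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def parse_message(data: bytes) -> dict:
    """Parse one framed message (assumes the whole message is already in `data`)."""
    header, _, rest = data.partition(b"\r\n\r\n")
    length = None
    for line in header.split(b"\r\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(rest[:length].decode("utf-8"))


# A diagnostics notification such as a server might send after parsing a file.
notification = {
    "jsonrpc": "2.0",
    "method": "textDocument/publishDiagnostics",
    "params": {
        "uri": "file:///example/main.mylang",  # hypothetical document
        "diagnostics": [
            {
                "range": {
                    "start": {"line": 0, "character": 4},
                    "end": {"line": 0, "character": 5},
                },
                "severity": 1,  # 1 means Error
                "message": "Expected ')'",  # hypothetical parser message
            }
        ],
    },
}
framed = frame_message(notification)
```

The framing itself is easy in most languages. The glue code mentioned above is the expensive part: getting parse results out of one runtime and into whatever runtime writes these messages.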

There is one other problem that will affect just about any existing parser: it is unlikely to be an incremental parser, so it will need to re-parse the entire document on each edit in the code editor. For small source files this will be fine, but even moderately sized files could produce a significant delay before updated semantics can be returned to the client editor.

Given all these issues it becomes reasonable to start a new parser from scratch. And if you are going to do that, you may as well create an incremental parser to ensure it will be as performant as possible.

If you could use some help implementing a parser for a language server, get in touch!

Incremental Parsing

The obvious way to make a language server return its results quickly when a user makes an edit is to, well, only re-parse the parts of source code that have changed. Unfortunately this is a bit trickier than it may sound. You need to account for all the parts of the parse tree that an edit can affect.

This is a cool solution, but as mentioned in a footnote in that blog post, there are frustrating edge cases: “an edit that, for example, adds async to a method can cause the parse of await(foo); in the method body to change from an invocation to a usage of the await contextual keyword.”

A new approach called Incremental Packrat Parsing, presented by Patrick Dubroy and Alessandro Warth in 2017, seems to account for issues like this and seems easier to implement as well. The idea behind it is that some straightforward changes to packrat parsers can turn them into very good incremental parsers.

Packrat parsers have been around a while now, but I suspect they are not in widespread use. As their name suggests, they don’t throw anything away. Instead, they keep the result of applying every production at every input position in case it is needed again during backtracking. This saves time at the expense of memory.
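
A minimal sketch of that idea in Python, assuming a toy arithmetic grammar of my own choosing (the class and rule names are illustrative, not taken from any library). Every rule application goes through apply, which caches the result by rule name and input position, so a backtracking alternative that retries the same rule at the same position gets its answer from the memo table.

```python
from typing import Dict, Optional, Tuple

Result = Optional[Tuple[int, int]]  # (position after the match, value), or None
DIGITS = set("0123456789")


class PackratParser:
    def __init__(self, text: str) -> None:
        self.text = text
        self.memo: Dict[Tuple[str, int], Result] = {}
        self.hits = 0  # number of results served from the memo table

    def apply(self, rule: str, pos: int) -> Result:
        key = (rule, pos)
        if key in self.memo:
            self.hits += 1
            return self.memo[key]
        result = getattr(self, "rule_" + rule)(pos)
        self.memo[key] = result  # failures (None) are remembered too
        return result

    def literal(self, char: str, pos: int) -> bool:
        return pos < len(self.text) and self.text[pos] == char

    # expr <- term '+' expr / term
    def rule_expr(self, pos: int) -> Result:
        left = self.apply("term", pos)
        if left and self.literal("+", left[0]):
            right = self.apply("expr", left[0] + 1)
            if right:
                return right[0], left[1] + right[1]
        return self.apply("term", pos)  # second alternative: a memo hit

    # term <- factor '*' term / factor
    def rule_term(self, pos: int) -> Result:
        left = self.apply("factor", pos)
        if left and self.literal("*", left[0]):
            right = self.apply("term", left[0] + 1)
            if right:
                return right[0], left[1] * right[1]
        return self.apply("factor", pos)  # second alternative: a memo hit

    # factor <- '(' expr ')' / number
    def rule_factor(self, pos: int) -> Result:
        if self.literal("(", pos):
            inner = self.apply("expr", pos + 1)
            if inner and self.literal(")", inner[0]):
                return inner[0] + 1, inner[1]
        return self.apply("number", pos)

    # number <- [0-9]+
    def rule_number(self, pos: int) -> Result:
        end = pos
        while end < len(self.text) and self.text[end] in DIGITS:
            end += 1
        return (end, int(self.text[pos:end])) if end > pos else None


parser = PackratParser("2*(3+4)+5")
result = parser.apply("expr", 0)  # (9, 19): consumed all 9 characters, value 19
```

Without the memo table, each fall-through to the second alternative would parse the same term or factor a second time, and nested expressions would multiply that cost.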

Interestingly, memoizing every intermediate result in a memo table with a column for each character offset makes for a convenient way to do incremental (or partial) parsing. As the paper details, you can implement a relatively simple function that applies edits to the input string and invalidates the affected memo table entries. When you then rerun the packrat parser on the input, only the parts affected by the change are re-parsed, because the rest will be retrieved from the memo table.

Another key insight that Dubroy and Warth detail is the need to store the maximum examined position for every memo table entry. This “records the furthest position examined in the input stream over the entire course of parsing a rule.” This is critical, because a rule can look at characters beyond the text it actually matched, and an edit must invalidate every entry that examined the edited region. Ensuring this value is correct should prevent frustrating invalid parse tree errors that would otherwise accumulate as a document is edited.
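
Here is a rough Python sketch of how the memo table, the edit function, and the examined positions fit together. It is my own simplification for a toy grammar (a sum of numbers), not the authors' code, and every name in it is made up. Entries store lengths relative to their column, so columns after an edit can simply shift, and apply_edit drops only the earlier entries that looked at or past the start of the edit.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

DIGITS = set("0123456789")


@dataclass
class Entry:
    match_length: Optional[int]  # None means the rule failed at this position
    examined_length: int         # characters looked at, counting end of input
    value: int = 0


class IncrementalParser:
    def __init__(self, text: str) -> None:
        self.text = text
        # One memo column per input position, plus one for end of input.
        self.memo: List[Dict[str, Entry]] = [{} for _ in range(len(text) + 1)]
        self.max_examined = -1
        self.evaluations = 0  # rule bodies actually run, not served from memo

    def peek(self, pos: int) -> str:
        """Read one character ('' at end of input) and record that we looked."""
        self.max_examined = max(self.max_examined, pos)
        return self.text[pos] if pos < len(self.text) else ""

    def apply(self, rule: str, pos: int) -> Entry:
        entry = self.memo[pos].get(rule)
        if entry is None:
            outer, self.max_examined = self.max_examined, pos - 1
            self.evaluations += 1
            matched = getattr(self, "rule_" + rule)(pos)
            examined = self.max_examined - pos + 1
            if matched is None:
                entry = Entry(None, examined)
            else:
                entry = Entry(matched[0] - pos, examined, matched[1])
            self.memo[pos][rule] = entry
            self.max_examined = outer
        # Fresh or reused, the calling rule now depends on everything this one saw.
        self.max_examined = max(self.max_examined, pos + entry.examined_length - 1)
        return entry

    # number <- [0-9]+
    def rule_number(self, pos: int) -> Optional[Tuple[int, int]]:
        end = pos
        while self.peek(end) in DIGITS:
            end += 1
        return (end, int(self.text[pos:end])) if end > pos else None

    # sum <- number ('+' number)*
    def rule_sum(self, pos: int) -> Optional[Tuple[int, int]]:
        first = self.apply("number", pos)
        if first.match_length is None:
            return None
        end, total = pos + first.match_length, first.value
        while self.peek(end) == "+":
            nxt = self.apply("number", end + 1)
            if nxt.match_length is None:
                break
            end, total = end + 1 + nxt.match_length, total + nxt.value
        return end, total

    def parse(self) -> Optional[int]:
        self.max_examined = -1
        entry = self.apply("sum", 0)
        return entry.value if entry.match_length == len(self.text) else None

    def apply_edit(self, start: int, end: int, replacement: str) -> None:
        """Replace text[start:end] and drop only the memo entries it could affect."""
        self.text = self.text[:start] + replacement + self.text[end:]
        # Columns inside the edited range go away; new characters get empty columns.
        self.memo[start:end] = [{} for _ in replacement]
        # Earlier entries survive unless they examined the edit's start or beyond.
        for pos in range(start):
            column = self.memo[pos]
            stale = [r for r, e in column.items() if pos + e.examined_length > start]
            for rule in stale:
                del column[rule]


parser = IncrementalParser("12+34+56")
first_value = parser.parse()
first_evaluations = parser.evaluations             # sum plus three numbers

parser.apply_edit(6, 8, "7")                       # text becomes "12+34+7"
second_value = parser.parse()
reparse_evaluations = parser.evaluations - first_evaluations
```

After the edit, only sum and the last number are re-evaluated, while the entries for 12 and 34 are reused. Note that peek records a look at end of input as well. Without that, appending a digit to the final number would leave a stale entry in the table, which is exactly the kind of error the examined position is there to prevent.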

If you decide to implement an incremental packrat parser in JavaScript, you will likely want to use immutable data structures, both for performance and to keep memory usage to a minimum. While I wish the documentation were better, Immutable.js seems to be the best JavaScript library for immutable data structures.

Conclusion

I’m pretty darn excited about seeing this parsing technique make its way into language servers. If you are starting a new parser for a language server, be sure to read the paper! If you do not have an ACM Digital Library subscription, the authors have provided an open access PDF.

Eric McCarthy is a software engineer and owner of Unallocated LLC. If you have any feedback about this post or just want to get in touch, go ahead and send an email!