Google SoC Series: ANTLR v3 Ruby Parser

XRuby is an effort to create a Ruby to Java bytecode compiler. A compiler, of course, needs a way to parse the input language, and so the XRuby team created their own Ruby parser using the popular ANTLR parser generator. Parser generators take the grammar of a language and generate code to parse it. Using ANTLR meant that the grammar and parser had to be created from scratch, which is different from JRuby's approach. Ruby's parser uses another parser generator called YACC, and JRuby chose to reuse this grammar and use a Java port of Yacc to generate it's parser.

When asked about whether Ruby is a difficult language to parse, Wang Haofei, who works on this Google SoC project and the XRuby team, explains:

Xue has fixed a few bugs since then, but generally it is very stable. We will write and run more tests during the SOC project and it may help us to uncover some unknown issues.

Xue Yong Zhi is the mentor for this Google SoC project and also works on XRuby.

One big part of the Google SoC project seems to be porting the existing parser to ANTLR's new version 3. Wang Huofei:

1. ANTLR v3 is a rewrite of v2 and has significantly enhanced parsing strength via LL(*) parsing, v2 is much weaker (limited LL(k)) and it forced me to add some hacks to workaround some problems. While ANTLR based parser is easier to maintain than others, migration to v3 will help us to make it better and cleaner. 2. There will be a ruby backend for ANTLR v3, so that we may be able to have ruby parser in ruby. 3. ANTLR v3 's performance is much better.

The second item in this list brings up an a very interesting point. Ruby lacks a Ruby parser written in Ruby. This is an issue for writing tools that handle Ruby code. Code analyzers, refactoring tools and automated refactorings, formatters and more, are difficult, if not impossible, to write in Ruby, because there is no way to parse Ruby source with Ruby code. There are workarounds, like Ryan Davis' ParseTree which uses the parser of Ruby's interpreter (via a native extension) to get at the Abstract Syntax Tree (AST) for Ruby source. An AST is a tree representation of source code, and is necessary for tools that need to know about the structure of the code. Yet, even ParseTree is not a complete solution, since it's current versions don't give the source locations of individual nodes. Obviously, a refactoring algorithm that, for instance, wants to rename an identifier in a Ruby source file needs to know where the identifier actually is.

This issue has become ever more obvious in the past year, due to the arrival of various Ruby IDEs. These ship with code analyzers (to warn about potential errors in the code) and the Eclipse based RDT is the first IDE to feature extensive refactoring support for Ruby. Other features are support for popular Ruby based files, such as Rake files, Ruby's equivalent of make or Ant. The problem: these tools are all built with Java (or other languages) and Java IDEs all use JRuby's parser.

This means that the functionality of these tools is locked in those languages, and even worse, often tied to particular IDEs. For instance, the logic for RDT's refactoring support is not available for, say, Ruby in Steel, an IDE built on Visual Studio. This is different than, for instance, in the Java space, where parsers are available. Tools such as PMD or Findbugs are naturally written in Java and thus available wherever Java runs and, more importantly, can be extended with Java code.

It depends on how well we are doing. We are willing to do that even if it may not fit in SOC schedule.

Good news.

One necessity for making code tools is the AST, to analyze source. ParseTree, as already mentioned, offers a format to represent Ruby source code. Tools, based on ParseTree, already exist, such as Ruby2Ruby which can turn an AST back into Ruby source code; useful if a tool wants to modify an AST and then output it as Ruby source. Rubinius, a project to implement a Ruby VM in Ruby, also uses ParseTree output to compile Ruby to the Rubinius bytecode, which is then interpreted. When asked about the output of the parser, Wang Haofei explains:

ANTLR has its own builtin AST support, and it is very handy if you want to serialize to a string or change to other struture. It looks similar to ParseTree's output. In XRuby we turned the AST into a DOM-like structure and use visitor pattern to generate java bytecode.

While ParseTree output doesn't seem to be planned, it's entirely possible to translate the ANTLR generated AST into the ParseTree format. A similar approach is already used by JParseTree, a port of ParseTree that works on JRuby, now a part of JRuby Extras which provides JRuby ports of popular Ruby libraries.

This step to ANTLR v3 parsing is promising!On a side note it's not the only stopper for a full-ruby IDE, do you know any robust crossplatform Rich Client framework in Ruby (or easily pluggable with it) that can compete with the ones in Java or limited-platform natives C guis ? Tk & friends are a joke.

Well, there are lots of GUI toolkits, but I haven't used any of them. The only GUI Ruby code I wrote was in JRuby using SWT. I tried using FreeRIDE freeride.rubyforge.org/wiki/wiki.pl but I'm not sure how alive it is. The problem is, of course, that writing an IDE from scratch is a _big_ task, and that's just the start, because the IDE is only really valuable when an ecosystem develops around it. Eclipse has that (and with RDT/RadRails has great Ruby support), I guess Visual Studio has that too. Frankly, I'd prefer using Eclipse RDT, but I'd like to have things like a Ruby Lint or Refactoring or other tools written in Ruby. This is possible now (the code can be run with JRuby) but this code would rely on the JRuby AST, which, naturally, isn't available anywhere else.I wrote about the need for a standard AST in Ruby in more detail here:jroller.com/page/murphee?entry=ruby_let_s_get_an