Tag: dgrok

Halfway there! DGrok can now parse 46 of the 91 Delphi RTL source files — a hair over half.

Except that, of course, it’s way more than halfway; it took a lot of work to get this far. Most of what’s left is the various statements like repeat, with, try..finally, etc.

And then, of course, there’s the fact that it’s probably way less than halfway. Include files ({$I} / {$INCLUDE}) aren’t working, and I haven’t figured out how they’ll fit into the demo app yet. And there’s the whole issue around symbol tables, which I’ll need to do anything useful like refactoring.

Still, it’s a major milestone, so I figured I’d do a release. DGrok 0.3 is now available for download. Major new stuff in this release:

If you double-click a file that’s failing because of a compiler directive, it will now take you to the error location rather than showing a .NET “unhandled exception” dialog.

Began adding statement handling. Currently it can’t handle much — mostly method calls, assignments, and if statements. This is my main area of focus right now.

Parsing of method implementations, including the smarts to not expect a method body if the method is declared “forward” or “external”.

Parsing of unit implementation sections.

Parsing of “program” and “library” files.

Fixed parsing of “const” sections that come inside a class or record. (When the const section was followed by another visibility section, it was getting confused and thinking the “public” was the name of another constant; it doesn’t anymore.)

Many minor tweaks to the grammar (it turns out that semicolons are optional after field declarations; I didn’t have threadvar in the grammar yet; initialized record-type variables; operator overloads; that sort of thing).

Lots of exciting behind-the-scenes stuff that you wouldn’t recognize as cool unless you’d already been working with the code: strongly-typed node properties, generic ListNode and DelimitedItemNode, and partial classes.

It can now parse 13 of the Delphi RTL source files (only 78 more to go). It should be able to parse interface sections in full, with the exception of class (and record) helpers and records with variant sections.

I also made some improvements to the demo app. The biggest change, apart from fixing the hard-coded path, is that you can double-click a filename in the treeview. This brings up that file in a new window — and if there was an error parsing the file, the cursor is positioned at the error location. (But no, there’s no GUI yet for specifying IFDEFs.)

Well, that’s just great. I had thought that directives came after the required semicolon, and that each directive had its own required trailing semicolon. Now I come to find that they come before the required semicolon, and each directive has its own optional, leading semicolon. Which meant I had to spend a couple of hours rearranging my grammar and updating my regression tests, since all directives now needed to use the same class (with its anomalous leading semicolon) and the logical order of MethodHeadingNode’s properties had changed.

And then there are procedural types (type TFoo = procedure, procedure of object, etc). I hadn’t even realized that they could have directives, since most directives don’t make any sense for a procedural type, but I forgot about calling conventions. As it turns out, not only do procedural types support directives, they support three very distinct syntaxes:

Yep, any directives can show up either before or after “of object”. And the semicolons are optional in the ones after “of object”. But no, you can’t put any semicolons before “of object”. Consistency? Who needs it?

Makes me really glad I’m hand-coding my parser — automated parsing tools would’ve gone nuts with the ambiguity in something like “zero or more directives, followed by an optional ‘of object’, followed by zero or more possibly semicolon-delimited directives”. Not to mention the variable declaration with a semicolon in the middle. It might be doable with an automated tool, but not without a lot of pain. With a hand-coded parser, it’s just a place to write more automated tests to cover the goofy behavior.

But… goofy or not, it’s valid Delphi, so I’m going to do my best to parse it. So my ProcedureTypeNode now has two properties for directives: FirstDirectives (which comes before Of and Object), and SecondDirectives (which comes after). For any non-of object types, SecondDirectives is always empty.
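To make the two-property layout concrete, here is a minimal Python sketch of the rule described above. It is not DGrok's actual code: the directive set, the lower-cased token list, and the names `ProcedureTypeNode.first_directives`, `second_directives`, and `parse_procedure_type` are assumptions for illustration, and parameter lists are left out entirely.

```python
from dataclasses import dataclass, field

# Illustrative directive set (an assumption): the ones named in the post
# plus the usual Delphi calling conventions.
DIRECTIVES = {"assembler", "far", "near", "cdecl", "pascal",
              "register", "safecall", "stdcall"}


@dataclass
class ProcedureTypeNode:
    first_directives: list = field(default_factory=list)
    has_of_object: bool = False
    second_directives: list = field(default_factory=list)


def parse_procedure_type(tokens):
    """Parse the lower-cased tokens that follow 'procedure' in a type.

    Returns the node and the tokens that were not consumed.
    """
    node = ProcedureTypeNode()
    pos = 0
    # Directives written before "of object": no semicolons allowed here.
    while pos < len(tokens) and tokens[pos] in DIRECTIVES:
        node.first_directives.append(tokens[pos])
        pos += 1
    if tokens[pos:pos + 2] == ["of", "object"]:
        node.has_of_object = True
        pos += 2
    # Remaining directives may each have an optional leading semicolon.
    # Without "of object" they still count as first directives.
    target = node.second_directives if node.has_of_object else node.first_directives
    while True:
        look = pos + 1 if tokens[pos:pos + 1] == [";"] else pos
        if look < len(tokens) and tokens[look] in DIRECTIVES:
            target.append(tokens[look])
            pos = look + 1
        else:
            return node, tokens[pos:]
```

Calling `parse_procedure_type(["assembler", "far", "of", "object", ";", "stdcall", ";"])` puts `assembler` and `far` in the first list and `stdcall` in the second, and hands back the final semicolon for the enclosing declaration to consume.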

Sheesh.

1 No, assembler doesn’t make any sense in a procedural type. But that’s okay, because far doesn’t make any sense in Win32 at all. The compiler just ignores them both, so they’re harmless, if meaningless. The only other directives that are allowed on procedural types (besides near) are calling conventions, and I figured an example with assembler and far would be less confusing than an example with two or three contradictory calling conventions. (Which is perfectly valid, by the way, although I have no idea what it does.)

If you tried to download DGrok before now, you may have had trouble opening the ZIP. Sorry about that.

I use 7-Zip on my home dev machine, because it’s free and open-source. It also comes with a command-line EXE, so I made a Rake task to automatically build a ZIP file for the DGrok distribution (took me most of Sunday to get everything right). It, ah, didn’t occur to me that 7-Zip would default to using its own file format, instead of standard ZIP. (It worked fine on my machine!)

I dug through the docs and found the “no, really, make a ZIP file” parameter (-tzip, if you’re interested). The updated DGrok 0.1 is now available for download. Let me know if the download causes you any problems.
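For reference, the shape of the command is roughly as follows. The original build step was a Rake task; this Python sketch only assembles the argument list, and the executable name `7z` and the file names in the usage note are assumptions.

```python
def seven_zip_command(archive, paths, executable="7z"):
    """Build the argument list for adding paths to a standard ZIP archive."""
    # "a" is 7-Zip's add command; "-tzip" forces the ZIP format instead of
    # 7-Zip's own .7z format.
    return [executable, "a", "-tzip", archive, *paths]
```

Something like `subprocess.run(seven_zip_command("DGrok.zip", ["bin", "src"]), check=True)` would then run it, assuming 7-Zip's command-line tool is on the PATH.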

I can successfully parse four of the source files that ship with Delphi. I’d say that’s a major milestone. So I’m releasing version 0.1 (alpha) of DGrok.

The source code is included in the download. DGrok is open-source, under the Open Software License (I’d rather use GPL, but I’m stuck with OSL because of NUnitLite).

I’ve included a GUI demo app that shows off the current capabilities a bit. It has two major screens: Ad-Hoc Parsing and Parse Source Tree.

Ad-Hoc Parsing lets you type in some source code, select which parse rule you want to use, and click Parse (Alt+P). The box in the lower right shows either the parse tree (if parsing was successful) or the error message. Additionally, if there was an error, the focus is put back in the edit box, and the cursor is moved to the error location.

If you want to type an entire source file, select the Goal rule (this is selected by default when the app starts). Or you could use Unit, if you know it’s really a unit and not a project, library, or package. If you just want to play around with expression parsing, select Expression. Or whatever. There’s a Grammar.html included in the ZIP that shows which rules are working in this release, and to what extent.

The Parse Source Tree tab lets you point DGrok at a directory, and set it loose. It will automatically search through subdirectories for .pas files (I should probably make it look at .dpr and .dpk files as well), load them, and try parsing them. (Since it knows they’re entire files, it doesn’t need to ask you which rule to apply; it automatically uses Goal.) If a file parses successfully, it gets listed under the “Passing” node; otherwise, the files are listed by error message. As you can see, there are still a lot of errors, so I must not be done yet.
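The scan-and-group behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the demo app's code: `parse_file` is a hypothetical stand-in for the parser and is assumed to raise an exception whose message is the parse error.

```python
import os
from collections import defaultdict


def parse_source_tree(root, parse_file):
    """Parse every .pas file under root and group the paths by outcome."""
    results = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(".pas"):
                continue
            path = os.path.join(dirpath, name)
            try:
                parse_file(path)
            except Exception as error:
                # Failing files are listed under their error message.
                results[str(error)].append(path)
            else:
                results["Passing"].append(path)
    return dict(results)
```

Grouping by message rather than by file makes it easy to see which single grammar gap is responsible for the largest number of failures.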

Note that there isn’t currently a GUI for telling it which $IFDEFs evaluate to “true” and which evaluate to “false”. And if it doesn’t know whether something is true or false, it bombs out with an error. This is on purpose — I wanted to make absolutely sure I didn’t miss anything that should be defined — but it’s probably inconvenient if you’re trying to parse anything other than the Delphi RTL source that I’ve already tuned it for. I’ll get a GUI for this in a future version.
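The "bomb out if unsure" policy amounts to a three-way answer: true, false, or an error. A minimal Python sketch, where the `known` table, the function name, and the exception type are all hypothetical:

```python
class UnknownSymbolError(Exception):
    """Raised when a conditional symbol is neither known-true nor known-false."""


def evaluate_ifdef(symbol, known):
    """Return whether {$IFDEF symbol} is true.

    known maps upper-cased symbol names to True or False. A symbol missing
    from the table is an error rather than a silent False.
    """
    key = symbol.upper()  # Delphi conditional symbols are not case-sensitive
    if key not in known:
        raise UnknownSymbolError(f"Don't know whether {symbol} is defined")
    return known[key]
```

With `known = {"MSWINDOWS": True, "LINUX": False}`, asking about `Linux` returns `False`, while asking about a symbol that is not in the table raises instead of guessing.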

The ^M string-literal syntax reads like the classic notation for “control-M”. It’s valid Delphi grammar, it compiles, and it works.

That said, I have no plans to support it in my parser. I find the string literals during the tokenizing pass, and at that stage, I can’t tell the difference between the control character (^M) and the pointer type (^J) in this snippet:

const
  CR = ^M;
type
  J = ...;
  PJ = ^J;

Pointer syntax is used a lot more often (translate: I’ve only ever seen one source file with ^M string-literal syntax, and that was in-house), so I’ll give preference to being able to handle pointers correctly.
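A toy tokenizer shows why the two cases are indistinguishable at this stage. This is a Python sketch under simplifying assumptions (only carets, identifiers, `=` and `;` are recognized), not the DGrok lexer:

```python
import re

# Caret, identifier, or one of the two punctuation marks used in the snippet.
TOKEN = re.compile(r"\s*(\^|[A-Za-z_][A-Za-z0-9_]*|[=;])")


def token_kinds(source):
    """Return the kind of each token, using no context beyond the characters."""
    kinds, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        text = match.group(1)
        if text == "^":
            kinds.append("Caret")
        elif text in ("=", ";"):
            kinds.append(text)
        else:
            kinds.append("Identifier")
        pos = match.end()
    return kinds
```

Both `token_kinds("CR = ^M;")` and `token_kinds("PJ = ^J;")` come out as identifier, `=`, caret, identifier, `;`. Only the surrounding const or type section tells them apart, and a tokenizer does not know which section it is in.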

I have thought about doing some string-literal folding at parse time… for example, to join

into a single StringLiteral token in my parse tree. (This would make it possible to write a frontend that provides a “find in string literals” feature, and make it able to find “pede leo” in the above snippet.)
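As a rough idea of what such folding could look like, here is a Python sketch. It assumes the pieces being joined are adjacent quoted segments and `#` character codes with nothing between them; that is an assumption, and folding across `+` concatenation would need more work. The function name is hypothetical.

```python
import re

# A quoted segment ('' is an escaped quote) or a #nn / #$hh character code.
PIECE = re.compile(r"'((?:[^']|'')*)'|#(\$[0-9A-Fa-f]+|[0-9]+)")


def fold_string_literal(text):
    """Fold adjacent quoted segments and character codes into one string."""
    value, pos = [], 0
    while pos < len(text):
        match = PIECE.match(text, pos)
        if match is None:
            raise ValueError(f"not part of a string literal at offset {pos}")
        if match.group(1) is not None:
            value.append(match.group(1).replace("''", "'"))
        else:
            code = match.group(2)
            number = int(code[1:], 16) if code.startswith("$") else int(code)
            value.append(chr(number))
        pos = match.end()
    return "".join(value)
```

A frontend could then search the folded value, so a query for a phrase would still match when a character code sits in the middle of it, as in `'pede'#32'leo'`.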

If you’re a hardcore VCL geek like me, you probably already know about Delphi’s “type type” feature. But I learned some interesting details about it last night while discovering the Delphi grammar.

The documentation doesn’t give a name to this language construct, so “type type” is a name I made up. It’s when you prepend “type” to a type declaration to give it its own type identity:

type
  TColor = type Integer;

I won’t go into any details on what you would use this for, because it’s not that useful for anything outside the Object Inspector. You probably wouldn’t even notice it was a different type if you never used it as a var parameter. But it’s more useful than goto, and I’ve got that in my parser…

Anyway, I found some interesting details about “type type” in my research. Specifically, there are only three type constructs that allow you to prepend “type” to them: identifiers, string, and file.

I suspect the reason for this is that (according to the documentation) every time you declare something like string[42], it’s automatically considered a distinct type from every other string[42] you’ve ever defined — and therefore has its own RTTI identity, and isn’t var-parameter compatible with the others. You don’t need to declare type string[42] because it’s already distinct.

I found type file to be particularly interesting. Even if you’re dealing with an untyped file — the ones where you have to pass that second parameter to Reset and Rewrite to make them even remotely useful, the ones that have been utterly replaced by TStream — you can still make distinct types, and let the compiler make sure you pass the right ones to your var parameters. That’s actually an interesting feature, since file parameters always have to be var. I wonder if anybody has ever used this.

If you put the sealed keyword more than once, does that make the class somehow extra-sealed?

Interestingly, sealed and abstract are both what I’m calling “semikeywords” — they only have a special meaning in a particular context. Outside that particular context, they can be used as plain old identifiers. So you could actually have a field in the class called sealed or abstract… as long as it’s not the first thing after the class keyword. Add another field first, or a visibility specifier, and you’re fine:
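In parser terms, a semikeyword is just an identifier that the parser chooses to interpret in one position. A small Python sketch of that idea for the tokens following the class keyword; the function name and token lists are hypothetical, and it does not check which modifier combinations the compiler actually accepts:

```python
CLASS_MODIFIERS = {"sealed", "abstract"}


def split_class_modifiers(tokens_after_class):
    """Split the tokens following 'class' into leading modifiers and the rest.

    'sealed' and 'abstract' only count as modifiers while they lead the
    list; anywhere later they are left alone as ordinary identifiers.
    """
    count = 0
    while (count < len(tokens_after_class)
           and tokens_after_class[count].lower() in CLASS_MODIFIERS):
        count += 1
    return tokens_after_class[:count], tokens_after_class[count:]
```

Given `["sealed", "public", "sealed", ":", "Integer", ";", "end"]`, the first `sealed` is taken as a modifier and the second stays in the body, where the field parser can treat it as a field name.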

I’ve been researching the syntax for class helpers, and found some very interesting things. First, that class helpers can descend from other class helpers. And second, that they can have virtual methods.

Class helpers, for anyone not familiar with them, are a way of adding methods to an existing class — or at least, making it look like you do. The existing class is left unchanged; you might just as well be writing unit procedures that take an instance as their first parameter, except that class helpers make the code look nicer because you’re actually saying “Foo.NewThing” rather than “NewThing(Foo, …)”.

Now, since you’re not actually modifying the existing class, your class helper can’t have any fields (there’s no place to store them). Nor can you override methods from the class you’re helping (since that would involve changing the VMT). So this whole “virtual” thing really surprised me.

So back to the interesting discoveries. First, class helpers can descend from other class helpers, but the syntax isn’t what I would have guessed:

Presumably this would only make sense if they were helpers for different classes, but that’s the syntax: the parent class goes after “helper”, not after the entire “class helper for” clause. The parent must be another class helper (not an ordinary class).

Now, the really interesting bit: the compiler lets you put virtual methods on these guys.

I was curious; I added those methods. Since you can’t add fields (e.g. FRefCount) to a class helper, I made _AddRef and _Release both return -1, to indicate that the class wasn’t refcounted. Then I wrote some code that called that virtual method, and ran my app.

Interesting.

So I looked at the assembly code that was generated for that virtual helper call. And sure enough, it was looking for an interface:

Very interesting, says I. That explains why it wanted me to implement IInterface on the helper: it somehow uses interfaces to deal with this “virtual helper method” business. But exactly which interface was it looking for, and why was it crashing? What else did I need to do? What interfaces did I need to implement? How could I implement interfaces, when the compiler doesn’t allow interface syntax (“class helper (TFooHelper, ISomething)” fails with “‘)’ expected but ‘,’ found”)?

So I opened up System.pas to look for this @GetHelperIntf method. Here’s the lone line of code in its implementation, along with the answer to why the app crashed…

The research

The first step was to do all the research to figure out what the Delphi grammar is. This is not easy. The Delphi 5 documentation included an incomplete, and sometimes wildly inaccurate, BNF grammar. The Delphi 2006 documentation no longer includes the grammar; either the documentation team lost it (along with the docs for the IDE’s regular-expression syntax), or they gave up because it was so far out of date. The language has added loads of features since then: strict private, records with methods, even generics on the way.

So I had a rough sketch to start from, and an undergraduate compiler-design class from ten years ago. The rest — correcting the errors, and filling in the (large) blanks — is trial and error, and a lot of refactoring.

The upshot is, if you see something I’m missing, let me know. Fatih Tolga Ata already put class helpers and generics on my radar, although I can’t really do much with generics yet. There is no official (correct) grammar from CodeGear, so my main method of discovering the grammar is to type stuff into the IDE and see what compiles (and, often more instructively, what doesn’t). That means I won’t be able to figure out the generic grammar until I have Highlander.

The tool

As I puzzle out the grammar, I document it in a YAML file. Here’s a snippet from this file:

The ! at the beginning of a line means “I’ve implemented this in my parser”; the . means “I haven’t implemented this yet”. That’s what drives the “(Completed)” and “(In progress)” notations in the grammar document.

I wrote a Ruby script that reads this YAML file and generates the HTML Delphi grammar documentation. That Ruby script is the part that’s cool enough to figure out which rules are fully implemented (shown with a solid underline), which are partially implemented (e.g., Atom, as shown above; shown with a dashed underline), and which ones I haven’t started on yet (no underline). It also figures out the “Likely Targets” — the rules whose dependencies are (fully or partially) done: the ones I can (probably) work on next.
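The status and "Likely Targets" logic can be sketched as follows. The real script is Ruby and reads YAML. This Python version uses plain dictionaries, and the data layout, the rule contents in the usage note, and the choice to list only not-yet-started rules as targets are all assumptions.

```python
def rule_status(lines):
    """Classify a rule from its grammar lines.

    Each line starts with '!' (implemented) or '.' (not implemented yet).
    """
    done = sum(1 for line in lines if line.startswith("!"))
    if done == len(lines):
        return "complete"
    return "partial" if done else "not started"


def likely_targets(grammar, dependencies):
    """Return unstarted rules whose dependencies are all at least partly done.

    grammar maps rule name to its lines; dependencies maps rule name to the
    rules it references. A rule that references itself is not held back.
    """
    status = {name: rule_status(lines) for name, lines in grammar.items()}
    return sorted(
        name for name, state in status.items()
        if state == "not started"
        and all(status[dep] != "not started"
                for dep in dependencies.get(name, ()) if dep != name)
    )
```

For a toy grammar where `Number` is complete, `Atom` is partial, and `Factor` (which uses `Atom` and itself) and `Term` (which uses `Factor`) are untouched, only `Factor` comes back as a likely target, since `Term` is still waiting on it.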

I edit the YAML file frequently — as you can imagine, since it reflects my completion status. And I refer to the generated HTML document just as frequently. So I’ve made the Ruby script part of my Rakefile. It works out fairly well.

Of course, uploading the HTML doc to my Web site happens… a little less frequently. I just uploaded the latest version this morning, but before that, it looks like it had been a little more than a year since my last update. I’ll try to keep it a little more active.