Wouldn't it be nice if you could "subclass" a language like C# and then add 5 lines of code to make it work and then use it? What if you could add Ruby-like method_missing support in 20 lines?

What if you could conceive of a new language and start experimenting with it in production code in an hour? What if it leveraged your knowledge of existing frameworks like .NET?

Let's imagine that it were possible. Let's say someone else created a basic calculator-like "language":

Even without knowing the syntax, you can probably figure out how it works since it's written very close to the standard way of describing a language. If you give this language "3*4", it will respond with 12. If you give it "2^(3*4)", it'd respond with a nice syntax error message like "I don't know how to deal with the '^' symbol. Sorry!"

But you could make it work by writing this object-oriented-like "subclass" of the language:

Now, giving this language "2^(3*4)" happily gives you 4096.

This worked using simple inheritance. What if you wanted to compose your language from several other languages? That is, something like "multiple inheritance" but without its problems? What if we took a simpler approach and allowed a language to attempt to match elements of any other language?

Let's say we want to build a scientific calculator that leverages some of the work we've already done. Instead of inheriting from it, we'll call out to it:

Now you can calculate expressions like "sqrt (1+2^3)!"

What if such a meta language (e.g. a language for describing other languages) existed? What would you do with it? How might it help you get thingsdone while producing extremely small, but very readable code? What if creating a new language that perfectly fit a problem your software was trying to solve was as easy as writing a simple regular expression?

What if you could get all the whiz-bang, gotta-have tool support like syntax highlighting, interactive debugging, and a compiler for it came at almost no extra cost? What if the barrier to entry for a new language was so low, that almost anyone could do it quickly?

".. a new object-oriented language for pattern matching. OMeta is based on a variant of Parsing Expression Grammars (PEGs) —a recognition based foundation for describing syntax—which we have extended to handle arbitrary kinds of data. We show that OMeta’s general-purpose pattern matching provides a natural and convenient way for programmers to implement tokenizers, parsers, visitors, and tree transformers, all of which can be extended in interesting ways using familiar object-oriented mechanisms. This makes OMeta particularly well-suited as a medium for experimenting with new designs for programming languages and extensions to existing languages."

To be honest, I didn't really understand the paper when I first read it. It sounded cool, but it used a few new terms I hadn't heard of before. However, I couldn't help but be fascinated by OMeta's readable syntax. OMeta doesn't try to solve every problem involved with writing a complete program. It makes no claims about being a good general purpose programming language. But it positions itself by doing just one thing really well:

Pattern matching.

If you look at modern programming languages that are good for developing other languages, you'll notice that they usually have good pattern matching capabilities. For example, in ML (which stands for meta-language), you can specify functions in terms of patterns. Here's how you can reverse a list:

This defines the function named "reverse." When it sees the "pattern" where the function is called with an empty list (nil), it just returns nil. However, when the function is passed in a list whose first element is "x" and the rest of the list is "xs", it will return the reverse of the rest of the list and then append the head of the list to the end.

If you think about it for a second, it's a really compact way of expressing how to reverse a list.

Streams of Anything

What I find interesting about OMeta is how it matches patterns. OMeta can operate on a stream of anything. This is useful because modern language implementations have several phases that work on different types of "things." Here's a simplified view of what a compiler does:

At the start of the process, you're working with individual characters from source code. The list of characters is converted to tokens by use of a scanner. This is how the characters 'i', 'n', and 't' become the "INT" token. The next phase combines tokens together to form a higher level view of the program, typically using a tree data structure. From that phase, several intermediate steps occur on the program (almost always in tree form), until it's finally pulled apart to make an executable program.

Often, you have to use different tools at each phase. You might use lex for the scanner (aka "lexer"), Yacc for the parser, and some sort of visitor pattern in your implementation language (e.g. C#) for many of the intermediate phases. With all of these different tools to learn, it's no wonder why most people don't even bother trying to create their own language.

OMeta is different in that lets you use the same syntax for all phases. The only thing that is different is what the input stream contains. At the start, the input stream is characters. In subsequent phases, it tokens, then trees, and then finally you produce your intended result. At each pass, your OMeta code has similar syntax.

Passing the Match Around

Another interesting bit about OMeta is how it goes about finding rules. Rules are defined in grammars which behave like classes in an object-oriented language. A grammar can "inherit" from another grammar. An added twist is that a grammar can call out to a "foreign" grammar. This makes it easier to build a language from many different reusable parts and allows for trivial "mashups" of several languages.

Additionally, if a rule has multiple "choices," then OMeta always tries to match the first choice, then the second, and so on until it achieves success. Although this seems obvious in hindsight, most parsers don't have prioritized choices and thus they make it easy to define ambiguous rules. For example, is "1-2-3" equal to "(1-2)-3" = -4 or "1-(2-3)" = 2? With prioritized choice, you make this explicit. If it successfully parses, then it has exactly one choice and therefore no ambiguities.

The last really interesting aspect of OMeta's design is what it does when it matches a pattern. A pattern might consist of several sub-components. Each of these components (e.g. rules) can be captured into a variable. OMeta then feeds these variables into a "host" language.

OMeta doesn't care what the host language is. It can be just about any language (e.g. JavaScript, LISP, C++, etc.). Typically, the host language is what the OMeta implementation itself was written in. The host language is free to do whatever it likes with the variables. OMeta is only responsible for getting them there. The OMeta paper describes an implementation where the host language is cola. Other early versions were written in Squeak and use that as the host language.

Alessandro has recently created a JavaScript version where you experiment with OMeta/JS (e.g. JavaScript is the host language) in your browser. It's a very interesting playground. Some of the demonstrations range from the basic calculator language, to a type checker, to LISP, to even the OMeta compiler and code generator itself and all of them are written in OMeta/JS. The playground website is neat to use because it shows the translated (generated) JavaScript code for the grammar (although the generated code needs to be put through a pretty printer to make any sense of it).

Why the "#" in OMeta#?

After understanding the basics of OMeta/JS, I thought it would be interesting to have the host language be any .NET language. I started to sketch out what this might look like by using my limited JavaScript ability and translating Alessandro's code to an equivalent functionality in C#. The result was the start of a project that needed a name. "OMeta#" seemed to fit the bill.

Besides, it just sounded better than other .NET-ism alternatives like "NOMeta", "OMeta.net", or "IronOMeta."

Some might ask, "Why .NET?" A subtle reason is that OMeta is useful for code generation. Code generation is a normal part of the .NET ecosystem even if people don't realize it. When people use the graphical designer in Windows Forms, there is a code generator working behind the scenes writing code for their host language (typically C# or Visual Basic). When people write ASP.NET ".aspx" files, there is a hidden process that takes that HTML-like "language" and compiles it to a real DLL.

For example, ASP.NET transforms this:

Into this generated code:

I wanted to bring a similar "it just works" illusion to the .NET world with OMeta#.

With the encouragement of Alessandro, I started working on some of the supporting "goo" that would be required to get OMeta working on the .NET's Common Language Runtime (CLR). This isn't trivial because the CLR has stricter type system than languages like JavaScript. Also, since I chose C# as the implementation language, I had to make sure that the code passed all of the static type checking it performs at compile time. It's been a slow start, but I have been able to make a few steps towards making it work.

Ultimately, all of those tools are pretty good, but none of them did exactly what I was looking for. Some came close, but none of them seem capable of matching OMeta's power and simplicity.

Another reason was that I knew that writing an implementation of OMeta from the ground up would really help me "know" how it works.

More importantly, I thought it would be fun to work on. I'm not a language expert. I'll never reach the level of someone like Anders, but I was one of those odd kids that actually liked my compilers and programming languages classes. I also enjoyed the talks from this year's Lang.NET symposium (even if I didn't understand a lot of things).

The Grand OMeta# Vision

The grand vision for OMeta# is to make it really simple to create languages that "just work." It probably won't have the world's best performance, but it should be decent enough for almost all practical applications. It would be neat to have OMeta# (or something like it) be used to do the following:

Serve as a common, unified way for developing languages on the .NET platform — There are many helpful, yet often different "tools" out there for various stages of language design. I mentioned a few earlier, but some of the more advanced ones include the new Dynamic Language Runtime (DLR) and Phoenix compiler framework. It would be neat to have a language that abstracted away some of these tools into a clean syntax like OMeta. That way, you'd get the benefit of these tools without having to constantly learn different ways of manipulating the concepts that can be expressed simply.

Factor out the "goo" of language development — For a language to be taken seriously, it has to have things like syntax highlighting, interactive debugging, and a decent compiler. Each of these takes time to implement and often involves a high learning curve. It would be great if you could just make some minor annotations to your OMeta grammar and then these features would "just work."

Useful for teaching — One of my favorite textbooks in college was "Programming Languages: An Interpreter-Based Approach" by Ramsey (et. al.). Our class used a "draft" version of it (and it still hasn't been published yet). It took a very interesting approach towards teaching programming languages. In order to understand a language like Scheme, you had to implement an interpreter for it in C or ML. I really liked this approach because getting your hands dirty with implementing the language typically leads to a better understanding of it. I would love to have the ability to implement this type of approach in something like OMeta. You could understand the essence of a language by only having to understand a page or two of code. It would also let you learn by experimenting with adding small features to existing languages.

Practical for production use — For anything to really succeed at moving towards a Moore's Law increase in expressiveness, it has to have reasonable quality. I can imagine OMeta# being introduced into real production projects by first integrating a small amount of OMeta's "generated" a project and then over time implementing more and more parts of the application in this highly specific language. For this to occur, it has to be reliable and simple to use.

When? Where?

All of the parts of the vision are still in the future. To make them a reality will require more than just a single person or even a couple of people. I'm naïve crazy enough to think that it's actually possible. In an effort not to "go dark," I've decided to be fairly open about the design and implementation of OMeta#. The current state of the code isn't even alpha quality yet. Despite all of these shortcomings, I've posted what I have so far on CodePlex and will continue to make updates to it.

Keeping in the tradition of other OMeta implementations, I've licensed it under the MIT License so that you can use, experiment, play, and modify it without any licensing concerns.

The first interesting result is that I've been able to get a complete end to end demo working for an extremely simple grammar that matches numbers. I fed this input grammar written in OMeta/C#:

There is still a ton of work left. It's quite possible that some of the design decisions I've made so far need to be challenged and redesigned. I'd love to get feedback from anyone that's interested.

What do you think? Do you think OMeta fills a unique niche? Do you think it would be useful to have as a tool in .NET? Would you want to help out making "The Grand OMeta# Vision" happen? If so, what parts can you help out with?

I'm curious to hear your thoughts. In my next post, I hope to cover some of the technical details for how my current implementation works "under the covers."

P.S. Many thanks to Alessandro Warth for creating this beautiful language and helping me get started!

26 comments:

Cory
said...

One paragraph before you mentioned OMeta#, my thoughts were "A .NET implementation would be really useful."IMHO, OMeta and the related work at VPRI is the most interesting thing happening in computer science right now. Both the publicity and the code you are generating will help the project a great deal. Keep it up!

This looks awesome. I've recently implemented a parser for allowing users to control my application using sms, and I have ended up with an insane mishmash of extension methods and classes, when all I wanted was to feed a nicely defined grammar into something.

While I was doing it, I realised that by writing the requirements out in a structured version of english, all the information required to generate the c# for my application layer existed, just I didn't have a tool to do the conversion. This looks like it could be the tool.

This is great news! I have already downloaded the source and am looking at it now.

I only recently became aware of the work being done at VPRI, and like Cory, believe that this work is probably the most interesting thing happening in computer science at this time. While we are a long way from having a full working App-to-hardware system in < 20K lines of code, having SOME of that technology available on .NET will be very nice.

To Anonymous and the comment about nemerle: nemerle is a very nice language but the goal of OMeta and the larger VPRI effort is to move the entire computing stack to a higher level of abstraction than is available with current languages, not just extend them with new syntax although that is a side-effect. Take a look at some of the documents available at http://www.vpri.org/html/writings.htm like http://www.vpri.org/pdf/steps_TR-2007-008.pdf where they describe a TCP/IP stack implemented in less than 200 lines of code, or a version of Prolog in about 90 lines of code. The ability to express solutions in such very different problem domains in so few lines of code is not something that can be done in any other language that I am aware of.

Boo author Rodrigo B. De Oliveira is hard at work writing a PEG system for Boo (a "wrist friendly python inspired .net language"), which will then be used to re-write Boo's parser as a PEG. Boo-written-in-boo has been a long term desire for the Boo project (MetaBoo), and PEGs will almost certainly be the tool used to make this happen.

@Jason, that DLR article at bitwisemag you linked uses ANTLR, like almost every other project that needs a parser on the planet. PEGs are usually written in the language you are trying to extend. For example, Boo uses ANTLR, and is hoping to self-host its parser by switching to a PEG implementation written in Boo.

cory: Thanks! I agree, the Viewpoints guys are doing some awesome stuff. My hope is that even bringing a tiny fraction of their ideas to the code I work with daily would be a huge gain.

anonymous: Nemerle definitely lookings interesting. It seems to have support for some nice Lisp-like things such as macros. However, it still seems a bit too much heavy on the syntax side to do what OMeta can do in a clear fashion. Maybe I'm missing something though?

harry mcintyre: OMeta isn't a silver bullet, but I do think that it does pattern matching really well and could help you get going on a domain specific language quickly.

mwatts: My hope is that as a community we can work together to get the vision of OMeta# up on .NET to continue to make steps towards the VPRI goals.

jason: As mentioned by rektide, the ANTLR approach is the traditional way of doing things. It works, but tends to (in my experience at least), produce solutions that aren't reusable or composable. At least not nearly in the way that OMeta would allow.

eric: Thanks for the kind words!

anonymous: OMeta doesn't try to be LISP. However, as demonstrated by http://jarrett.cs.ucla.edu/ometa-js/#Lisp , it does make it possible to *implement* the language in roughly the half page of mathematics that McCarthy used to describe it in http://www.softwarepreservation.org/projects/LISP/book/LISP%201.5%20Programmers%20Manual.pdf

I agree that a lot of cool things happened in the 1960's. As a matter of fact, OMeta's roots go back to Meta II which was developed in that era. However, it was not as fleshed out as the VPRI guys have done so. I think the OMeta vision is helping us get back to the excitement that was there in CS in the 60's.

rektide: I hadn't heard about MetaBoo, but agree that PEG is a nice way to express it. Thanks for including the links as well. I'd like to get OMeta# to the point where you could use Boo as a host language some day. I think that way would allow the best of both worlds.

Someone asked me in an email if the reverse function I demonstrated would be really slow or cause a stack overflow. This is a good question, especially coming from a traditional language like C++. However, the good news is that it will be just as fast as a for loop since functional language like ML do tail call optimizations. For more information, check out this link.

alexey: Great catch! You're right, I said it was tail recursive, but I was absolutely wrong for the reason you mentioned. I need to make sure that the last call is to itself. I made an assumption without taking a second to think about it. Oops :(

An updated function would need to pass in the result so far as an accumulator. How about something like

That way the last call in the function is to itself and thus can be tail recursive. We assume that the accumulator (acc) is reversed. Then, we pop the head off of what we need to reverse "x" and prepend it to the already reversed part. The initial conditions would be "fastReverse(myList, [])"

have you heard of pymeta?it's a python implementation of ometa .i could probably run on iron python , and by doing so , offer ometa on the clr.http://washort.twistedmatrix.com/2008/04/pymeta-is-more-than-just-parsing.html

anonymous: I've heard of PyMeta and meant to link to PyMeta's How and Why explanation because the author had the same sort of feeling to defend why the world needs another way of expressing a language like I did.

I think PyMeta is definitely neat, but the aims are different from OMeta#. PyMeta seems to take a string and do a parse whereas I'd like to make OMeta# more of a front end compiler and simplified language studio (with debugger support, syntax highlighting, etc.)

All that said however, let me know if you have success getting it up on IronPython.

i have tried to work with pymeta on ironpython . could'nt import the string library. it's hard for me to estimate how much work it would take to port it to ironpython , since i have no experience in ironpython.

regarding debugging: since the idea of dsl is to enable writing very little code , the basic debugging model of scripting languages (commandshell +print instructions ) could be pretty usefull and relativly simple to implement.

anonymous: Hopefully we'll get to the level of interactive debugging like that of Visual Studio with C#, but I'm sure it's a lot of work. Interesting on getting pymeta up on python.net. Let me know if you build some languages with it.

leppie: First off, I want to say that I think your IronScheme project is really interesting and I have a lot to learn from it.

All I know about macro hygiene is what I just read about it on Wikipedia. However, from what I understand, I think OMeta as a language handles this in a sort of straightforward way.

The language is grammar based. Grammars can inherit from other grammars. The main unit of work is the rule. Rule names must be unique in an inheritance chain (although overriding is freely permitted). The important thing from a hygiene perspective is that rules can call out to foreign grammars. Those foreign grammars can then use rules with the same name to mean entirely different things. This is specifically mentioned in the 2007 paper. It's not perfect, but it's similar to how people think in most mainstream languages like C# and Java.

I can't speak for Nemerle because I don't have experience with it. Maybe someone else can?

Harry Pierson: Thanks for stopping by! I checked out your Lang.net presentation on PEG's in F# (and linked to it in this post) as we share similar excitement about them.

You're right that it seems a bit odd for choosing C# and not something more language friendly like F#. I did this intentionally for a few reasons. The main reason was to eliminate dependencies. I didn't want to require a F# (or SML/ML or the like) library DLL or requiring someone to download the F# compiler. Another reason is that I'm not as good at writing code in F# as I am in C#. This is a poor reason, but true for now.

One subtle reason that I alluded to in this post is that by doing it in C#, it makes it easier to "sneak" it into the build process of real projects if it ever takes off. That is because most people have C# compilers in their build process. By compiling just C# code and not having to introduce something new, it makes it that much easier.

However, I have planned from the beginning to create the possibility of code emitters be in any language, including F# and Boo. F# has an amazing ability to handle trees and thus could be of use in some more intense phases. The point is that code emiters will just have to implement some known interface. I'd like to have one for common languages like VB.NET as well.

I'm sure OMeta/JS could probably get up and running on .NET with the new Managed JavaScript compiler being built with the DLR. However, it'd be limited to JavaScript's type system. The point with OMeta# is that it uses native .NET 2.0 (generic) data structures that can be used in a wide array of languages and thus interoperate nicely across the .net/clr platform (including the dynamic languages).

As reflected in my check-ins so far, I'm still in the experimenting phase. The design is really rough and has some bad code smells. I'd love it if smart guys like yourself wanted to help out and offer constructive criticism and suggestions on the design :) There is a ton of work in order to achieve the vision.