I am considering writing a series of articles on compiler development using Spirit-2. It will be based on the series of BoostCon talks from 07 to 09. From the humblest calculator to a full blown programming language. Details are still sketchy at this point. All I can ascertain is that the final language shall use LLVM as a back-end. There are lots of questions I need to ask in order to get the bases covered. I’d like to solicit feedback and comments. Will it be an imperative language like C or a functional language like Scheme? I love the line oriented syntax of Python, but is a free format syntax with the all too familiar braces or begin/end blocks be better? Will it be statically typed like C++ or dynamically typed like Python or LISP? Will it be OOP? How complex or simple should it be? Remember, this is meant for instruction. Is a toy language good enough? How about basing it on a simple ISO-standard language like Pascal? Or fun languages like Javascript? How about features? Type inference? Lambda? Operator overloading? Etc., etc., etc.

Let me know what you think…

Share this:

63 Responses to “Build a Compiler”

I’d like to see the back end target a real processor. Perhaps an 8 bit microprocessor like a 6502 or a 6809 or something more recent. (Although 6502/6809 are so well documented on the web that they are still useful as an example of a real processor.)

Also, presumably the point of this exercise is to teach about Spirit and how to use it to implement a language, not teach language design or theory. So stick with a well known language with well known semantics. Pascal or TinyC would be my choice. They are sufficiently rich to demonstrate the necessary principles, but not so esoteric that people won’t relate to the language and the compiler implementation. They are also sufficiently simple that you don’t lose people in the scope of the full implementation.

Re: Targeting a real processor, I’d like to focus more on the front end. LLVM is an easy target, is modern, and can target a multitude of processors. In this day and age, you no longer directly target a specific processor. The back end itself is a whole new ballgame with all the intricacies (e.g. JIT, optimizations, etc.). In as much as you use a parser like Spirit to do your front end parsing, the best practice today, IMO, is to use another tool or library to do the back end. So, bottom line: I’d rather stick to the plan and use LLVM. Hey, it’s the 21st century! 🙂

I have just spent some time reading through http://www.llvm.org to learn about LLVM. At first I was under the misimpression (inferred from its name) that it was just a virtual machine and not a native code generator. Now I know better! I’m glad I bothered to read about it because I’m working on an OSS fractal generator that has a native language for expressing various fractal iterating formulas. My intention was to rewrite the parser using Spirit. Now I see that I can get an efficient back end using LLVM, so thanks for bringing that to my attention.

What level of compiler writing knowledge are you expecting of your reader? If next to none then I’d first do a statically typed reserved keyword language that is mostly context-free. A toy bigger than mini-c is probably the first compiler to write, a full language might be C89 [not the lib, the language itself ]is fairly simple, for a real language I am not that familiar with C99 as I was into C++ by then:) but as I recall C89 is mostly context free.

That will be fantastic. I am using spirit for parsing text data, but my ultimate goal is to have my own compiler for the mix of vb and javascript-like language that is used in one of the vertical industries. I am just a “user”, not a compiler specialist, and without specialized knowledge I think that such task is too big of a goal for me, but hopefully with the guidance of this planned series, I can do it.

Ideally this series will demonstrate a workflow of taking the source code and produce something that can be executed on desktop OS. Or linked with something to produce the executable.

It gets better than that. How about a compiled scripting language or DSL (Domain Specific Language), with the syntax of your choosing, that you can embed in your application and can be interpreted, JIT compiled, or compiled into tight, fast, highly optimized machine code. Interested?

What about Google’s language ( http://golang.org/ ). It is designed to be C-like, syntax is slightly different to make it fast and easy to parse (it does parse *fast*, would be fun to benchmark with a Spirit version), and the language is fully defined. No templates or anything sadly.

Personally I would prefer a Common LISP style language, but that is *too* easy to parse in Spirit. 😉

If Scheme were the target you would have a spec to work from; R5RS or 6.
A repl with an LLVM backend would be something useful, *not* a toy.
Numbers require a bit more parsing than most any other language.
Same with Identifiers.
Skipping: space, blank, custom? (cons . syntax)
A sexp is a tree -> AST.
Pretty printing with Karma?

What about Go though? I think if we made an embedded LISP compiler (easy to drop into programs to use as a compiled scripting language) would be pretty awesome though, we could even go on a tangent and make an Actor variant of LISP. :p

Go does seem useful though, a lot of people want it on LLVM, its spec is very well set, and it would get a lot of use out of it.

A sexp based language ala Clojure would also be pretty cool! I’m partial to common lisp but that is a serious undertaking if you want to conform to the standard which also has a lot of cruft that could/should be *snipped*. Besides there are already some very good compilers for common lisp such as sbcl/clozure/ecl/etc. Maybe a common lisp subset including reader macros… that would be very cool!

Right. A simple OOP language. I think this is becoming the consensus. Perhaps incremental evolution of that base language would yield something cool and never been seen before. A good strategy is to start something simple yet practically useful by itself (read: not a toy), then let people innovate.

That is what I am doing in my LISP’ish language. I is only similar to LISP in syntax and its macro and read-macro support, but mine is strongly typed, concurrent, etc… Even most of the function names are different (like really, car and cdr? Who uses ancient register names to handle lists anymore). Along with complete function overload based on types (with an auto (template like) and variant support, although now that I have seen Go I may have to alter it further). I do need to completely rewrite mine now that I learned more, so it would be fun. 🙂

For something with a defined Spec, I still think Go would be fascinating, and a *lot* of people out there want an llvmGo compiler.

Haskell is lovely. I look at Haskell programs like poetry. I don’t think it’s good for teaching compiler development though. Thoughts? I love Python too, but I am not quite fond of dynamic typing, to be honest.

Nor am I, which is why I have been making a strongly-statically-typed Lisp variant. Most bugs with dynamically typed programming languages are because of the dynamic typing, and everyone treats dynamically typed languages as static anyway, thus I see no point. I like a language where you can set a variable without needing to specify a type, based on what you assign to it (Google’s Go does that well for a C++’ish syntax).

I think Haskell is a fairly simple language to understand. It offers a lot of more complex possibilities but fundamentally, as a language it is easy to define and explain to others. And hence is a good language to target when teaching compiler development.
There is a plethora of examples of LISP or its variants, in general. a different functional language would be very welcome :).

As for Python, again, its easy to explain and hence easy to discuss teach how to write a compiler for.

As to language, something like C++, Scala, Haskell, or Perl (!) would be interesting but quite a lot of effort. But at the same time, I do think it would be nice if it were not a ‘toy’ language but instead something truly useful, because (ideally) it could eventually become a used-in-the-real-world compiler instead of just a learning exercise. Languages that come to mind that might be the right fit between too hard and too easy – Google’s Go, C89, Clojure, or SML, say.

I think using LLVM was the backend would be an excellent demonstration, since LLVM is a quite reasonable target for anyone actually implementing a new language.

How about the R programming language (http://www.r-project.org/)? It has the curly bracket syntax but is rather scheme-like in scope etc. It is a very commonly used language for data analysis/simulation that has been gaining a lot of traction in recent years. The interesting thing about this language is that its target market has very computationally intensive demands yet the implementation is a pure interpreter. If you could produce a repl with JIT and/or a AOT compiler for R with an llvm backend that would be marvelous!

I dealt with Lua for years, implemented it as a scripting language in a variety of things. I can truly say I hate it… If you want reasons why then just ask and I will do one of my normal very long posts as to why.

>It gets better than that. How about a compiled scripting language or DSL >(Domain Specific Language), with the syntax of your choosing, that you can >embed in your application and can be interpreted, JIT compiled, or compiled >into tight, fast, highly optimized machine code. Interested?

YES !!. This would be brilliant. It would help for the DSL to be able to support class-like definitions for arbitrary types, e.g.

Good interoperability and intermixability with C++ is always sadly lacking with embeddable languages. Boost.Python and such helps a lot with regards to Python, but it is *so* heavyweight, especially since there is no need to be when using LLVM. Hence why I have a very light intermixing and type-safe layer between my C++ and LLVM scripting language, we should do the same. 🙂

I would like to see a C++ compiler. It would be extremely useful to have something that is able to understand at least interfaces, so (for example) writing Mock object support for Boost Test could be supported by it. Current approaches with C++ (since we have no introspection and whatever else it is called in Java) are clumsy, hard to read, non-intuitive.

I would also like to be able to do an “interactive compiler”, something that would be able to support an IDE that could turn code into some sort of p-code, and support template metaprogramming debugging, and “magical” refactoring automation and so on. I guess it is quite close to what LLVM does, but I want it to work instantly as I type. So if I type:

std::string s;

I want it to tell me: hey, for this you need to #include , even if string is included by coincidence by another header (but string is not part of the interfaces specified by those headers, only the implementation, such as a member variable)…

ww said:
> I would like to see a C++ compiler. It would be extremely useful to have
> something that is able to understand at least interfaces, so (for example)
> writing Mock object support for Boost Test could be supported by it. Current
> approaches with C++ (since we have no introspection and whatever else it is
> called in Java) are clumsy, hard to read, non-intuitive.

However, a C++ compiler is one of the most difficult compilers you can write, and would not at all be any kind of a good intro into writing a full language.

ww said:
I would also like to be able to do an “interactive compiler”, something that
> would be able to support an IDE that could turn code into some sort of
> p-code, and support template metaprogramming debugging, and “magical”
> refactoring automation and so on. I guess it is quite close to what LLVM
> does, but I want it to work instantly as I type. So if I type:
>
>std::string s;
>
>I want it to tell me: hey, for this you need to #include , even if string is
> included by coincidence by another header (but string is not part of the
> interfaces specified by those headers, only the implementation, such as a
> member variable)…

That is pretty near impossible because an implementation file can potentially include files from anywhere on the entire filesystem, as well as it is quite possible for their to be multiple, non-compatible definitioins of something like std::string (unlikely for this specific one, but possible). Remember, in C/C++ includes files into other files, it does not import definitions like Java does, this gives C/C++ much more power, but makes it quite literally impossible to do what you ask.

We’re thinking about that as well (ISO EBNF). It would be nice to have it emit: 1) interpreter that can be immediately executable 2) LLVM parser code 3) C++ Spirit code. This would be solve the remaining issues with Spirit: 1) Hard to debug error messages and 2) Long compile times.

RAD spirit would be cool. A small language where you could connect a qi grammar and a karma grammar and have it typechecked for compatible attributes could make for a really nice and simple text transformation tool.

I’ve been working on an ISO EBNF parser. It works pretty good. I think I have all
of the syntax parsing correctly, including the alternate symbols (/ /) / ! etc.
I’ve got a visitor written that just writes the data structures back out as
semi-formatted EBNF.

I’ll nominate Java, since I’d really love to see a grammar I can then use to derive open-source source-to-source tools for taking code out of Java and several other java-like languages and into C/C++ without a tool-chain dependence on llvm or a java compiler/runtime.

Or +1 for C, and particularly for just updating Hartmut’s example from http://spirit.sourceforge.net/ as a full example of up-porting to 2.1 from 1.x. Separately, the tutorial aspects of walking a C ast, handling crufty bits in “real” languages, etc could be built on top of the updated base of parsing C.

I guess it comes back to the goal of your project. If the goal is to introduce Spirit by example, I think something like Scheme that is quite small yet fairly widely-known is not a bad idea.

On the other hand, something more ALGOL like could have more appeal, since a lot of people use those kinds of languages and could see familiar constructs. The advantage might be that learners could focus less on trying to figure out the language being implemented, and more on how Spirit is used to implement familiar constructs.

On the other other hand (am I out of hands, yet?) I don’t think anyone’s mentioned Forth. 😉

lol! I used to know a fellow (Brian Prentice) who had worked for Boeing, and he told me that whenever they got a new system the first thing he’d do was to implement a Forth for it, since as far as he was concerned that was the quickest way to bootstrap pretty much everything.

Lisp like languages are too easy to parse, therefore would not be a very good exercise in that area. My vote is for JavaScript. Building something on par with the other speedy JavaScript engines out there would be very interesting.