Love the world

Fan is a metaprogramming tool for OCaml. I believe it would be an
invaluable tool for the community when it’s production ready.

If you are attracted by the power of Camlp4 but frustrated by its
complexity or slowness, I would be very glad if you joined and contributed.

Fan is already ready to accept external contributions, since its
development will be pinned to OCaml 4.01.0 for a while.

1 Code layout

The source is spread across five directories: common, treeparser, src, cold, and unitest.

In the common directory, all sources are written in vanilla OCaml.
They define the basic primitives which dump Fan’s abstract syntax into
OCaml’s abstract syntax.

Note that all compiler-related modules are isolated in this
directory. So in the future, if we would like to support a
differently patched compiler, for example MetaOCaml or LexiFi’s mlfi
compiler, only this directory needs to be touched.

The treeparser directory is also written in vanilla OCaml;
it defines the runtime of the parsing structure.

The src directory is written in Fan’s syntax.
Keep in mind that Fan’s syntax is essentially the same as OCaml’s
syntax, except that it allows quotations and has a few other tiny
differences (for example, parentheses around a tuple are required, not optional).

The cold directory is a mirror of the src directory in vanilla
OCaml, though it is much more verbose than src.

So when you get the source tree, the initial build is typically

ocamlbuild cold/fan.native

This command builds a binary
composed of modules from common, treeparser and cold. Since
they are all written in vanilla OCaml, no preprocessors are
needed for the initial compilation.

If you are a third party user, that’s pretty much all you need to
know. As a developer of Fan, the next command is

./re cold fan.native

This shell script symlinks _build/cold/fan.native to _build/boot/fan, which is then used as the preprocessor when compiling src.

Now you can compile the src directory

ocamlbuild src/fan.native

The command ocamlbuild src/fan.native builds a binary
composed of modules from common, treeparser and src.

Note that at this point src is a mirror of cold, so after being preprocessed
by fan, the produced binary src/fan.native should be the same as cold/fan.native.

Now you can

./re src fan.native

But this does not make much sense since src/fan.native is exactly
the same as ./cold/fan.native.

Okay, in most cases your development will happen in these directories:
common, treeparser, src and unitest.

If you only touch common, treeparser or unitest, commit the changes
and send me a pull request.

If you also touch files in the src directory, you should mirror
those changes back to the cold directory. Here we go:

./snapshot

Now all the changes in src are mirrored back to the cold
directory.
For a simple change, commit and you are done. For a complex change, where you
are not sure whether it breaks anything, try running:

./hb fan.native

The command above first builds src/fan.native using the
current preprocessor _build/boot/fan.

When that is done, it removes the directories _build/src,
_build/common and _build/treeparser. Then it sets _build/boot/fan to the newly built preprocessor src/fan.native.
After that it calls ocamlbuild src/fan.native to build a new
preprocessor with the one just installed.

Then it compares the two preprocessors. If they are exactly the
same, we have a successful bootstrap, and there is a
good chance that your change is correct.

Then

make test

If everything goes well, it’s safe to commit now.

When the bootstrap fails, there are generally two cases: 1. The comparison
reports that the two preprocessors differ; the normal
workflow is to run ./hb fan.native again. 2. It
fails to compile; since you always have cold/fan.native compiled,
fall back to that preprocessor and see where you went wrong.
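
The fixpoint comparison at the heart of the bootstrap can be pictured in a few lines of vanilla OCaml. This is only a sketch: how ./hb really compares the two binaries is not shown in this post, and the names read_file and same_binary are made up for illustration.

```ocaml
(* Read a whole file as raw bytes. *)
let read_file (path : string) : string =
  let ic = open_in_bin path in
  let n = in_channel_length ic in
  let s = really_input_string ic n in
  close_in ic;
  s

(* The bootstrap has reached a fixpoint when the preprocessor used for
   the build and the preprocessor produced by the build are
   byte-for-byte identical.  (Hypothetical helper, not part of Fan.) *)
let same_binary (a : string) (b : string) : bool =
  read_file a = read_file b
```

Applied to the previous and the newly built preprocessor, a result of true corresponds to the successful bootstrap described above.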

In this post, we continue the discussion of syntactic meta-programming
from the last post.

My years of experience with different meta-programming systems (Common Lisp,
Template Haskell, Camlp4) tell me that quasi-quotation is the most
essential part of syntactic meta-programming. All three claim
to support quasi-quotation, but Template Haskell’s
quasi-quotation falls far behind that of Camlp4 or Common Lisp. For a
decent quasi-quotation system, first, nested quotation and
anti-quotation are necessary; second, as in Lisp, every part should be
quotable and antiquotable except keyword positions, that is to
say, each part of a code fragment can be parameterized.

For the notation, we write Ast^0 for the normal Ast, Ast^1 for the Ast
encoding Ast^0, and so on up to Ast^n.

So in this post, we discuss quasi-quotation first.

The implementation of quasi-quotation relies heavily on the
implementation of the compiler, so let’s limit the scope of how to get
quasi-quotation done to OCaml.

Let’s ignore the antiquote part and focus on the quote part first. The
essence of quasi-quotation is to encode the Ast using the Ast itself at
the meta level. There are different techniques for implementing
quasi-quotation; to my knowledge there are three, summarized here:

Raw String manipulation

This is the most intuitive way. Given a string input, the normal
way of parsing is to transform it into a parsetree,

val parse : string -> ast

To encode the meta-level ast, we can unparse it again,
assuming we have an unparsing function which unparses the ast

val unparse : ast -> string

so after composing parse and unparse, you have transformed a
string into the meta level

parse "3"            -->  `Int "3"
unparse (parse "3")  -->  "`Int \"3\""

Then you can parse again: after parse (unparse (parse "3")),
we have managed to lift the Ast to the meta level. There are serious
defects with this approach. First, it is very brittle, since we are doing
string manipulation at different levels. Second, after unparsing,
the location is totally lost, and location is one of the most tedious
but necessary parts of a practical meta-programming system. Third,
there is no easy way to integrate with antiquotation. This technique is
quite intuitive and easy to understand, but I don’t know of any
meta-system that does it this way, so feel free to tell me if you know
anyone who does similar work 😉
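
The first two steps of this round trip can be tried out with a toy Ast. This is a sketch under strong assumptions: the Ast has a single tag, and the parser only accepts integer literals, so the final re-parse (which needs a full parser) is left out.

```ocaml
(* Toy Ast with a single tag; a real Ast is of course much larger. *)
type ast = [ `Int of string ]

(* Toy parser: accepts integer literals only. *)
let parse (s : string) : ast =
  match int_of_string s with
  | _ -> `Int s
  | exception Failure _ -> failwith ("parse error: " ^ s)

(* Print the Ast as the OCaml source text that would rebuild it.
   %S prints the string with quotes and escapes, as OCaml would read it. *)
let unparse (`Int s : ast) : string =
  Printf.sprintf "`Int %S" s
```

Note that unparse has no way to carry a location along, which is the second defect mentioned above.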

Maintaining different parsers

Unlike string manipulation, this approach writes different parsers for
different actions. Suppose we are in OCaml and want to support
quasi-quotations in such syntax categories

And you want quasi-quotations to appear in both expr and patt
positions; then you have to write 14 x (2+1) parsers. The parsers are
not reusable: if you want to support overloaded quotations (I
will talk about them later), you have to roll your own parser
again. Writing a parser is not hard, but it’s not fun either, and
keeping different parsers in sync is a nightmare.

To make things worse, once anti-quotation is considered, for each
category there are three parsers to write, and anti-quotation makes
them slightly different. To be honest, this way is impractical.

Ast Lifting

Another mechanism for quasi-quotation is to imagine that we have a
powerful function:

val meta : ast^0 -> ast^1

This seems like magic, but it is possible even though OCaml does not
have generic programming support, since we have the definition
of ast.

The problem with this technique is that it requires an explicit Ant tag in the ast representation, since at the ast^0 level you
have to store Ant as an intermediate node, which is removed when
the meta function is applied.

(Location at the meta level is ignored here.)
If we want to share the same grammar between the Ast^n (n=0,1,2,...),
Ast lifting (a function of type Ast^0 -> Ast^1) is necessary.
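
A minimal sketch of such a meta function, on a made-up miniature Ast, may make the role of the Ant tag clearer. The tag names follow the ones used in these posts, but the type itself and the curried encoding of a pair are assumptions made to keep the example short; this is not Fan’s actual definition.

```ocaml
(* A hypothetical miniature Ast.  `Ant marks an antiquotation hole and
   only exists before lifting. *)
type ast =
  [ `Int of string
  | `Str of string
  | `Lid of string          (* lowercase identifier *)
  | `Vrn of string          (* variant tag, e.g. `Int *)
  | `App of ast * ast       (* application *)
  | `Ant of string ]        (* $x, a hole named after a variable *)

(* meta : ast^0 -> ast^1.  Every node is rebuilt as the Ast of the code
   that would construct it.  The one exception is `Ant: it is removed
   and replaced by a plain reference to the variable it names.
   To stay free of tuple nodes, the two arguments of `App are encoded
   here as two successive applications. *)
let rec meta : ast -> ast = function
  | `Int s -> `App (`Vrn "Int", `Str s)
  | `Str s -> `App (`Vrn "Str", `Str s)
  | `Lid s -> `App (`Vrn "Lid", `Str s)
  | `Vrn s -> `App (`Vrn "Vrn", `Str s)
  | `App (f, x) -> `App (`App (`Vrn "App", meta f), meta x)
  | `Ant x -> `Lid x
```

Lifting the quoted fragment f $x keeps the structure of the application but leaves x as an ordinary variable, to be filled in when the generated code runs.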

Summary

We have seen three techniques for
quasi-quotation; Fan adopts the third one. Given that choice,
let’s discuss what kind of Ast representation we need to
make life easier.

As we discussed previously, introducing records into the Abstract Syntax
brings in unnecessary complexity when you want to encode the Ast
using the Ast itself, since you have to express the records at the
meta level as well.

Another defect of the current Parsetree is that it was designed without
meta-programming in mind, so it does not provide an Ant tag in all
syntax categories. So at the zero stage, Ast^0, you can not have an
Ast node $x in the outermost position, since it is semantically incorrect in Ast^0, but syntactically correct in Ast^n (n=0,1,2,...).

The third defect of the Parsetree is that it is quite irregular,
so you can not do any meta-programming on the parsetree itself, for
example stripping all the locations from the Ast nodes to derive a new
type without locations, or deriving a new type without anti-quotation tags (we
will see that this ability is quite important in Fan).

The fourth defect is more serious from the point of view of
semantics: in OCaml there is no way to express an absolute path, and
when you do Ast lifting, the time at which you define the lifting is
different from the time at which you use the quotations.

Camlp4’s Ast is slightly better than Parsetree, since it does not
introduce records to increase the complexity.

However, Camlp4’s Ast can not express absolute paths either, which
makes the semantics imprecise. Another serious implementation
defect is that it encodes anti-quotation using two
techniques at once: an explicit Ant tag, and string mangling that prefixes
the string with \\$:. Camlp4’s tag names are also not
meaningful at all.

Think a bit further about syntactic meta-programming: what we
really care about is purely syntax. Int "3" should not be different whether it is of type expr or patt. If we take the
location of an ast node, we should not care whether its type is expr or patt or str_item, right?

If we compose two ast nodes using the semi syntax ;, we really don’t
care whether they are expr nodes or patt nodes

let sem a b = {| $a; $b |}

The code above should work in any syntax category, as
long as it supports the `Sem tag.
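
In vanilla OCaml, without quotations, the same idea can be sketched with polymorphic variants. The two types below are cut-down stand-ins invented for this example, not Fan’s real expr and patt.

```ocaml
(* Two hypothetical syntax categories that both support the `Sem tag. *)
type expr = [ `Int of string | `Sem of expr * expr ]
type patt = [ `Lid of string | `Sem of patt * patt ]

(* One definition, inferred as  'a -> 'b -> [> `Sem of 'a * 'b ],
   so it is not tied to any particular category. *)
let sem a b = `Sem (a, b)

(* The same function used at both types. *)
let e : expr = sem (`Int "1") (`Int "2")
let p : patt = sem (`Lid "x") (`Lid "y")
```

With ordinary sum types, sem would have to be written once per category, because each constructor belongs to exactly one type.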

Changing the underlying representation of the Ast means that none of the existing
code in the Camlp4 engine can be reused, since its quotation kit no
longer applies in Fan. But the tough old days are already gone: Fan
already provides the whole quotation kit from scratch. In
the next post we will talk about the underlying Ast using polymorphic
variants in Fan, and argue why it’s the right direction.

Thanks for reading! (btw, there’s a bug in Emacs org/blog, sorry for posting several times)

There are some interesting discussions on the wg-camlp4 mailing list. I wrote a long mail yesterday; I cleaned it up a bit and pasted it here.

———

I rewrote the whole of Camlp4 (named Fan) from scratch, building the quotation kit and throwing away the crappy grammar parser, so please believe me that I do understand the whole technology stack of Camlp4. If we could reach some consensus, I would be happy to hand over the maintenance of Fan. Fan does not lose any feature compared with Camlp4; in fact it has more interesting features.

Let’s begin with some easy, not too technical parts which nevertheless have a significant effect on user experience:

1. Performance

Performance does matter. It’s a shame that most of the time spent compiling the OCaml compiler is dedicated to Camlp4, but it is an engineering problem: currently compiling Fan takes less than 20s, and it can be improved further.

2. Building issues

The design of having side effects by dynamic loading is generally a bad idea. In Fan, dynamic loading only registers some functionality that Fan supports; it does not have any other side effect. Each file stands alone and says which (ppx , or filters, or syntax) it wants to use, with a good default option. So the build is always something like ‘-pp fan plugin1 plugin2 plugin3’, and the order of plugins does not matter. Also, loading all the plugins you have does not have any side effect; even better, you can statically link all the plugins you collected, so the build process is simplified.

3. Grammar Extension (Language namespace)

I concur that arbitrary grammar extension is a bad idea, and I agree with Gabriel that so far only the quotation (here quotation means delimited DSL, quasi-quotation means Lisp style macros) is modular and composable. I also agree with Gabriel that -ppx should not be used to do syntax overriding (this should not be called syntax extension actually). Syntax overriding is a terrible idea, since the user never understands what’s going on underneath without reading the Makefile. So my suggestion is that some really convenient syntax extensions, e.g. (let try.. in), should be merged into the built-in parser. Quotations do not bring too much heavy syntax (imho). In Fan, we proposed the concept of a hierarchical language namespace, since once quotations are heavily used, it’s really easy to introduce conflicts. Querying the language namespace is exactly like the Java package namespace: you can import, or close an import, to save some typing.

Here is a taste

———————————————————————————————–

{:.Fan.Lang.Meta.expr| a + b |} ——>

`App ((`App ((`Id (`Lid "+")), (`Id (`Lid "a")))), (`Id (`Lid "b")))

{:.Fan.Lang.Meta.N.expr| a + b |} —–>

`App
  (_loc,
    (`App
       (_loc, (`Id (_loc, (`Lid (_loc, "+")))),
         (`Id (_loc, (`Lid (_loc, "a")))))),
    (`Id (_loc, (`Lid (_loc, "b")))))

———————————————————————————————–

In .Fan.Lang.Meta.expr, the first ‘.’ means the name is resolved from the absolute namespace; N.expr shares exactly the same syntax, but without locations.
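
How such a lookup could behave can be sketched in a few lines. Everything here is hypothetical (the registry contents, import and resolve are names invented for the example); it only illustrates the absolute versus imported resolution described above, not Fan’s implementation.

```ocaml
(* Hypothetical set of registered quotation languages, by full name. *)
let registered = [ "Fan.Lang.Meta.expr"; "Fan.Lang.Meta.N.expr" ]

(* Prefixes opened by an import, most recent first. *)
let imports : string list ref = ref []
let import (prefix : string) : unit = imports := prefix :: !imports

(* A leading '.' marks an absolute name.  Any other name is tried
   against each imported prefix in turn. *)
let resolve (name : string) : string option =
  let lookup full = if List.mem full registered then Some full else None in
  if name <> "" && name.[0] = '.' then
    lookup (String.sub name 1 (String.length name - 1))
  else
    List.fold_left
      (fun found prefix ->
         match found with
         | Some _ -> found
         | None -> lookup (prefix ^ "." ^ name))
      None !imports
```

After import "Fan.Lang.Meta", both expr and N.expr resolve without the long prefix, while the absolute form keeps working regardless of what has been imported.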

I am pretty sure this is easy to do in Fan: only the Ast2pt part (dumping the intermediate Ast into Parsetree) needs to be changed for different compilers.

—————————————————————————————————————-

Now let’s talk about some internal parts of SMP.

Quasi-quotation is the essential part of SMP. I am surprised that so far the discussion has silently ignored quasi-quotation; Leo’s answer of writing three parsers is neither satisfying nor practical (imho).

Camlp4 is mainly composed of two parts: one is the extensible parser, and the other significant part is Ast lifting. Since we all agree that the extensible parser increases the complexity too much, let’s simply ignore that part.

Ast lifting is tightly coupled with the design of the Abstract Syntax Tree. People complain that the Camlp4 Ast is hard to learn and that using quasi-quotation for pattern matching is a bad idea.

Let me explain the topic a bit:

Camlp4Ast is hard to learn, I agree; it has some alien names whose meaning nobody understands. Quasi-quotation is definitely a great idea to boost meta-programming. My experience is that for very, very small Ast fragments you should use the Abstract Syntax Tree directly; otherwise quasi-quotation is a life saver for meta-programming.

Luckily the quotation kit has nothing to do with the parser part. It is simply several functions (I simplify a bit) which turn a normal runtime value into an Ast node generically. Such functions are neither easy to write nor easy to read; the ideal case is that they are generated once and for all, and that they are derived automatically for all the data types in normal OCaml (some ADTs containing functions can not be derived). I bet it would most likely be a nightmare to maintain 3 parsers for the OCaml grammar while the two other parsers dump to a meta level.
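
To give an idea of what “functions which turn a runtime value into an Ast node” look like, here is a hand-written sketch for a few standard types. The target type and all function names are assumptions for illustration (in particular the `Uid tag for ordinary constructors); a real quotation kit would derive such functions automatically.

```ocaml
(* Hypothetical target Ast for the lifted values. *)
type ast =
  [ `Int of string
  | `Str of string
  | `Uid of string          (* ordinary constructor, e.g. Some *)
  | `App of ast * ast
  | `Tup of ast
  | `Com of ast * ast ]

let meta_int (i : int) : ast = `Int (string_of_int i)
let meta_string (s : string) : ast = `Str s

(* Lifting a parameterized type takes the lifting of its parameter. *)
let meta_option (f : 'a -> ast) (o : 'a option) : ast =
  match o with
  | None -> `Uid "None"
  | Some x -> `App (`Uid "Some", f x)

let rec meta_list (f : 'a -> ast) (l : 'a list) : ast =
  match l with
  | [] -> `Uid "[]"
  | x :: xs -> `App (`Uid "::", `Tup (`Com (f x, meta_list f xs)))
```

Each function follows the shape of its type mechanically, which is why generating them once and for all is realistic, and why a type containing functions can not be handled this way.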

So, how do we make Ast lifting easier?

The first guideline is “Don’t mix in records”.

Once you encode the AST with records, you have to encode the records at the meta level, which increases the complexity without bringing any new features; it’s simply not worthwhile.

The second guideline is “Don’t do any syntax desugaring”. Syntax desugaring makes the semantics of syntactic meta-programming a bit weird. Syntax desugaring happens everywhere in Parsetree; think about list literals, which are desugared. If you don’t use any syntax desugaring and want to match, for example, a bigarray access, you simply need to match `Bigarray(..) instead of its desugared form.

This not only helps the user, it also helps meta-programming over types to derive some utility types. Take a look at my Ast encoding in Fan https://github.com/bobzhang/Fan/blob/master/src/Ast.ml (it needs to be polished, please don’t panic when you see the variants I use here)

The initial Ast has locations and antiquotation support, but from it we derive 3 other Asts thanks to my very regular design. AstN is the Ast without locations (locations are important, but they are not much help when you only do code generation, and they complicate the expanded code a lot). AstA is the Ast without antiquotations (simply remove the Ant branch); it is a subtype of Ast (thanks to the choice of variants here). AstNA is the Ast with neither locations nor antiquotations; it is a subtype of AstN. In practice, I found the Ast without locations particularly helpful when you only do code generation; it simplifies this part significantly. The beautiful part is that all four Asts share the same grammar and the same quasi-quotation mechanism, as I showed with .Fan.Lang.N.expr and .Fan.Lang.expr
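
The relationship between these Asts can be sketched on a three-tag toy version. The types below are invented for the example (and loc is just an int standing in for a real location); only the idea of deriving the variants and the subtyping between them is taken from the text.

```ocaml
type loc = int  (* stand-in for a real source location *)

(* Toy Ast: with locations and antiquotations. *)
type ast =
  [ `Int of loc * string | `App of loc * ast * ast | `Ant of loc * string ]

(* Toy AstA: the `Ant branch removed. *)
type ast_a = [ `Int of loc * string | `App of loc * ast_a * ast_a ]

(* Toy AstN: locations removed. *)
type ast_n = [ `Int of string | `App of ast_n * ast_n | `Ant of string ]

(* ast_a is a subtype of ast: a coercion is enough, no code runs. *)
let widen (x : ast_a) : ast = (x : ast_a :> ast)

(* Dropping locations is a plain structural traversal. *)
let rec strip_loc : ast -> ast_n = function
  | `Int (_, s) -> `Int s
  | `App (_, f, x) -> `App (strip_loc f, strip_loc x)
  | `Ant (_, s) -> `Ant s
```

Because the tags are shared, a value built as an ast_a can flow into any function expecting an ast, which would need an explicit conversion function with ordinary sum types.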

I don’t know how many parsers you would have to maintain to reach such a goal, or whether it would ever happen.

Using variants to encode the intermediate Ast has a lot of other benefits, but I don’t want to cover them in such a short mail.

So, my proposal is that the community designs an intermediate Ast together, and writes a built-in parser to this intermediate Ast, which is then dumped to Parsetree. I also think Parsetree still needs to be cleaned up a bit, but without too much change. I would appreciate it if you could take something away from Fan; I think Parsetree is not the ideal place to do SMP. HTH

I should have written this blog post long ago, but I am so addicted to Fan that I didn’t have time to write it; programming is much more fun than blogging.

Anyway, better late than never, XD.

What’s syntactic meta programming?

What’s meta programming?

Meta programming is an interesting but also challenging domain; the essential idea is “program as data”. Wait, you may object that in the Von Neumann architecture a program is always data. So, to be more precise, meta programming is “program as structured data”, where the structured data should be easy to manipulate and generate. Think about Lisp: since it does not have any concrete syntax, its programs are always S-expressions, a hierarchical data structure which is easy to manipulate and process.

Meta-program at different layers

When you write a compiler, the program has different representations at different stages; think about the OCaml compiler workflow

So, at different stages, the program as structured data can be processed in different ways.

You can insert plugins at each level. For example, C macros mainly do token stream transformation, but the problem with a token stream is that it is not structured data.

The earlier the stage at which you do the transformation, the easier it is to map back to your original source program; the later the stage, the more program analysis the compiler has done, but the harder it is to map back to the original program. So each stage has its use case.

Here we only talk about syntactic meta programming (SMP), where the layer is the parsetree, also called Abstract Syntax, and we only talk about the host language OCaml (OCaml is really a great language, you should give it a try!), but some high level design choices should apply to other host languages as well.

The essential part of SMP

I suggest that anyone who is interested in SMP learn Common Lisp; there are so many brilliant ideas there that have been forgotten by people outside the community. Two books are really fun: one is On Lisp, the other is Let Over Lambda.

The essential part of SMP is quasi-quotation. There is a nice paper that introduces the benefits of quasi-quotation: Quasiquotation in Lisp.

Here we only scratch its surface a tiny bit: “Quasiquotation is a parameterized version of ordinary quotation, where instead of specifying a value exactly, some holes are left to be filled in later. A quasiquotation is a template.” Briefly, quasi-quotation gives you the ability to abstract over code.

As the paper said, a typical use of quasiquotation in a macro definition looks like

(defmacro (push expr var)
`(set! ,var (cons ,expr ,var)))

Here the “`” introduces a quasi-quotation, and “,” introduces a parameter (we also call it anti-quote). There are a number of languages outside the Lisp family which support quasiquotation, but none of them come even close to Lisp.
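
Read as a template with two holes, the macro body is just a function from code to code. A rough OCaml rendering, using a made-up two-constructor S-expression type, could look like this (sexp, push and to_string are names chosen for the sketch):

```ocaml
(* A minimal S-expression type. *)
type sexp = Atom of string | Node of sexp list

(* The template `(set! ,var (cons ,expr ,var)) written as a function:
   its two parameters are exactly the two antiquoted holes. *)
let push (expr : sexp) (var : sexp) : sexp =
  Node [ Atom "set!"; var; Node [ Atom "cons"; expr; var ] ]

(* Print an S-expression back in Lisp notation. *)
let rec to_string : sexp -> string = function
  | Atom s -> s
  | Node xs -> "(" ^ String.concat " " (List.map to_string xs) ^ ")"
```

Without quasi-quotation the template has to be spelled out constructor by constructor, which is exactly the noise that the backquote notation removes.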

One challenging part lies not in the quote part but in the anti-quote part. In Lisp, you can antiquote everywhere. Suppose you are writing Template Haskell: you can not write something like this

[| import $module |]

Lisp allows very fine-grained quasi-quotation.

The other challenging part is nested quasi-quotation. Since a meta-program is itself a normal program, when you do a lot of meta programming in Common Lisp, you will find that you write a lot of duplicated meta-programs; here nested quasi-quotation comes to the rescue.

Discussing nested quasi-quotation may go beyond the scope of this first blog post, but you can have a taste here

Some defects in Lisp Style Macros

Though I really enjoyed Lisp macros, to be honest, the S-expression as concrete syntax is not the optimal way to express ideas.

For the extreme flexibility, the price you pay is that every program uses a sub-optimal concrete syntax.

The second problem is that Lisp is a dynamically typed language. Although current practical type systems only help catch some trivial errors, they do help a lot.

A sufficiently smart compiler, like SBCL, does type inference or constraint propagation and emits really helpful warnings, so type checking may not be that important there. But that depends on the compiler implementation; in some young implementations, like Clojure, the compiler is not yet smart enough to help diagnose.

The third problem is that Lisp macros ignore locations totally: when you process the raw S-expression, no location is kept. In some domains, code generation for example, location is not that important since you only emit a large chunk of code; in other domains, such as Ast transformation, location is important to help emit helpful error messages. Keeping locations correct is very tedious but necessary, IMHO. Some meta programming systems, such as Template Haskell, ignore locations as well.

How to do SMP in rich syntax language

Now let’s go back to OCaml, the great language XD.

It is the same as in Lisp: you have to encode the Ast in the host language. You could encode OCaml’s Ast using S-expressions as well.

S-expressions are a viable option; Felix adopts this mechanism. The advantage of using S-expressions to encode the Ast is that you reach maximum code reuse and don’t need to fight against the type system from time to time.

For example, in Camlp4, once you want to get the location of an Ast node, you have to fix its type, so you have to write a lot of boilerplate code like this

Every time you want to fetch the location, you have to fix its type. That’s too bad: the API for processing the syntax is too verbose.
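
The kind of boilerplate meant here can be illustrated on a toy scale. The types and function names below are invented for this sketch and are not Camlp4’s actual definitions; the second half shows, as a contrast, how a single accessor suffices once the categories share their tags.

```ocaml
type loc = int  (* stand-in for a real source location *)

(* With ordinary sum types, every syntax category needs its own
   location accessor. *)
type expr = EInt of loc * string | ESem of loc * expr * expr
type patt = PLid of loc * string | PSem of loc * patt * patt

let loc_of_expr = function EInt (l, _) | ESem (l, _, _) -> l
let loc_of_patt = function PLid (l, _) | PSem (l, _, _) -> l

(* With polymorphic variants, one accessor covers every category that
   is built from these tags. *)
type vexpr = [ `Int of loc * string | `Sem of loc * vexpr * vexpr ]
type vpatt = [ `Lid of loc * string | `Sem of loc * vpatt * vpatt ]

let loc_of = function
  | `Int (l, _) | `Lid (l, _) | `Sem (l, _, _) -> l
```

In the first half the number of accessors grows with the number of syntax categories; in the second half it only grows with the number of tags.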

But using Algebraic Data Types does have some advantages. The first is pattern matching (with exhaustiveness checking); the second is type checking: we can tell the difference between Ast.expr and Ast.patt, and that helps. But you can not tell whether it’s an expression of type int or of type boolean, for example

(Int "3" : expr)
(String "3" : expr)

MetaOCaml can guarantee type correctness, but there is always a trade-off between expressivity and type safety. Anyway, in a statically typed language, e.g. OCaml, the generated program is always type checked.

So, in OCaml or other ML dialects, you can encode the Abstract Syntax using one of these: untyped S-expressions, partially typed sum types, records, GADTs, or a mix of records and sum types. There is another solution which is unique to OCaml: polymorphic variants.

We will discuss it further in the next post.

Quasi-quotation in OCaml

Quasi-quotation in lisp is free, since the concrete syntax is exactly the same as abstract syntax.

Luckily, since expr^1 is a subset of expr^0, you get the following function for free

val meta_expr : expr^1 -> expr^2

Actually you may find that the category expr^2 is exactly the same as expr^1, so once you have expr^0 -> expr^1, you have expr^0 -> expr^n (antiquotation will be discussed later).

So the problem lies only in how to write the function expr^0 -> expr^1. You need to encode the Ast using the Ast itself, which requires that the Ast be expressive enough to express itself. This is not always true: suppose you use HOAS; HOAS is not expressive enough to express itself.

If you mix records with sum types, you have to express both records and sum types, and the Ast lifting is neither easy to write nor easy to read. With locations it becomes even more complex; the best case is to do it automatically, once and for all.

Suppose you only use sum types. Luckily, we find that only five tags are expressive enough to express this function expr^0 -> expr^1. Here are the five tags

App Vrn Str Tup Com

Here Tup means “tuple”, and Com means “comma”.

The fewer, the better: as long as changes to the Abstract Syntax Tree do not involve these five tags, the lifting will always work out of the box.
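
Why five tags are enough, and why expr^1 -> expr^2 then comes for free, can be checked on a type that contains nothing but those five tags. The payloads chosen below are an assumption for the sketch (Fan’s real tags also carry locations); the point is only that the lifting function maps this type into itself.

```ocaml
(* A hypothetical Ast made of the five tags only. *)
type lifted =
  [ `App of lifted * lifted
  | `Vrn of string
  | `Str of string
  | `Tup of lifted
  | `Com of lifted * lifted ]

(* Lifting a five-tag Ast yields a five-tag Ast again, so the same
   function can be applied any number of times. *)
let rec meta : lifted -> lifted = function
  | `Vrn s -> `App (`Vrn "Vrn", `Str s)
  | `Str s -> `App (`Vrn "Str", `Str s)
  | `Tup a -> `App (`Vrn "Tup", meta a)
  | `App (a, b) -> `App (`Vrn "App", `Tup (`Com (meta a, meta b)))
  | `Com (a, b) -> `App (`Vrn "Com", `Tup (`Com (meta a, meta b)))
```

For instance, meta (meta (`Str "3")) is well typed and stays inside the same five tags, which is the closure property the argument above relies on.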

So, to design the right Ast for meta programming, the first thing is to keep it simple: don’t use records or other complex data types. Sum types or polymorphic variants are rich enough to express the whole syntax of OCaml, yet simple enough to make Ast lifting easy.

In the next blog post, we may discuss the right way to design an Abstract Syntax Tree for SMP.

This will be a series of blog posts introducing a new programming language, Fan.

Fan is OCamlPlus: it provides all the features that OCaml provides, plus a language for manipulating programs. I am also seeking collaboration if you are interested in such a fascinating project.

It aims to provide OCaml + a compiler domain-specific language. The compiler domain is a bit special: it is the domain that users can use to create their own domain-specific languages, e.g. for database queries or financial modelling. Our purpose is to let you write a practical compiler in one day. Yes, this is not a joke: with the right tools and nice abstractions, it is very promising to help average programmers create their own languages to fit their domains in a short time.

The compiler domain is a rather large domain consisting of several sub-domains, so the compiler of Fan itself also benefits from domain-specific languages (DSLs). Unlike other bootstrapping models, all features of the previous version of the Fan compiler are usable for the next release. Yes, Fan is written using itself, it’s really Fun 🙂

Fan evolved from Camlp4, but with a more ambitious goal and different underlying engines; I will compare them later.

Ok, let’s talk business.

Why a new programming language? Because I haven’t found a programming language that makes me happy (yet).

Think about how you solve a problem.

It’s mainly divided into two steps.

The first step is to think of an algorithm to tackle the problem, without ambiguity. This is what we call inherent complexity: however fancy the programming language is, you still have to think of a way to solve the problem.

The second step is to map your algorithm into your favourite language, e.g. Haskell. Ideally, it should be straightforward, but in reality it brings a lot of trouble, and we call it accidental complexity.

What we can do to enhance a programmer’s productivity lies in avoiding the accidental complexity of the second step.

The problem is that your favourite language was not designed for your specific domain; it’s a general purpose programming language. When you transfer your ideas into your language, you have to do a lot of dirty work. Modern IDEs may alleviate this a bit, but programs are not just written to be executed; a further goal is to help exchange ideas. When you want to understand how a piece of program works, you have to reverse-engineer the program back into ideas, because in the translation from ideas into programs the big picture is lost and the initial brief ideas are mixed with a lot of noise.

It is a sad fact that this is how programmers work nowadays. 😦

“When you have a hammer, everything is a nail”.

One difference between human beings and animals is that man can use tools; the fact that man can not only use tools but also create them is what makes human beings so intelligent. It’s a sad fact that most programmers still live in the cave age: they can only accept the tools they are given. Smart programmers should create the tool that best fits their domain.

So, what’s the right way to solve a problem?

When you find similar problems appearing again and again, try to design a language in which you can express your ideas as isomorphically as possible to the problem’s description, then write a compiler for that language. Then it’s done. People who read your program will understand it straightforwardly, you write your programs quickly, everything seems to be perfect, everyone is happy.

Wait, you may find that I am cheating: writing a toy language is not hard, writing a medium-sized language is painful, creating a general purpose language is too hard, and making your legacy libraries communicate with your new language will drive you crazy. So you may say: “let’s forget about it” and shy away.

Yes, that’s true, and that’s why I designed a new programming language to address this issue. Remember that creating a language is itself a domain, and this domain shares some similar abstractions which should be factored out. And to make life happier, you are extending a general purpose programming language to fit your domain instead of creating a brand new language; both are compiled into the same intermediate representation, like C# and VB, so you never have an interoperation problem.

Once you finished the language for one domain, your productivity will be boosted exponentially in such a domain.

Fan is created to help you achieve such a goal!

There are different abstraction and DSL solutions; in the next post I will compare them and talk about the solution Fan chooses and its good and bad effects.