Tamás Szelei's coding blog

Implementing a code generator with libclang

Introduction

The following article covers the process of implementing a practical code generator for C++ in detail. You can find the full source code for the article on GitHub.

A code generator is a very useful asset in a larger C++ project. Due to the lack of introspection in the language, implementing the likes of reflection, script binding and serialization requires writing some sort of boilerplate that essentially keeps the data which is otherwise thrown away by the compiler. These solutions are either intrusive (heavily macro-based, thus hard to debug and requiring weird syntax in declarations) or fragile (the boilerplate must be constantly updated to follow the actual code, and might break without warning). One way to improve the robustness is to automate writing this boilerplate. In order to achieve this, we need to parse the code somehow, in other words, understand what information to keep. However, parsing C++ is an extremely complex task, and with the copious amount of weird corner cases, we are in for quite a ride if we attempt to do so.

Attempts to parse a “good enough” subset of C++ generally fail or require the project to follow strict coding guidelines that avoid syntax the parser can’t understand, and such a setup may break at any time when someone commits code that doesn’t follow these guidelines. The excellent LLVM project offers a tool to remedy this problem: libclang[1]. Since libclang ultimately calls the same bits of code that the clang C++ frontend calls, it will understand everything that is valid C++. Recent builds even support C++1y (C++14, if all goes well) features. It has one little flaw: the official documentation is pretty much only the Doxygen-generated reference, which is very useful, but not as an introduction to the usage of the library; due to the complex nature of the problem, it has a steep learning curve.

I am going to present the process of implementing a practical code generator for C++ using libclang and its Python bindings. “Practical” in the sense that it is not a general solution. What I’m presenting here is not meant to be taken as the concrete implementation I made but rather as one possible way of solving a problem, a detailed example. Using these ideas, it is possible to create an all-encompassing reflection solution or to generate code for existing libraries of any purpose. The aim is to write natural C++ syntax with minimal intrusive bits to provide the functionality. I encourage readers to experiment with the code and to try and implement a code generator from scratch.

The example problem

In our example, we are implementing automatic script binding, so we don’t need to write and maintain binding code by hand. We also want to be able to omit certain parts of the source so that they are not taken into account when the binding boilerplate code is generated[2]. Keep in mind that this article is not about automatic script binding[3]. It is just one thing that can be done with code generation, used here as an example.

In the example, we are going to work with the following C++ class declaration:

Now let’s see the simple solution. We will utilize Boost.Python, a seasoned and battle-tried library, which allows us to write binding code in a very expressive manner. The following is the entire binding code in a separate source file which we will link with our executable. We also define an init_bindings() function, which does what its name says.

After init_bindings is called, our TextComponent class is available to use in Python. The above code expresses exactly what we wanted to achieve: one constructor, and two member functions. We simply don’t bind the superSecretFunction, because we don’t want that to be available from Python.

All of the above is what we would do in a typical project to make a class scriptable. Our aim is to generate this code automatically.

Automation

Traversing the AST

Now we are going to inspect the abstract syntax tree (AST) of the header file and use that information to generate the above binding code.

Traversing the AST is performed with cursors. A cursor points to a node in the AST, and can tell what kind of node that is (for example, a class declaration) and what its children are (e.g. the members of the class), as well as plenty of other information. The first cursor you need points to the root of the translation unit, that is, the file you are parsing.

The index object is our main interface to libclang; this is where we normally initiate “talking to the library”.
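A minimal sketch of that first step, assuming the clang.cindex Python bindings and a matching libclang are installed. The helper name parse_header and the file name textcomponent.h are made up for illustration; the flags are the ones used for the parse call in this article.

```python
# Flags used for the parse call in this article: treat the input as C++,
# pick the standard, and define a macro only the code generator sees.
PARSE_ARGS = ['-x', 'c++', '-std=c++11', '-D__CODE_GENERATOR__']


def parse_header(path, args=PARSE_ARGS):
    """Parse `path` with libclang and return the translation unit.

    `parse_header` is a hypothetical helper name, not part of libclang.
    """
    # Imported lazily so this module can be loaded without libclang.
    import clang.cindex

    index = clang.cindex.Index.create()
    return index.parse(path, args=list(args))


# Typical use (needs libclang at runtime; the file name is hypothetical):
#   translation_unit = parse_header('textcomponent.h')
#   root = translation_unit.cursor  # cursor at the root of the AST
```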

The parameters of the parse call

The parse function takes a filename and a list of compiler flags. We need to specify that we are compiling a C++ header (-x c++), because otherwise libclang will assume it is C, based on the .h extension (and consequently produce an AST that misses most parts of our header file). This option will cause libclang to preprocess our file (resolve macros and includes) and then treat it as a C++ source. The other options should be self-explanatory: setting the standard we use in the parsed source, and providing the __CODE_GENERATOR__ macro definition which will come in handy in most implementations. Now, back to processing the AST – recursively. The AST of the TextComponent class can be dumped like this (see dump_ast.py):
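One way such a dump helper could look is sketched below. It is not necessarily what dump_ast.py does; it only relies on the kind, spelling and get_children() members of a cursor, and the function name dump is an assumption.

```python
def dump(cursor, depth=0):
    """Return one indented line per AST node, visiting depth first."""
    lines = ['%s%s %s' % ('  ' * depth, cursor.kind.name, cursor.spelling)]
    for child in cursor.get_children():
        lines.extend(dump(child, depth + 1))
    return lines


# Typical use with a parsed translation unit:
#   print('\n'.join(dump(translation_unit.cursor)))
```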

Comparing this syntax tree dump to the source above should help with understanding what libclang does when parsing the source code.

Did you mean recursion?

The libclang C API has a visitor-based way of traversing the AST. This is also available from the Python bindings via the function clang.cindex.Cursor_visit, but we are going to utilize the more Pythonic, iterator-based approach, which is an addition of the Python bindings. The ‘T’ stands for tree in the abbreviation, and the most straightforward way to process such a data structure is to recursively traverse it. It is pretty simple:

def traverse(cursor):
    # ...
    # do something with the current node here, e.g.
    # check the kind, spelling, displayname and act based on those
    # ...
    for child_node in cursor.get_children():
        traverse(child_node)

Useful properties of cursors

Cursors provide a rich set of information about each node in the AST. For our purposes, we will use the following:

kind: The kind of node we are looking at in the AST. Answers questions like: Is this a class declaration? Is this a function declaration? Is this a parameter declaration?

spelling: The literal text defining the token. For example, if the cursor points to a function declaration void foo(int x);, the spelling of this cursor will be foo.

displayname: Similar to spelling, but the displayname also contains some extra information which helps to distinguish between identically spelled tokens, such as function overloads. The displayname of the above example will be foo(int).

location.file: The source file in which the node was found. This can be used to filter out content pulled in by includes, because usually we are only interested in the file being parsed.

If you are implementing something different, you might find the following properties useful, too: location and extent. Sometimes the only way to get a particular string is to read the source directly. With location and extent you can find the exact point in the file that you need to read.
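Reading a node's text straight from the file could be sketched as follows. The helper name source_text is hypothetical, and the sketch assumes the source file is UTF-8 encoded; it uses the start and end locations of the cursor's extent, each of which carries a file and an offset into it.

```python
def source_text(cursor):
    """Return the exact source text covered by `cursor.extent`."""
    start, end = cursor.extent.start, cursor.extent.end
    # Offsets count bytes from the start of the file, so read it as bytes.
    with open(start.file.name, 'rb') as source:
        data = source.read()
    # Assumption: the source file is UTF-8 encoded.
    return data[start.offset:end.offset].decode('utf-8')
```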

Poor man’s code model

While it is entirely possible to generate code in an online manner, I find it clearer (and more reusable) to actually build a code model in which the C++ classes, functions (and whatever else is interesting for your purposes) are objects.

The following piece of code illustrates what I mean here and also showcases how a (very thin) object model of a class is constructed:
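A minimal sketch of such a model is shown here. It compares kind.name with a string so that it can be tried out without libclang; with real cursors that is equivalent to comparing against clang.cindex.CursorKind.CXX_METHOD.

```python
class Function(object):
    """A member function; the model only keeps its name."""

    def __init__(self, cursor):
        self.name = cursor.spelling


class Class(object):
    """A class with a name and a list of member functions."""

    def __init__(self, cursor):
        self.name = cursor.spelling
        self.functions = []
        for child in cursor.get_children():
            # Keep member functions only; constructors, fields and
            # other kinds of children are skipped.
            if child.kind.name == 'CXX_METHOD':
                self.functions.append(Function(child))
```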

It all really boils down to traversing a tree and filtering for certain elements. The for loop above does just that: if the current node is a C++ method (member function), then construct and store a Function object using the information found in that node.

This is a very simplistic code model: classes have names and member functions, and member functions have names. It is possible to gather much more than that, but for our purposes, this is mostly enough. By the way, that above is almost half of all the code we need in our code generator!

Now let’s see the other half: how the classes are built. Reusing the traversal approach:
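A sketch of how that function might look, with a compact copy of the code model so the block stands on its own. Two things here are assumptions: the optional source_file parameter (falling back to the first command-line argument) and the descent into namespaces.

```python
import sys


class Function(object):
    def __init__(self, cursor):
        self.name = cursor.spelling


class Class(object):
    def __init__(self, cursor):
        self.name = cursor.spelling
        self.functions = [Function(c) for c in cursor.get_children()
                          if c.kind.name == 'CXX_METHOD']


def build_classes(cursor, source_file=None):
    """Collect a Class for every class declared in `source_file`."""
    if source_file is None:
        # Assumption: the parsed header is the first command-line argument.
        source_file = sys.argv[1]
    result = []
    for child in cursor.get_children():
        node_file = child.location.file  # None for nodes without a file
        in_source = node_file is not None and node_file.name == source_file
        if child.kind.name == 'CLASS_DECL' and in_source:
            result.append(Class(child))
        elif child.kind.name == 'NAMESPACE':
            # Assumption: classes inside namespaces should be found too.
            result.extend(build_classes(child, source_file))
    return result
```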

One important step here is that we are checking the location of the node. This ensures that we are only taking the contents of the file being parsed into account. Otherwise, we would end up with a huge mess of an AST due to the includes being, well, included. To put it all together, we would call the above function with the translation unit cursor (see above), to find all classes and their functions in the source file:

classes = build_classes(translation_unit.cursor)

Code generation

Now that we have a code model, we can easily process it and generate the code we need. We could iterate over the list of classes, print the class_… part in the binding code, then iterate their member functions… etc. This approach can work and is easy to implement, albeit not very elegant. We are going to do something way cooler: we will use templates to generate our code.

Templates. Duh?

Of course you already saw we are using templates in the Boost.Python code. What is so cool about that? Oh, I didn’t mean C++ templates. Generating text from data is a well-understood problem, with lots of great solutions from the web programming world. Mako templates is one of them[4], with prominent users like reddit and python.org. Sceptical? Take a look at the template code we will use:

This template directly uses the code model we defined above, c.name refers to the name of the Class object. It is easy to see how even this simple code model can be used to generate code for various purposes. You can register functions not only for script binding, but also reflection libraries which allow a huge variety of dynamic uses (serialization, RPC, thin layer for script binding etc.). Save that template to a file named bind.mako, and then using it is really just a few lines:
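A minimal sketch of both pieces, assuming Mako is installed. The template text below is an illustration of the idea rather than the exact template of this article: it binds member functions only (no constructors), and the names render_bindings, BIND_TEMPLATE and include_file are assumptions. The template is kept in a string here; with the file saved as bind.mako, Template(filename='bind.mako') would load it instead.

```python
BIND_TEMPLATE = '''\
#include <boost/python.hpp>
#include "${include_file}"

using namespace boost::python;

void init_bindings()
{
% for c in classes:
    class_<${c.name}>("${c.name}")
% for f in c.functions:
        .def("${f.name}", &${c.name}::${f.name})
% endfor
    ;
% endfor
}
'''


def render_bindings(classes, include_file, template_text=BIND_TEMPLATE):
    """Render the binding source for the given code model."""
    # Imported lazily so the module loads even where Mako is missing.
    from mako.template import Template

    template = Template(template_text)
    return template.render(classes=classes, include_file=include_file)
```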

That is almost what we wanted. The remaining problem is that our script also bound the superSecretFunction, which we meant to hide.

Hiding member functions

Now, to tell our code generator script that some parts of the AST are different from others (namely, we want to ignore them), we need to use some annotation magic[5]. When I was experimenting, I tried using the new(-ish) C++11 [[attributes]] to mark functions, but libclang seemed to omit unknown attributes from the AST. That is correct behavior, as far as the standard is concerned: compilers should ignore unknown (non-standard) attributes. Clang simply chooses to ignore them while building the AST, which is unfortunate for our case. Luckily, clang has a language extension which can be used to apply string attributes to many kinds of nodes, and that information is readily available in the AST. With a graceful macro definition, we can use this extension without losing compatibility with other compilers. The syntax is the following:

__attribute__((annotate("something"))) void foo(int x);

Admittedly, that is a bit lengthy. We do need to employ conditional compilation for portability anyway, so let’s do the following:

This way our code generator will see the annotations, but compilers won’t. To use the annotation, we will need to revisit some of the code above. First, we write the annotation where we want it in the source:

We can filter out the unwanted elements (classes from includes and functions that are meant to be hidden) in two places: while building the code model or when our Mako template is rendered. Choosing either is a matter of taste; I voted for the latter (I feel it is better to keep the code model consistent with the source file). The modified template looks like this:
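Whichever place is chosen, the filter needs a way to read the annotations off a cursor. A sketch of that lookup follows: annotate attributes show up as child nodes of kind ANNOTATE_ATTR whose spelling is the annotation string. The helper names and the marker string "hidden" are assumptions for illustration.

```python
def get_annotations(cursor):
    """Return the strings attached to `cursor` via annotate attributes."""
    return [child.spelling for child in cursor.get_children()
            if child.kind.name == 'ANNOTATE_ATTR']


def is_hidden(cursor, marker='hidden'):
    """True if the node carries the (hypothetical) hiding annotation."""
    return marker in get_annotations(cursor)
```

A Function object could store the result of get_annotations(cursor) so that the template can skip functions carrying the marker.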

At this point, the generated binding code is almost identical to what we wrote by hand. There is one last issue we need to cover.

Hiding non-public members

Our example source only had public member functions. In a real-world scenario, this rarely happens. The code generator did not take access specifiers (public, private, protected) into account, but it is very important to do so: the generated binding code would not compile if it contained non-public members. Unfortunately, at the time of writing, the Python bindings do not expose the access specifiers on the cursors. I recommend using my patched cindex.py to access this information[6]. The last revision of our Class constructor filters out the non-public members:
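The filtering step on its own could be sketched like this, written as a standalone helper rather than as the constructor itself. It assumes a cindex.py that exposes an access_specifier property on cursors (a patched copy, or a later release of the bindings that ships it); the helper name public_methods is hypothetical.

```python
def public_methods(class_cursor):
    """Return the cursors of the public member functions of a class."""
    return [child for child in class_cursor.get_children()
            if child.kind.name == 'CXX_METHOD'
            # Assumption: the bindings expose `access_specifier`.
            and child.access_specifier.name == 'PUBLIC']
```

Inside the Class constructor, the loop over cursor.get_children() would then run over public_methods(cursor) instead.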

With this, our solution is pretty much complete. The generated source will look like the one we wrote by hand. You can find the full, integrated project in the git repository.

[1] Another option to parse C++ header files is the CppHeaderParser Python package. It has some problems with templates and probably other syntax corner cases as well. The upside is that it is a single .py file, thus very easy to integrate in your project.

[2] Such a feature is useful when the C++ layer is meant to serve as a low-level implementation of features tied together by scripts. In some cases we want to keep functions public while hiding them from the scripting interface (e.g. debug functionality).

[3] There are plenty of great solutions for automatic script binding, far more versatile than what we implement here (SWIG is one of them).

[4] There are several alternatives to Mako of course; if you are aiming to minimize the dependencies of your project, you might want to check out Titen. Another approach could be to avoid template engines altogether and just use Python string formatting with {variables_like_this}.

[5] It is also possible to simply use the __CODE_GENERATOR__ macro to hide some parts, but that is quite ugly and intrusive, one of the things we wanted to avoid.

[6] I submitted a patch to the clang frontend project while writing the article and I will update the post if it gets approved.

Your article is great and very helpful. I have one question. If I follow your code “index.parse(sys.argv[1], ['-x', 'c++', '-std=c++11', '-D__CODE_GENERATOR__'])”, it will also parse all #include “*.h” files. How can I set up index.parse so that it does not parse #include “*.h” files? Thank you very much!

Tamás Szelei

Thanks for the kind words.

You can use ‘-x cpp-output’ in place of the ‘-x c++’ option to omit the preprocessing step (that is when includes happen), but that will also leave any macros without expansion in your code. That potentially leads to parsing errors. A better solution is to do what I do in the script: check the name of the file to which the cursor belongs so that you only process the nodes in the file. See the highlighted line: https://github.com/sztomi/code-generator/blob/master/src/boost_python_gen.py#L36

Ciro

Hi Tamas, I follow your suggestion to check the name of the file. It works very well. Thank you very much!

Ciro

Hi Tamas, I want to find all function calls. For example, there are printf(), sprintf() and malloc(), but I can only find printf() and sprintf() with libclang, and cannot find malloc(). I also used your dump_ast.py, and I cannot find malloc in the AST. I can only find printf and sprintf marked as CALL_EXPR. printf and malloc both belong to glibc, so why doesn’t the AST show malloc? Or am I missing something?

Tamás Szelei

No idea. You are probably missing an include or something. Try to compile your code with clang. Once you figured out the flags, pass them to libclang.

Ciro

You are right, I did miss the include. However, malloc() still cannot be found in some situations. For example, with char *ptr; ptr = (char *)malloc(20); malloc can be found as CALL_EXPR. With char *ptr; ptr = malloc(20); malloc cannot be found. The AST doesn’t show the related malloc.

Manish Mishra

Write a Windows program to find the size of a window created with x position 100 & y position 150?
This is a question for my college, so how can it be coded in C++?

Tamás Szelei

Your question does not belong here (nor does it make much sense). I suggest you try a forum or StackOverflow (but put more effort into asking, otherwise it will be promptly deleted).

Fernando Lener

Awesome! I will do something similar at home. Thank you!

Sun

Hey, your article is super useful for me. I also want to know whether you have any idea how libclang uses a JSON compilation database. Thanks!

Tamás Szelei

Thanks. I haven’t used this functionality, but libclang does expose functions to deal with compilation databases. This is a link to the reference: http://clang.llvm.org/doxygen/group__COMPILATIONDB.html
This seems pretty “batteries included” to me, but if you have extra needs there is always libtooling.

Sun

Thanks. I don’t find how the libclang use compilation databases. Now I change to use libtooling.