Tuesday, October 14, 2008

Sprint Discussions: C++ Library Bindings

At the beginning of this year, PyPy grew ctypes support, thanks to generous
support by Google. This made it possible to interface with C libraries from
our Python interpreter, something that was possible but rather tedious before.
What we are still lacking is a way to interface with large C++ libraries (like
GUI libraries). During the sprint we had a brainstorming session about possible
approaches for fixing this shortcoming.

The common binding tools for CPython all have the property that they generate
code which is then compiled to produce a CPython extension. The generated code
also uses functions from CPython's C-API. This model is not simple to use for
PyPy in its current state. Since PyPy generates C code automatically, a fixed
C-level API does not exist (it is not unlikely that at some point in the future
we might have to provide one, but not yet). At the moment, PyPy very much has a
"Don't call us, we call you" approach.

A very different approach is followed by the Reflex package, which is
developed at CERN (which has an incredible number of C++ libraries). It is not
mainly intended for writing Python bindings for C++ libraries but instead
provides reflection capabilities for C++. The idea is that for every C++ shared
library, an additional shared library is produced, which, together with Reflex,
makes it possible to introspect properties of C++ classes, methods, etc. at
runtime. These facilities are then used for writing a small generic CPython
extension module that allows CPython to use any C++ library for which this
reflection information was generated.
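
To make the idea of a single generic module driven by reflection data a bit
more concrete, here is a toy model in pure Python. It does not use Reflex at
all: REFLECTION_INFO, make_proxy_class and the Counter class are hypothetical
names invented for this sketch, and plain Python functions stand in for the
compiled C++ entry points that a real reflection library would expose.

```python
# Toy stand-ins for compiled C++ functions (hypothetical, not Reflex API).
def _counter_new(start=0):
    return {"value": start}

def _counter_increment(state, by=1):
    state["value"] += by

def _counter_get(state):
    return state["value"]

# Hypothetical reflection data: one entry per class, listing its
# constructor and methods. A real system would load this at runtime
# from the extra shared library generated for each C++ library.
REFLECTION_INFO = {
    "Counter": {
        "constructor": _counter_new,
        "methods": {
            "increment": _counter_increment,
            "get": _counter_get,
        },
    },
}

def make_proxy_class(class_name, info=REFLECTION_INFO):
    """Build a Python class from reflection data, with no per-class code."""
    meta = info[class_name]

    def __init__(self, *args):
        # Keep a handle to the underlying "C++ object".
        self._state = meta["constructor"](*args)

    namespace = {"__init__": __init__}
    for method_name, stub in meta["methods"].items():
        # Bind stub as a default argument so each method keeps its own stub.
        def method(self, *args, _stub=stub):
            return _stub(self._state, *args)
        method.__name__ = method_name
        namespace[method_name] = method
    return type(class_name, (object,), namespace)

Counter = make_proxy_class("Counter")
counter = Counter(5)
counter.increment()
counter.increment(3)
print(counter.get())
```

The point of the sketch is only that make_proxy_class knows nothing about
Counter in particular: wrapping another class means adding reflection data,
not writing more binding code.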

This approach is a bit similar to the ctypes module, apart from the fact
that ctypes does not use any reflection information, so the user has to
specify the data structures that occur in the C code herself. This sometimes
makes it rather burdensome to write cross-platform library bindings.
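
As a small illustration of what "specify the data structures herself" means
with ctypes, the sketch below calls the C standard library function div,
which returns a struct by value. It assumes a POSIX system where the C
library can be loaded this way and where div_t is laid out as two ints in
the order quot, rem (as in glibc); DivResult and c_div are names chosen for
this example.

```python
import ctypes
import ctypes.util

class DivResult(ctypes.Structure):
    # Hand-written mirror of C's div_t. ctypes cannot discover this
    # layout on its own, so field order and types are our assumption.
    _fields_ = [("quot", ctypes.c_int), ("rem", ctypes.c_int)]

# Load the C library; if find_library returns None, CDLL(None) falls
# back to the symbols of the running process on POSIX systems.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# The argument and return types also have to be declared by hand.
libc.div.argtypes = [ctypes.c_int, ctypes.c_int]
libc.div.restype = DivResult

def c_div(numerator, denominator):
    """Call C's div() and return (quotient, remainder)."""
    result = libc.div(numerator, denominator)
    return result.quot, result.rem

print(c_div(7, 2))
```

If the declared layout did not match the platform's actual div_t, the call
would silently return wrong values, which is the kind of cross-platform
burden that reflection information would remove.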

For PyPy the approach seems rather fitting: We would need to implement only the
generic extension module and could then use any number of C++ libraries. Of
course some more evaluation is needed (e.g. to find out whether there are any
restrictions for the C++ code that the library can use and how bothersome it is
to get this reflection information for a large library) but so far it seems
promising.

11 comments:

I've done a fair amount of complicated Boost.Python wrapping, and also implemented a small replacement for it with most of the complexity removed. There are two main reasons why Boost.Python is so complicated:

1. It supports arbitrarily complex memory and sharing semantics on the C++ classes (and is runtime polymorphic on how the memory of wrapped objects is managed).

2. It supports arbitrary overloading of C++ functions.

If you remove those two generality requirements (by requiring that wrapped C++ objects are also PyObjects and banning overloading), it's possible to write very lightweight C++ bindings. Therefore, I think it's critical to factor the C/C++ API design so that as much of it as possible is writable in application-level Python on top of a small core that does the final C++ dispatch.

For example, if you wrap a C++ vector class with a bunch of overloads of operator+ in Boost.Python, each call to __add__ has to do a runtime search through all the overloads, asking whether each one matches the arguments passed. Each such check does a runtime search through a table of converters. It would be a terrible shame if that overhead isn't stripped by the JIT, which means it has to be in Python.

I think a good test library for thinking about these issues is numpy, since it has some memory management complexity as well as internal overloading.

Once that is done, it's lots easier to interface with the outside world.

For a lot of C++ APIs I find it easy enough to write a C API on top of them.

In fact many C++ APIs provide a C API, since that makes it easier to work with different C++ compilers. As you probably know, different C++ compilers mangle things differently.

It is possible to look at C++ code at runtime. You just need to be able to interpret the C++ symbols. I know someone did a prototype of this for vc6 on Windows. He parsed the symbols, and then created the functions at runtime with ctypes. However, the approach is not portable between platforms, compilers, or even different versions of compilers. Of course this didn't allow you to use many of the C++ features, but only some.

If you look at how swig works, you will see it kind of generates a C API for many C++ things.

For libraries, it is customary to provide a C API. It just makes things easier.

you might want to look at PyRoot [1,2,3] which is using the Reflex library to automatically wrap (and pythonize) the C++ libraries/types for which a Reflex dictionary has been (beforehand) generated.

theoretically any piece of C++ can be wrapped, as Reflex is using gccxml[4] to extract information from a library and to generate the dictionary library.

Using it in one of CERN's LHC experiments, which makes heavy (ab)use of templates (Boost), I can say that we had almost no problems. Usually the only problems we got were either at the gccxml level (resolution of typedefs, default template arguments, ...) or at the gccxml-to-reflex level (mainly naming convention problems interfering with the autoloading of types at runtime).

Being a client of gccxml is rather annoying, as the development is... opaque.

I know the Reflex guys were investigating at some point to migrate to an LLVM version (with GCC as a frontend) to replace gccxml.

There's been some (small) discussion in the SWIG project of making an alternative output method which creates a simple C API for a C++ project, and wraps that with ctypes (generating the python side of the ctypes bindings, too). So far, this is purely theoretical, but all the pieces needed to do it are present in the SWIG source code. If reflex doesn't work out, this might be a reasonable alternative approach.

Wow. A lot of very informative posts. We'll definitely evaluate further what you all posted. Also, in case you want to discuss more, the mailing list is usually a better place for discussions. Feel free to send new ideas or more detailed info there.

illume: Adding a C-API is rather hard, and probably not on our todo list, unless somebody pays for it :-).

anonymous: From a quick glance I am not sure Elsa would really help. Yes, you can use it to parse C++ headers and get information about them. But as far as I can see, you cannot use it to create shared libraries that can be used to dynamically construct classes and dynamically call methods on them. Besides, the idea is to have a solution that works on both CPython and PyPy. Reflex already has a way to bind C++ libraries to CPython, so we only need to do the PyPy part.

Anyway, if anybody is interested in more detailed discussions, we should all move to pypy-dev.

what the PyROOT/Reflex layer is doing is looking at the dictionary for the std::vector(FooKlass), discovering that there is a pair of functions 'begin' and 'end', and figuring out that one can create a Python iterator from that pair.

anyways, as Maciej pointed out, we could try to move this discussion here[1] or there[2]...