Managing Object Lifecycles in Ironclad

Ironclad 0.8 has just been released. Ironclad is an open source project by Resolver Systems to allow the use of Python C extensions from IronPython.

Ironclad is an implementation of the Python C API in C#, with a bit of assembly language to fool extensions into believing they are calling into Python25.dll. It also reuses as much of the C implementation of the API as possible. When extensions make API calls, Ironclad creates IronPython objects rather than CPython objects, and Ironclad also handles the mapping of extension objects into IronPython.

One of the hardest challenges for Ironclad is that Python extension modules expect to use reference counting for garbage collection, whereas the .NET framework has its own (more efficient!) garbage collector. Ironclad objects that are still being used by the C extension module mustn't be freed even if no .NET references exist to an object, and the reference count of extension objects mustn't be allowed to drop to 0 if they are still being used inside .NET.

A while ago I blogged about how Ironclad handled this. That entry is now well out of date, so William Reade (lead developer of Ironclad) has provided an update.

Hello! My name is William Reade. I'm a colleague of Michael's, and the
primary developer on the Ironclad project, and some months ago I
explained to Michael how I managed object lifetimes in Ironclad. He in
turn wrote a detailed post about it here, and I merrily chipped in to
clarify a couple of points in the comments; the result all looks
perfectly respectable and technically quite neat, apart from the fact
that the approach described is, in fact, dangerously stupid and wrong
in at least one critical respect [1].

So, er, sorry. Let's see if we can do a little better this time. The
fundamental problem that Ironclad needs to solve is: "how do we make
an IronPython object from a type defined in a compiled CPython
extension?"

Well, modulo a few mildly diabolical details, it's actually pretty
easy to construct an IronPython type which forwards all method calls
(and attribute accesses, etc) to an underlying CPython instance. So,
in essence, we solve the problem by creating IronPython objects which
wrap CPython objects -- which combination I will call a "bridge
object", for want of a better term [2].

There's a bit of an impedance mismatch between the two systems, and
the biggest problem thus far has been managing object lifetimes.
CPython objects are reference-counted, and will effectively commit
suicide -- completely deterministically -- as soon as their refcount
hits 0, while IronPython objects are destroyed non-deterministically
by the Garbage Collector: effectively, at random.

Some objects' lifetimes are easy to track: when we create a CPython
proxy for an IronPython object, we store the IronPython object and the
pointer to the CPython stub object in a 2-way map (thus ensuring the
IronPython object will not be GCed), and give the stub a dealloc
method that deletes the association (rendering the IronPython object
-- or CLR object, for that matter -- once again eligible for GC), as
soon as the CPython object's refcount hits 0.
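
The bookkeeping in the paragraph above can be sketched in plain Python. This is only an illustration, not Ironclad's actual code (which is written in C#): the class name ProxyMap, its method names, and the integer "pointers" are all hypothetical stand-ins.

```python
class ProxyMap:
    """Hypothetical two-way map between managed objects and stub pointers."""

    def __init__(self):
        self._obj_by_ptr = {}   # stub pointer -> managed object (strong reference)
        self._ptr_by_id = {}    # id(managed object) -> stub pointer
        self._next_ptr = 0x1000  # fake addresses, for illustration only

    def associate(self, obj):
        """Return the stub pointer for obj, creating a stub if needed."""
        key = id(obj)
        if key in self._ptr_by_id:
            return self._ptr_by_id[key]
        ptr = self._next_ptr
        self._next_ptr += 8
        self._obj_by_ptr[ptr] = obj   # this strong reference blocks GC
        self._ptr_by_id[key] = ptr
        return ptr

    def retrieve(self, ptr):
        """Return the managed object behind a stub pointer."""
        return self._obj_by_ptr[ptr]

    def dealloc(self, ptr):
        """Stand-in for the stub's dealloc: forget the association."""
        obj = self._obj_by_ptr.pop(ptr)
        del self._ptr_by_id[id(obj)]


proxy_map = ProxyMap()
managed = object()
stub = proxy_map.associate(managed)   # managed is now pinned by the map
proxy_map.dealloc(stub)               # refcount hit 0: eligible for GC again
```

Keying the reverse lookup on id() is safe here only because the map itself holds the object alive for as long as the entry exists.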

However, when dealing with a bridge object, we need to manage its
constituents' lifetimes very carefully. For reasons that will
hopefully soon become clear, we need to depend on the GC to initiate
the chain of events leading up to object destruction, so we can't let
our map strongly reference the IronPython part; instead, we have to
use a weak reference to it [3].
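
To see why a weak reference changes things, here is a small sketch using Python's own weakref module in place of the .NET equivalent. The names ManagedPart, weak_map, register and lookup are assumptions made for the example; the pointer value is fake.

```python
import gc
import weakref


class ManagedPart:
    """Stand-in for the IronPython half of a bridge object (hypothetical)."""


weak_map = {}  # simulated CPython pointer -> weak reference to managed part


def register(ptr, managed):
    # A weak reference lets us find the object without keeping it alive.
    weak_map[ptr] = weakref.ref(managed)


def lookup(ptr):
    """Return the managed part, or None if the collector already took it."""
    return weak_map[ptr]()


part = ManagedPart()
register(0x2000, part)
alive_before = lookup(0x2000) is part

del part        # drop the only strong reference
gc.collect()    # give the collector a chance to run
alive_after = lookup(0x2000) is not None
```

Because the map no longer pins the managed part, the garbage collector gets to decide when destruction starts, which is the property the scheme relies on.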

We can't stop the CPython reference-counting from working, but clearly
we can't afford to have the bridge object's CPython part deallocate
itself when the IronPython part is still alive: referencing freed
memory is not generally considered to be industry Best Practice.

So, the IronPython object IncRefs the CPython object as soon as it's
created, to ensure that the CPython object's refcount never actually
hits 0 again -- and hence that the normal mechanism for CPython object
destruction never gets triggered. Instead, at the point when the
refcount hits 1, I can be sure that no CPython references to the
bridge object still exist. Now, when the IronPython object gets
garbage-collected, it can safely call the CPython dealloc method and
release its unmanaged resources at the same time as the managed object
dies.
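
A rough model of that extra IncRef, again in plain Python rather than Ironclad's real C#: NativePart simulates the CPython object with an explicit refcount field, and weakref.finalize stands in for the managed object's finalizer. All names here are hypothetical.

```python
import gc
import weakref


class NativePart:
    """Simulated CPython object: a refcount and a 'freed' flag."""

    def __init__(self):
        self.refcount = 1    # the reference held by whoever created it
        self.freed = False


def release_native(native):
    # Runs when the managed part is collected. If the count is 1, the only
    # reference left is the one the managed part took, so freeing is safe.
    if native.refcount == 1:
        native.freed = True


class ManagedPart:
    """Simulated IronPython half of the bridge object."""

    def __init__(self, native):
        native.refcount += 1   # the extra IncRef taken at creation
        weakref.finalize(self, release_native, native)


native = NativePart()
managed = ManagedPart(native)
native.refcount -= 1    # creator lets go: the count stops at 1, never 0

del managed             # managed part becomes garbage...
gc.collect()            # ...and its finalizer releases the native part
```

The point of the sketch is that a count of 1 plays the role that 0 normally plays: it means "no native references remain", while deallocation itself waits for the collector.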

And that's fine, as far as it goes: it ensures that CPython objects
can't disappear from underneath their IronPython counterparts.
However, that's not the only failure scenario: the CPython part of the
bridge object depends upon the IronPython part as well [4], so I can't
afford to let wanton garbage-collections destroy the IronPython part
while the CPython part is still being kept alive by unmanaged
references.

So: whenever IronPython code calls into a CPython extension, we need
to translate every parameter to that function into a format
comprehensible to CPython. If the parameter lacked a CPython
representation beforehand, one is created with refcount 1; otherwise,
the existing representation is IncReffed. When the function returns,
each parameter is DecReffed (as is the return value, once its
IronPython representation has been created or retrieved).

However, the IncRefs and DecRefs described above do extra checks for
bridge objects whose refcounts are increasing to 2 or decreasing to 1.
Every time the refcount increases to 2, the IronPython object is added
to a managed set, and it's removed from it again whenever the refcount
drops to 1. This ensures that, if the CPython code grabs a reference
to a bridge object by IncReffing the CPython part, the refcount will
not drop as far as 1 when it's DecReffed on return. Therefore, the
IronPython part will stay in the strongRefs set and remain ineligible
for garbage collection, even if it falls out of scope everywhere else,
and so we can guarantee that it won't disappear while the CPython part
still needs it.
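
The 2-and-1 thresholds are easier to follow with a toy model. The sketch below assumes a simulated refcount field and hypothetical helper names (inc_ref, dec_ref, call_extension); only the strongRefs idea comes from the description above.

```python
class NativePart:
    """Simulated CPython half; starts with the managed part's own reference."""

    def __init__(self):
        self.refcount = 1


strong_refs = set()   # pins managed parts while native code holds references


def inc_ref(native, managed):
    native.refcount += 1
    if native.refcount == 2:          # first native-side reference appears
        strong_refs.add(managed)


def dec_ref(native, managed):
    native.refcount -= 1
    if native.refcount == 1:          # only the managed part's reference left
        strong_refs.discard(managed)


def call_extension(func, native, managed):
    """Model of one call from managed code into an extension function."""
    inc_ref(native, managed)          # argument translated for the call
    try:
        return func(native)
    finally:
        dec_ref(native, managed)      # released again on return


def borrows(native):
    pass                              # uses the argument, keeps nothing


def keeps(native):
    native.refcount += 1              # like a C macro: bypasses inc_ref


a_native, a_managed = NativePart(), object()
call_extension(borrows, a_native, a_managed)   # ends at 1, not pinned

b_native, b_managed = NativePart(), object()
call_extension(keeps, b_native, b_managed)     # ends at 2, stays pinned
```

In the second call the extension's own increment is invisible to inc_ref, yet the managed part still stays pinned, because the DecRef on return only takes the count from 3 to 2.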

Note

It's important to draw a distinction between the Ironclad IncRef / DecRef operations and the normal reference incrementing and decrementing done by extension modules. The former are fully under our control (of course); the latter are implemented as C macros and directly manipulate a field on objects. As we can't know when a C extension has updated the reference count on an object [5], we are subject to these contortions.

The only remaining drawback is that, once CPython has finished with
the bridge object, it may become an unreferenced zombie: the
strongRefs set keeps the IronPython part, which holds the last
reference to the CPython part, ineligible for GC.

So... I periodically loop [6] over the CPython parts of the bridge
objects, checking refcount. When I find a zombie, I just remove the
-head,-or-destroy-the-brain- IronPython object from the strongRefs
set, and allow GC to take its natural course... and that closes all
the loopholes I have thus far identified [7].
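
The zombie sweep might look something like the following sketch. The function name sweep_zombies and the native_of lookup table are assumptions for the example, and how often the sweep runs is deliberately left out.

```python
class NativePart:
    """Simulated CPython half of a bridge object."""

    def __init__(self, refcount):
        self.refcount = refcount


def sweep_zombies(strong_refs, native_of):
    """Unpin managed parts whose native refcount has fallen back to 1.

    strong_refs: set of managed parts currently pinned.
    native_of:   dict mapping each managed part to its native part.
    Returns the number of zombies released to the garbage collector.
    """
    zombies = [m for m in strong_refs if native_of[m].refcount == 1]
    for managed in zombies:
        strong_refs.discard(managed)
    return len(zombies)


in_use, finished = object(), object()
native_of = {in_use: NativePart(2), finished: NativePart(1)}
strong_refs = {in_use, finished}

released = sweep_zombies(strong_refs, native_of)
```

Collecting the zombies into a list first avoids mutating the set while iterating over it.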

I hope that was broadly comprehensible, and perhaps even interesting
or edifying; if I've been at all unclear, please comment, and I'll
answer your questions as best I can.

[2] Not precisely true: the two objects actually have no direct knowledge of each other. However, the details of precisely what happens are entirely tedious and irrelevant; generally, issues like this will be handwaved/ignored. You Have Been Warned.

[4] Or, at least, it might depend on it, and I can't generally be sure that it doesn't. For example, the CPython type may not define a __setattr__; if, then, I were to assign to random attributes on a bridge object's IronPython part, those attributes would have no CPython representation at all, and would be lost when the IronPython part was GCed, leading to Bad Things.

[5] We could if we modified the macros, but then extension modules would need recompiling - and currently Ironclad maintains binary compatibility with Python C extensions. Thanks to Adam Olsen for forcing William and me to make this clearer.

Book Review: Intellectual Property and Open Source

Of course the term "intellectual property" is itself controversial, but I prefer the more pragmatic approach to copyright reform of people like Lawrence Lessig.

Van Lindberg is an intellectual property lawyer, but also a member of the Python community and is strongly involved (amongst other things) in the organising of the PyCon Conference. But enough about him, what's his book like?

The book is subtitled "A Practical Guide to Protecting Code" and is intended to be a description of US intellectual property law as it stands and applies to software. It is not intended to be a defence of the current state of the law, nor conjecture about what it should be like. Interestingly though, in order to understand the state of the law, it starts with a history of IP law and the philosophies behind it. This is chapter 1: "The Economic and Legal Foundations of Intellectual Property".

Van is not just an IP lawyer, he is also a programmer, and the book is aimed at us poor beleaguered software developers. The writing is down to earth and peppered with programming analogies. It really is highly readable and enjoyable (although dense with information); no mean feat for a book on this subject.

I found the history of IP fascinating - and essential reading to understand why the law views intellectual property rights in some of the same ways as other property rights. In fact Van makes the interesting point that the concept of property is a legal construct, and as theft is defined as the violation of property rights, the RIAA (quite despicable though they are) are technically correct to describe copyright violation as theft.

Having covered the background, each of the next chapters tackles a major area of intellectual property law; chapters 2 & 3 are on patents. These chapters aren't restricted to the subject of software patents (and in fact spend relatively little time on them) but are more on the why and what of patents. Although this knowledge does prove useful in later chapters as you consider the full range of property rights associated with our craft, I couldn't help wanting to move on to the other, more directly relevant topics all through these chapters (although useful for completeness, I felt that some of the minutiae of the patent documents could have been skipped).

Having said that, Van does a good job of explaining the inherent contradiction (hypocrisy?) in the current US patent system. The ethos of the patent system is to encourage the sharing of knowledge by making it public whilst granting the inventors a temporary monopoly. At the same time it penalises creators if they check for pre-existing patents in their technology field. If you are known to have been aware of a patent, yet are found to be in violation of it, you are liable for much worse damages. As a creator it is important that you deliberately stay unaware of patents in your field!

Another interesting snippet is that if any of the inventors of a technology later patented are left off the patent document then other companies are free to license the technology from the unlisted inventor. Decisions about who to list and who not to list as an inventor in the patent filing may be subject to internal company politics - but they can be vitally important. All of these points are illustrated through the book by relevant case history.

The next chapters cover in turn copyright, trademarks and trade secrets. Most people have some vague impressions of the law around trademarks and copyright - and most of them are wrong! Even if you think you know what these terms mean, there is an enormous amount of useful and clearly presented information here. Issues of copyright (and even trademarks) are particularly relevant to open source projects. The chapter on trade secrets was also interesting to me, mainly because I don't think we have trade secrets as a legally protected category in the UK (I'm probably wrong - but they are rarely discussed as having legal ramifications). Frustratingly enough, although the chapter explains in great detail what trade secrets are, how the legal protection can be removed and so on, it skims over what the legal remedy is in cases where trade secrets are unlawfully abused. Property rights only have meaning in as much as there is a legal consequence to violating those rights - and the remedy available is this protection. All through the chapter I was wondering: what actually happens if the court finds that trade secrets have been misappropriated?

Finally in chapter seven we arrive at contracts and licenses. In some ways this is where the real meat of the book begins. The book has Open Source in the title, and one of the most tangled areas of confusion around open source (or free software or whatever you want to call it) is licensing. In order to understand open source licenses we are going to have to understand the legal validity of licensing intellectual property - and in particular the differences between licenses and contracts. (One of the main differences is the legal remedy available to you - so whether an open source license is a contract or a license can be an extremely important issue.)

One of the chapters on contracts ("So I Have an Idea...") is about the situation for those employed as programmers. Although less common in the UK, it is normal for employment contracts in the US to stipulate that not only are any projects or code you work on in your spare time the property of your employer - but even ideas you have whilst you work for them, even if you only create them later, may belong to them. This obviously has ramifications for contributions to open source projects. The solution is to communicate 'early and often' with your employer, and get their permission in writing (email is fine).

The chapters on open source licensing (including a full chapter on the GPL license!) are for me the most practical. They explain the issues around open source licensing, including how who owns the copyright (particularly for contributions - which make the resulting work a derivative work of the original) affects issues like relicensing. My default license for my projects has always been the BSD License. This is one of the more permissive 'academic' licenses, but makes no stipulations regarding contributions. After reading this book I'm considering switching new projects to the Apache License 2 which explicitly specifies that contributing a patch grants a license for redistribution under the Apache license. This simplifies licensing and redistribution issues.

The final chapters cover the legal status of reverse engineering (surprisingly protected if you do it right - but complicated by the DMCA) and incorporating as a not for profit organisation (if this is relevant to your open source project then not only have you really made it but this chapter will be very useful).

So in summary, an excellent book. Van Lindberg has done an outstanding job of navigating a dry and complex subject in an engaging and precise manner. If you're a programmer, or involved in open source projects, you need to read it - thankfully you'll enjoy it.

Python 3 and Encodings (again)

The full situation is a bit more complex, as you may need to be aware of encodings anywhere you have a text to bytes conversion (or vice-versa). This situation in essence is no different than Python 2, but some of the mechanisms have changed and if you only dealt with byte-strings in Python 2 you just may not have been aware of the issues.

In Python 3, sys.getdefaultencoding() is "utf-8" on all platforms, just as it was "ascii" in 2.x, on all platforms. The default encoding isn't used for I/O; check f.encoding to find out what encoding is used to read the file you are reading.
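
A quick way to check both claims on your own machine is sketched below. The temporary file and its name are made up for the example; the implicit encoding it reports depends on your platform and Python version.

```python
import os
import sys
import tempfile

# The default str <-> bytes encoding: "utf-8" in Python 3.
default_encoding = sys.getdefaultencoding()

path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Without an encoding argument, open() picks one for you; ask the file
# object which one it chose rather than assuming the default encoding.
with open(path, "w") as f:
    implicit_encoding = f.encoding

# Passing encoding explicitly removes the guesswork.
with open(path, "w", encoding="latin-1") as f:
    explicit_encoding = f.encoding

print(default_encoding, implicit_encoding, explicit_encoding)
```

Passing encoding explicitly whenever you open a text file is the simplest way to keep behaviour the same across platforms.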

Notice that the determination of the specific encoding used is fairly elaborate:

if IO is to a terminal, Python tries to determine the encoding of
the terminal. This is mostly relevant for Windows (which uses,
by default, the "OEM code page" in the terminal).

if IO is to a file, Python tries to guess the "common" encoding
for the system. On Unix, it queries the locale, and falls back
to "ascii" if no locale is set. On Windows, it uses the "ANSI
code page". On OSX, it uses the "system encoding".

if IO is binary, (clearly) no encoding is used. Network IO is
always binary.

for file names, yet different algorithms apply. On Windows, it
uses the Unicode API, so no need for an encoding. On Unix, it
(again) uses the locale encoding. On OSX, it uses UTF-8
(just to be clear: this applies to the first argument of open(),
not to the resulting file object).
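
The standard library exposes each of these choices, so you can inspect them directly. The sketch below only reports what the running interpreter says; the dictionary keys are my own labels, and current Python releases may well report different values from those described above.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for each situation."""
    return {
        "default (str<->bytes)": sys.getdefaultencoding(),
        # stdout may have been replaced by an object without .encoding
        "terminal (stdout)": getattr(sys.stdout, "encoding", None),
        "text files (locale)": locale.getpreferredencoding(False),
        "file names": sys.getfilesystemencoding(),
    }


report = describe_encodings()
for purpose, encoding in report.items():
    print(f"{purpose}: {encoding}")
```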

Note that these aren't the only encoding issues that remain. Normally when you perform operations like listing the filenames in a directory, or looking at command line arguments, you want them as strings. On some platforms these will originally be bytes; what happens if they are undecodable?

This issue has caused much debate. Particularly on Linux [1] it is not even uncommon for a filesystem to have filenames in inconsistent encodings (undecodable filenames) and it was considered unacceptable for a call to os.listdir() to raise an exception in these circumstances (applications would just stop working). The answer (issue 3187) is that operations like os.listdir() honour the type of the argument you pass in. If you pass in a string then you get a list of strings (possibly with undecodable filenames missing). If you pass in bytes then you will get a list of bytes back, guaranteed to be complete but you will need to do the decoding yourself.
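
The "type in, same type out" rule is easy to demonstrate. This sketch uses a temporary directory with one made-up, plain ASCII file name, so it only shows the type rule; it does not exercise undecodable names, whose handling changed in later Python 3 releases.

```python
import os
import tempfile

directory = tempfile.mkdtemp()
open(os.path.join(directory, "plain.txt"), "w").close()

# A str argument gives back a list of str...
as_text = os.listdir(directory)

# ...while a bytes argument gives back a list of bytes, which you can
# decode yourself with whatever encoding and error handling you choose.
as_bytes = os.listdir(os.fsencode(directory))
decoded = [name.decode("utf-8", errors="replace") for name in as_bytes]
```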

As for undecodable command line arguments, it's a bit hard to follow (issue 3023 which is still open) but I think the decision is that Python 3 will refuse to handle them.

Python in the Browser on Linux (with Moonlight)

Programming the browser with Python is fun, and Silverlight makes it easy, but Silverlight suffers from only being available on the Mac and Windows.

The Mono team have been working on the Linux port of Silverlight, called Moonlight. Moonlight 1.0 is already available and includes media codecs supplied by Microsoft to support all the media streaming capabilities of Silverlight. President Obama's inauguration was streamed over Silverlight, and each page carried a link to Moonlight with the caption "Linux-compatible Silverlight Player".

The really interesting version of Silverlight is Silverlight 2, which includes a cut down and sandboxed version of the .NET runtime called the CoreCLR. Now that Moonlight 1.0 is complete, work on Moonlight 2.0 is well underway. JB Evain is on the Mono team, and has demonstrated the capabilities of the Moonlight 2.0 port by running some IronPython examples: