Thursday, 7 August 2008

Introducing CapPython

Python does not provide encapsulation. That is, it does not enforce any difference between the private and public parts of an object. All attributes of an object are public from the language's point of view. Even functions are not encapsulated: you can access the internals of a function through attributes such as func_closure and func_globals.
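To make the point about functions concrete, here is a minimal sketch of reaching into a closure from outside. It uses the Python 3 spelling __closure__ for the attribute the text calls func_closure; make_counter and its variables are hypothetical names chosen only for illustration.

```python
def make_counter():
    # `count` is meant to be reachable only through `increment`.
    count = [0]

    def increment():
        count[0] += 1
        return count[0]

    return increment


counter = make_counter()
first_value = counter()

# Nothing stops outside code from opening the closure cell.
# (__closure__ is the Python 3 name; Python 2 also called it func_closure.)
cell = counter.__closure__[0]
cell.cell_contents[0] = 100

next_value = counter()
```

The caller never received a reference to count, yet it was able to overwrite the counter's state through the function object alone.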

However, Python has a convention for private attributes of objects which is widely used. It's written down in PEP 0008 (from 2001). Attributes that start with an underscore are private. (Actually PEP 0008 uses the term "non-public" but let's put that aside for now.)

CapPython proposes enforcing this convention by defining a subset of Python in which it cannot be violated. The hope is that this subset could be an object-capability language, designed so that you get encapsulation by default and can still write fairly idiomatic Python code.

The core idea is that private attributes may only be accessed through "self" variables. (We have to expand the definition of "private attribute" to include attributes starting with "func_" and some other prefixes that are used for Python built-in objects.)

As an example, suppose we want to implement a read-only wrapper around dictionary objects:
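A minimal sketch of such a wrapper follows. The class name FrozenDict is the one used later in the post; the choice of methods (get and keys) is an assumption made for illustration.

```python
class FrozenDict(object):
    """Read-only wrapper around a dictionary (illustrative sketch)."""

    def __init__(self, dictionary):
        # Private attribute: it is only ever reached through `self`.
        self._dict = dictionary

    def get(self, key):
        return self._dict.get(key)

    def keys(self):
        # Return a fresh list so callers do not get a live view of _dict.
        return list(self._dict.keys())


d = FrozenDict({"a": 1})
value = d.get("a")
names = d.keys()
```

Every access to _dict goes through the first argument of a method, so this class fits the rule stated above. Code that holds d can call get and keys, but under CapPython it would not be allowed to write d._dict.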

A self variable is a variable that is the first argument of a method function. A method function is a function defined on a class (with some restrictions to prevent method functions from escaping and being used in ways that would break encapsulation).

We also have to disallow all assignments to attributes (both public and private) except through "self". This is a harsher restriction, but without it a recipient of a FrozenDict could modify the object:
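The following sketch shows the kind of tampering that plain Python permits. The function name tamper is hypothetical, and the FrozenDict class is repeated in a reduced form so that the block runs on its own.

```python
class FrozenDict(object):
    def __init__(self, dictionary):
        self._dict = dictionary

    def get(self, key):
        return self._dict.get(key)


def tamper(frozen):
    # `frozen` is not a self variable, so the rule described above
    # would reject both assignments. Plain Python accepts them.
    frozen.get = lambda key: "tampered"  # public attribute
    frozen._dict = {}                    # private attribute


d = FrozenDict({"a": 1})
before = d.get("a")
tamper(d)
after = d.get("a")
```

Note that the first assignment targets a public name. Replacing get on the instance is enough to change what every later holder of d sees, which is why the restriction has to cover public attributes as well.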

This scheme has some nice properties. As with lambda-style object definitions in E, encapsulation is enforced statically. No type checking is required; it's just a syntactic check. No run-time checks need to be added.
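To illustrate what "just a syntactic check" can look like, here is a toy checker built on the standard ast module. It is not the CapPython verifier: the function name find_violations and the sample source are assumptions, and it ignores many cases a real verifier must handle (rebinding of the self variable, lambdas, method functions escaping from the class, and so on). It treats nested functions conservatively by giving them no self variable.

```python
import ast

PRIVATE_PREFIXES = ("_", "func_")


def find_violations(source):
    """Return (line, attribute) pairs that break the self-variable rule."""
    violations = []

    def visit(node, self_name, in_class_body):
        if isinstance(node, ast.ClassDef):
            for child in ast.iter_child_nodes(node):
                visit(child, None, child in node.body)
            return
        if isinstance(node, ast.FunctionDef):
            params = node.args.args
            # Only a function defined directly in a class body has a self variable.
            new_self = params[0].arg if (in_class_body and params) else None
            for child in ast.iter_child_nodes(node):
                visit(child, new_self if child in node.body else None, False)
            return
        if isinstance(node, ast.Attribute):
            through_self = (isinstance(node.value, ast.Name)
                            and node.value.id == self_name)
            private = node.attr.startswith(PRIVATE_PREFIXES)
            assigned = isinstance(node.ctx, (ast.Store, ast.Del))
            if (private or assigned) and not through_self:
                violations.append((node.lineno, node.attr))
        for child in ast.iter_child_nodes(node):
            visit(child, self_name, False)

    visit(ast.parse(source), None, False)
    return violations


SAMPLE = '''
class FrozenDict(object):
    def __init__(self, dictionary):
        self._dict = dictionary
    def get(self, key):
        return self._dict.get(key)

def tamper(frozen):
    frozen.get = None
    return frozen._dict
'''

violations = find_violations(SAMPLE)
```

The checker only walks the syntax tree. It never runs the code and needs no type information, which is the property the paragraph above describes.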

Furthermore, instance objects do not need to take any special steps to defend themselves; they are encapsulated by default. We don't need to wrap all objects to hide their private attributes (which is the approach that some attempts at a safer Python have taken). Class definitions do not need to inherit from some special base class. This means that TCB objects can be written in normal Python and passed into CapPython safely; they are defended by default from CapPython code.

However, class objects are not encapsulated by default. A class object has at least two roles: it acts as a constructor function, and it can be used to derive new classes. The new classes can access their instance objects' private attributes (which are really "protected" attributes in Java terminology, and this is one reason why PEP 0008 does not use the word "private"). So you might want to make a class "final", that is, not inheritable. One way to do that is to wrap the class so that the constructor is available, but the class itself is not:
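A minimal sketch of that wrapping, using the names make_frozen_dict and FrozenDict from the post. The function's signature is an assumption, and the class is repeated in a reduced form so that the block runs on its own.

```python
class FrozenDict(object):
    def __init__(self, dictionary):
        self._dict = dictionary

    def get(self, key):
        return self._dict.get(key)


def make_frozen_dict(dictionary):
    # Callers can construct instances through this function, but they are
    # never handed the class object, so they cannot derive from it.
    return FrozenDict(dictionary)


d = make_frozen_dict({"a": 1})
value = d.get("a")
```

In plain Python the class is still reachable as d.__class__. This sketch assumes the verifier rejects that expression, since __class__ starts with an underscore and d is not a self variable. Builtins such as type would presumably need restricting as well.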

The function make_frozen_dict is what you would export to other modules, while FrozenDict would be closely-held.

Maybe this wrapping should be done by default so that the class is encapsulated by default, but it's not yet clear how best to do so, or how the default would be overridden.

I have started writing a static verifier for CapPython. The code is on Launchpad. It is not yet complete. It does not yet block access to Python's builtin functions such as open, and it does not yet deal with Python's module system.

Just out of curiosity: Why are you building this (that is meant in the least aggressive/disparaging way possible, I forget the emoticon ;)

Is it that you come from another language that has strictly enforced private variables, and wish for this capability in python too, or are you building something that absolutely needs them? Again, I'm not saying there's no use case for this, just wondering what yours is...

I've trained a number of people on python and those coming from languages like C++, Java, C# struggle with the idea of not having private, protected etc (almost as much as not having 'type safety'). I can tell you though, in practice, working with hundreds of thousands of lines of production python code, multiple authors, over the past 8+ years... it's just not that big a deal. Use the naming conventions and move on.

Commenters asking about protecting private variables are missing the point -- this isn't about executing trusted code and just assuming it doesn't do anything nasty. This is about executing untrusted code.

For example, a website could allow users to write custom Python code which runs on the server and customizes their pages on the site.

How does encapsulation buy you sandboxing? Am I missing something? I don't see the logical connection between sandboxed/jailed code (with disabled file/os/socket libraries) and having _private members...

Is it just a prerequisite for the downstream task of locking out the dangerous calls?

I had a similar project involving both the verification of pure functions and a bytecode transformer that enforces the behavior. I see similar goals and possible cross pollination. I've been looking to revive the project and release it.