Python Notes

Friday, November 26, 2004

A simple protocol for data synchronization between objects

What follows can't yet be considered a definitive version of this idea. It's still a prototype of an idea that I have been working on in a rather fuzzy way, to solve a practical problem that arose as I started to use templates and a declarative-style approach to model business entities. Consider it an attempt to write it down to clarify things for myself; if you find it useful, or at least amusing, please let me know.

As a business application grows, there's an increasing need to manage internal data communication between different objects. The temptation is to share live data, in such a way that all applicable objects are automatically notified of any change. However, this approach raises a number of issues with concurrency, threading, and security in general. It's a very complex and demanding problem.

A different approach is to share static copies of data using a simple synchronization protocol. Each object stores its own copy of the data. At certain points in the system, data is explicitly copied between the objects. This approach certainly isn't as charming as having live links; however, it leads to a much simpler and more predictable design. It's one case where practicality beats purity.

While designing my new applications, I gave a lot of thought to simple templating objects. The current implementation allows nested templates to be declared, and it's useful for modelling a variety of entities, ranging from simple data records to data entry forms, reports, and validation procedures. These templates can be thought of as different views of the same data; each one has its own internal structure, appropriate to the problem at hand. For example, simple data records tend to be flat models. Other structures, such as forms or reports, are usually mapped to nested templates. Nesting is used in this case to replicate the structure of the entities being modelled. A similar declarative mechanism is also used by SQLObject to declare database entities.

At this point, the question is: how can I synchronize the state of data between objects which do share the same attributes, but have different structures? For example, let's assume that I have these three templates, one for a database entity (using SQLObject), another for a data entry form, and the last one for reports.

These templates are really simple, and provide a good showcase for the data synchronization protocol. Each one has a different structure that reflects the actual requirements of each object. However, all templates refer to the same basic data, which is declared in the DataTemplate object.

One solution to the problem is to provide the UserForm and the UserReport templates with hooks to receive data from the DataTemplate. However, in the spirit of dynamic languages such as Python, this is not really necessary, and in fact takes some flexibility away from the result. It's much easier to use a common protocol, so that all objects that use a similar template structure can talk to each other.

The protocol involves two types of objects:

The intermediate data record is a simple dict-like object that stores a flat representation of the data. Any object that exposes a simple mapping interface can be used.

The getdata and setdata methods allow templates to retrieve and set internal data, respectively, using an intermediate data record.

# creates a form initialized with the same data
form1 = UserForm(adminuser)

# the UserData itself also exposes the mapping
form2 = UserForm(user)

# the report receives an iterable that yields objects with a compatible mapping
# interface. dbUser is a SQLObject class with appropriately named fields
# (assuming that sqlobject is patched to implement the mapping interface)
report = UserReport(dbUser.select())
report.generateReport()

All communication is done internally using the getdata and the setdata methods. For nested templates (such as the UserForm and the UserReport), the representation is automatically flattened, allowing the data to be set directly to the inner members of the template.
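The flattening step can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual implementation: the Template base class, the Address group, and the field names are all hypothetical, and a nested template is simply any attribute that is itself a Template.

```python
class Template(object):
    """Hypothetical base class: fields are plain attributes, and any
    attribute that is itself a Template is treated as a nested group."""

    def getdata(self):
        # Flatten the nested structure into a single flat record (a dict).
        record = {}
        for name, value in vars(self).items():
            if isinstance(value, Template):
                record.update(value.getdata())
            else:
                record[name] = value
        return record

    def setdata(self, record):
        # Copy matching names from the record into this template,
        # descending into nested templates; unknown names are ignored.
        for name, value in list(vars(self).items()):
            if isinstance(value, Template):
                value.setdata(record)
            elif name in record:
                setattr(self, name, record[name])


class Address(Template):
    def __init__(self):
        self.city = ''
        self.country = ''


class UserForm(Template):
    def __init__(self, record=None):
        self.name = ''
        self.address = Address()   # nested group (hypothetical)
        if record is not None:
            self.setdata(record)


# Any mapping can act as the intermediate data record.
form = UserForm({'name': 'admin', 'city': 'Lisbon'})
flat = form.getdata()
```

Because the record is flat, 'city' reaches the nested Address group without the caller knowing where it lives in the form.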

The intermediate data record can optionally offer much richer functionality. It can check types, or do a limited validation of the data (possibly limited to a sanity check). Dynamic links between templates can be supported by using the observer pattern in the data record implementation.

The limitations of this design are simple, and relatively obvious. First of all, all names must be unique. For some structures this may be a problem. One possible alternative is to allow the ids used for complex structures to be specified explicitly. One such example is INI files, which are usually nested, but are the primary source of data. The flattened version should still have unique names for all its members.

Another interesting design issue is the one of nested data records. Data records are flattened to simplify the design. But in many cases there is a one-to-many relationship that needs to be mapped; for example, a form with subitems, or a report with nested subgroups. I think that it's possible to extend the design for this situation while still keeping it simple and clean.

Tuesday, November 23, 2004

References on workflow modelling

There are a number of resources about workflow on the Internet. Finding the good ones is not that easy, though. First of all, the search term is too generic (Google found nearly 8 million results for it alone). There are a number of commercial offerings with little or no info on their web pages, besides mentioning "workflow" as one of the features. The academic sources are, for the most part, references to subscription-only journals. There are also a number of organizations and coalitions that deal with workflow. The problem, in this case, is to get a good first grasp of the terminology being used, which in turn makes it possible to find good references applicable to an open-source project.

After some research, the first practical results pointed in the direction of Petri Nets. Petri Nets are a mathematical model used to describe processes. A good number of the current tools use Petri Nets as the underlying model. There's a good article on a Petri Net implementation for PHP that in turn led me to search for more information on the subject. Another interesting project is Bossa, a low-level Java workflow library.
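To give a feel for the model, here is a minimal sketch of a Petri Net in Python. It is not taken from any of the tools mentioned; the class and the workflow names are made up for illustration, and it assumes every arc has weight 1 (a place appears at most once per transition). Places hold tokens, and a transition may fire only when each of its input places holds at least one token.

```python
class PetriNet(object):
    def __init__(self, marking):
        # marking: place name -> number of tokens currently in that place
        self.marking = dict(marking)
        self.transitions = {}

    def add_transition(self, name, inputs, outputs):
        # inputs/outputs: place names the transition consumes from / produces to
        self.transitions[name] = (list(inputs), list(outputs))

    def enabled(self, name):
        # A transition is enabled when every input place has a token.
        inputs, _ = self.transitions[name]
        return all(self.marking.get(place, 0) > 0 for place in inputs)

    def fire(self, name):
        if not self.enabled(name):
            raise ValueError('transition %s is not enabled' % name)
        inputs, outputs = self.transitions[name]
        for place in inputs:
            self.marking[place] -= 1
        for place in outputs:
            self.marking[place] = self.marking.get(place, 0) + 1


# A two-step approval workflow (hypothetical example).
net = PetriNet({'submitted': 1})
net.add_transition('review', ['submitted'], ['reviewed'])
net.add_transition('approve', ['reviewed'], ['approved'])
net.fire('review')
net.fire('approve')
```

After both transitions fire, the single token has moved from 'submitted' to 'approved', and 'review' is no longer enabled.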

The best theoretical references that I could find are the ones from Wil van der Aalst, a researcher from the Netherlands. His website has a number of good links, including some excellent lectures on the subject. Although the lectures assume that the reader is following the book by the same author, they do provide an interesting way to learn about Petri Nets and their applicability to workflow process modelling.

The study of Petri Nets raises several interesting issues, some of which I hoped I could leave for a future version of the workflow engine. The original draft of the system was based on my own experience with workflow systems, and largely on my own intuition on how such a system should be designed. Reading the material I could see that, while I got it right in some instances, there were also some issues that I really wasn't aware of. The model is now being revised to take such issues into account. It's nice to be able to apply a solid theoretical basis to practical work: it gives some assurance of being on the right track.

Wednesday, November 17, 2004

What is adaptation?

This document was originally posted as a reply to a c.l.py post asking for more information on adaptation vs type checking for Python. It was well received, and I thought it deserved a spot here. This is a revised edition, with more information and some clarifications.

Adaptation is the act of taking one object and making it conform to a given protocol (or interface). Adaptation is the key to making dynamic code that takes parameters of arbitrary types work in a safe, well-behaved way.

The basic concept underlying adaptation is the protocol, also called interface in some implementations. For all purposes of this discussion, and for simplicity reasons, we can safely assume that protocols and interfaces are equivalent (more on this later).

A protocol defines how an object should behave in a given situation. It defines both a set of primitives that must be supported by the object, and its expected behavior -- how it is supposed to work, and how it should be used in a real-case scenario. For example, the iterator protocol defines the following primitives: __iter__() and next() (see the typeiter.html documentation). The documentation also tells what the primitives do; for example, next() returns the next element of the iterator, and raises a StopIteration exception when it finishes. Any object from any class that implements these methods with the expected behavior, regardless of anything else (other methods it supports, or its ancestors), is said to support the iterator protocol.
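As a small illustration, the class below supports the iterator protocol without inheriting from any special base class. The Countdown name is made up for the example. Note that Python 3 spells the second primitive __next__(); the alias at the end keeps the Python 2 spelling, next(), working as well.

```python
class Countdown(object):
    """Counts down from start to 1; no special base class involved."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration   # signals that the iteration is finished
        value = self.current
        self.current -= 1
        return value

    next = __next__   # Python 2 spelling of the same primitive


# Works anywhere an iterable is accepted, e.g. a list comprehension.
values = [n for n in Countdown(3)]
```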

Any object that supports the iterator protocol can be used whenever an iterable is acceptable. This includes for loops and list comprehensions. The biggest advantage of adaptation comes when one realizes how flexible this design is, especially when compared with old-style type checking. In an old-style strict type checking environment (such as C++), parameters to a given routine must conform to the declared type of the arguments. For iterators, it would mean that only objects descending from a standard base class (let's say, "Iterable") would be accepted. Complex objects have to support multiple protocols, though. Multiple inheritance can come to the rescue, but the final design becomes complex and inflexible.

Now, back to the Python world. To support a protocol, all you need to do is implement it. Although one can still use multiple inheritance to declare new classes with multiple protocols, this is not needed. In most cases, the resulting object can be used directly whenever support for the protocol is required, with no need for adaptation, and without concern about strict type checking.

But there are situations when the object itself can't be immediately used; it has to be adapted, or prepared, to support the protocol. The adapt() call implements all the necessary magic to check whether the object supports a protocol, and to make the necessary adaptations (if any), returning a conformant object. The adaptation will fail if the object does not support the protocol; this is an error that adapt() can detect, in an approach that is superficially similar to type checking but fundamentally different from it.

The adapt protocol (as presented on PEP246) defines a very flexible framework to adapt one object to a protocol. It tries a number of alternatives for adaptation; for example, the object may adapt itself to the protocol, or a registered adapter function may be used. The result of the adaptation (if possible at all) is an object that is guaranteed to support the protocol. So, using adapt(), we can write code like this:

def myfunc(obj):
    for item in adapt(obj, Iterable):
        ...

Of course, this is a simple example, but it is useful to understand the basic mechanism. After PEP246 was published, other alternative implementations were published. The PyProtocols package somewhat extends the concept.
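To make the mechanism a bit more concrete, here is a heavily simplified sketch of what an adapt() function might look like. It is not the PEP 246 reference implementation nor the PyProtocols API: the registry, register_adapter(), AdaptationError, and the Iterable marker class are all assumptions made for this example. Only the __conform__ and __adapt__ hook names come from PEP 246.

```python
class AdaptationError(TypeError):
    """Raised when no adaptation is possible (hypothetical name)."""


_adapters = {}   # (class, protocol) -> adapter function; hypothetical registry


def register_adapter(cls, protocol, adapter):
    _adapters[(cls, protocol)] = adapter


def adapt(obj, protocol):
    # 1. The object already is an instance of the protocol class.
    if isinstance(protocol, type) and isinstance(obj, protocol):
        return obj
    # 2. The object knows how to adapt itself.
    conform = getattr(obj, '__conform__', None)
    if conform is not None:
        result = conform(protocol)
        if result is not None:
            return result
    # 3. The protocol knows how to adapt the object.
    adapt_hook = getattr(protocol, '__adapt__', None)
    if adapt_hook is not None:
        result = adapt_hook(obj)
        if result is not None:
            return result
    # 4. A registered adapter exists for the object's class or an ancestor.
    for cls in type(obj).__mro__:
        adapter = _adapters.get((cls, protocol))
        if adapter is not None:
            return adapter(obj)
    raise AdaptationError('cannot adapt %r to %r' % (obj, protocol))


class Iterable(object):
    """Marker class standing in for the iterable protocol."""


class Matrix(object):
    def __init__(self, rows):
        self.rows = rows


# Lists adapt trivially; a Matrix is adapted by walking its cells row by row.
register_adapter(list, Iterable, iter)
register_adapter(Matrix, Iterable,
                 lambda m: iter([cell for row in m.rows for cell in row]))


def myfunc(obj):
    return [item for item in adapt(obj, Iterable)]


cells = myfunc(Matrix([[1, 2], [3, 4]]))
```

The caller never checks the type of its argument: a Matrix, a list, or any other registered class works, while something like adapt(42, Iterable) fails early with AdaptationError.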

Finally, one may be wondering: is there any situation when an object really needs to be adapted? Why not just check for the availability of the interface? There are many reasons to use the adapt framework. Protocol checking is just one of them -- it allows errors to be caught much earlier, and at a better location. Another possible reason is that complex objects may support several protocols, and there may be name clashes between some of the methods. One such situation is when an object supports different *versions* of the same protocol. All versions have the same method names, but semantics may differ slightly. The adapt() call can build a new object with the correct method names and signatures, for each protocol or version supported by the object. Finally, the adaptation method can optionally build an opaque "proxy" object that hides details of the original method signatures, and is thus safer to pass around.

Adaptation shines when used with complex frameworks. Each framework defines lots of protocols, and there are often discrepancies (or mismatches) between them, so an adapter in between is required. The adaptation system (as implemented by PyProtocols) supports a global registry of adapter functions. Using adapt() at convenient locations, it's possible to mix and match objects provided by different frameworks, with no need to worry about compatibility issues. The work needs to be done just once, in the adapter; once it is registered, the adapt() calls will take care of all necessary work.

Using adaptation effectively requires discipline. It's too easy to get lazy and forget to include adapt() calls at the required locations. But the advantages are immense, for the adaptation system preserves Python's dynamic aspects while adding still more flexibility to the package. It's a great addition to an already great language.

Closing remarks

Protocols and interfaces are similar concepts, but not equivalent. An interface is just the set of methods and their individual semantics; it does not define the "how to use" part as a protocol does. However, for all practical purposes, the concepts converge, because it does not make much sense to stick with the strict static protocol definition for long.

This document was started as my attempt to contribute back to the Python community something which I have learned while reading c.l.py and working with Python. I also have to thank Alex Martelli for his comments and clarifications on my original post, as well as all the others who have helped me (knowingly or not) over the past few months.

Tuesday, November 16, 2004

A dynamic class repository

As part of my workflow project, I came across an interesting problem. Transition entities are modelled as classes derived from a base ActionDef class, because each action in the system needs to have its own custom code. As the system grew, I needed to design a repository for such classes.

The easy way to implement the repository is to put everything inside a package. It's reasonably scalable; a package can easily hold a lot of classes. But at some point, the package will become harder to maintain. There are other problems also. The application is designed to work as a long-running server. Shutting down the server to add new classes should not be required; instead, new classes should be added or redefined dynamically.

At this point, I started to contemplate how to design a dynamic Python class repository, in a safe and clean way. One simple way is to use a database to store the code for the class. At first, I was concerned that this would be a huge security problem; but then I realized that no matter what I did, code would have to be retrieved from the repository anyway, and a database would be no worse by itself than any other alternative. But there are some real issues with databases; the biggest problem is that it's not convenient to maintain, and also, not compatible with some standard code development practices, such as versioning repositories (I'm using Subversion).

After giving up on the database idea, I decided to keep it really simple, and use a standard file-based class repository. A class factory function gets the name of the desired class and looks for it in the repository. It loads the source file that has the same name as the class, and returns an instance of the class for the application to work with. The relevant code goes like this:

import os
from inspect import isclass

def makeObjFromName(self, name, expectedClass):
    '''
    Creates a new instance of a class using the process library.

    Searches the pathlist for a file. If the file exists, executes
    its contents. It is assumed that the source file contains a
    class with the same name as the source file (without the .py
    extension, of course). For example:

    file name = pdCreateNewUser.py -> class name = pdCreateNewUser

    If the file does not exist, or if there is no symbol with the
    correct name inside the file, it raises a NameError.

    If the file exists and the symbol is defined, it checks whether
    the symbol is bound to a subclass of expectedClass. If not,
    it raises a TypeError. If the class is correct, the object is
    automatically instantiated and returned.
    '''
    for path in self.pathlist:
        filename = os.path.join(path, name + '.py')
        print("** trying " + filename)
        if not os.path.isfile(filename):
            continue
        # run the source in a private namespace instead of relying on locals()
        namespace = {}
        with open(filename) as source:
            exec(compile(source.read(), filename, 'exec'), namespace)
        print("** executed!")
        obj = namespace.get(name)
        if obj is None:
            # file found, but it does not define the expected symbol
            break
        if isclass(obj) and issubclass(obj, expectedClass):
            print("** found!")
            return obj()
        raise TypeError('Object %s is not from the expected class %s'
                        % (name, expectedClass))
    raise NameError('Could not find class %s in the library' % name)

The file-based repository is simple, and also more convenient to work with than the database. But it is still far from an ideal solution, and I'm still looking for improvements. One of the ideas is to include native support for Subversion in the repository code itself, using the Python Subversion bindings. Instead of a file read, I could check out the latest version from the repository whenever required; it may be slow, though, so it's something to be carefully tested.

Another issue is that class references are passed as strings, instead of pure Python symbols. This is needed because all references have to be resolved at runtime by reading the source code on the fly. Also, as each class is defined in its own source file, there is no way for one class to refer to any other. The resulting code does not read as nicely as it could. For example, the following code snippet shows a task definition class that holds a reference to a process definition:

In the first case, the idea is to use the __getattribute__ magic function to capture attribute access to the ProcessLibrary object. It's tricky, and perhaps a little bit dangerous, because __getattribute__ imposes some limitations on what can be done. It allows the code to look as a pure Python attribute access, which improves readability; but on the other hand, it hides too much magic from the users, which may not be the best thing to do.
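The attribute-access idea could be sketched roughly like this. It is a sketch under several assumptions, not the code from the workflow project: it uses __getattr__ (which is only called when normal lookup fails) rather than __getattribute__, precisely to avoid some of the limitations mentioned above; the constructor signature is made up; and ActionDef is injected into the namespace of each loaded file so the file can subclass it without an import.

```python
import os
import tempfile


class ActionDef(object):
    """Stand-in for the base class mentioned in the post."""


class ProcessLibrary(object):
    def __init__(self, pathlist, expected_class):
        self.pathlist = pathlist
        self.expected_class = expected_class

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so pathlist
        # and expected_class are still found the usual way.
        for path in self.pathlist:
            filename = os.path.join(path, name + '.py')
            if os.path.isfile(filename):
                namespace = {'ActionDef': ActionDef}   # assumption: injected base
                with open(filename) as source:
                    exec(compile(source.read(), filename, 'exec'), namespace)
                cls = namespace.get(name)
                if isinstance(cls, type) and issubclass(cls, self.expected_class):
                    return cls
        raise AttributeError(name)


# Build a throwaway repository with one class file in it.
repo = tempfile.mkdtemp()
with open(os.path.join(repo, 'pdCreateNewUser.py'), 'w') as f:
    f.write('class pdCreateNewUser(ActionDef):\n    pass\n')

library = ProcessLibrary([repo], ActionDef)
process_class = library.pdCreateNewUser   # reads like a plain attribute
```

The last line is the whole point: the reference looks like ordinary Python, but the class is actually loaded from the repository at the moment of access.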

In both cases, to avoid problems with circular references, it may be necessary to move the code from the class body to the __init__ code. There are also other projects that suffer from the same problem (referencing class names with strings); SQLObject and ctypes come to mind. Finding a good and generic solution for the workflow library may also be helpful for these projects.

Thursday, November 11, 2004

News on my Python development environment setup

It's been about two months since I started rebuilding my development environment using only free tools. For a long time, I was stuck in the search for a good IDE. It took me a while to figure out that what I was missing was not the IDE, but a set of good project management tools. Once I realized it, I could find several good code editors that supported my basic requirements: the ability to create projects, to group related files; the ability to store session information, so I can resume working where I left off; and the ability to quickly navigate all the files in the project.

Now, I'm going a step further. I'm moving all my development files into Subversion, an open-source source code control system. It's not a new concept to me -- I have used CVS and commercial tools before -- but for several reasons, I wasn't willing to do it before, basically because I didn't have a stable environment. My home machine is still a low-performance Win98 box, and my dev machine was a temporary one. Now that my development machine is (hopefully) a permanent one, I have installed Linux, and I feel better about setting up a true development environment.

Up to this point, the installation has been surprisingly easy. The toughest part was to define the repository structure: the question is, at which level do I want to control things? The easiest way is to put everything in the same repository, but this may lead to problems in the future. Anyway, it's not something that I can't change later. I'm going with the single repository for now, and each project is a directory inside it. If some project grows enough to deserve its own repository, so be it.

Another related thing is the use of a test-driven methodology, which I'm already committed to using from now on. I think that the source code control system will make it easier to implement. Also, the fact that I had to make some decisions on repository structure means that I could restructure the way I store the source code. I have split some projects, and created the tests directories. Let us see where this experience leads me...

Alternative database systems

There was a time when a database meant a flat-file, fixed-record repository. Indexes were added later, bringing better performance for several tasks. During the sixties, hierarchical database systems were developed, allowing complex real-life structures to be modelled better. Even today, old-style mainframe systems (such as IBM's IMS) are still in production, managing huge databases. SQL was only invented in the seventies, based on a mathematical formalization of high-level data manipulation algorithms. Batch processing systems read and process data in a sequential fashion, and normally do not need such abstractions. But the new generation of interactive systems really needed them. And when PC-based client-server computing exploded in the 90's, the reign of SQL began.

For those who develop conventional business applications, it currently seems like SQL is the definitive database system. Although SQL has several strong points, its current near monopoly can probably be explained by academic indoctrination: almost every CS graduate in the past fifteen years was told that SQL is good, and that the rest is bad. Part of it may be because there was nothing better at the time; also, and especially compared to its predecessors, SQL's mathematical foundations gave it a kind of scientific validity that is loved by academia. But even SQL pioneers agree that there are problems. Of course, the diagnoses vary a lot. Some people (such as the renowned C. J. Date) think that the basic mathematical model itself is correct, and that the current implementations are flawed. There are some who believe that SQL itself is dead, and advocate an XML-based model instead. One line of research that has so far failed to reach widespread adoption is the object-oriented database concept. There are several implementations, but somehow they fail to have the same level of public awareness that RDBMS or XML-based storage has today.

It seems a good situation to apply the age-old adage: when all that you have is a hammer, everything looks like a nail. Relational databases are great, but can't solve all problems. XML is also interesting, but is bloated and confusing -- kind of an overused tool, made to do too many things beyond its original roots. Object databases are interesting, but normally suffer from being too tied to a single environment.

In the middle of this, there is an unforeseen trend towards the use of the file system as a storage medium. Yes, the file system. Guess what? Forget the FAT, please. Current file systems are much more stable and efficient than older ones. Modern filesystems are hierarchical, and can store arbitrary objects. Support for journaling and better metadata management means that the filesystem is now a better choice for many situations. Several web publishing engines (blogs, wikis, and even full-fledged content management systems) support filesystem-based storage for text notes and documents, which were previously stored (in a hackish and haphazard way) in DB blobs. The full filename is now a primary key, and flexible relationships between entities can be expressed as hyperlinks.
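As a toy illustration of the "filename as primary key" idea, consider the sketch below. The FileStore class and the note names are invented for the example; it does no locking and no validation of keys, so it is only meant to show the shape of the approach, not to be used as-is.

```python
import os
import tempfile


class FileStore(object):
    """Hypothetical note store: the relative path is the primary key."""

    def __init__(self, root):
        self.root = root

    def _path(self, key):
        # Keys use '/' as separator regardless of the platform.
        return os.path.join(self.root, *key.split('/'))

    def put(self, key, text):
        path = self._path(key)
        folder = os.path.dirname(path)
        if not os.path.isdir(folder):
            os.makedirs(folder)
        with open(path, 'w') as f:
            f.write(text)

    def get(self, key):
        with open(self._path(key)) as f:
            return f.read()

    def keys(self):
        # Walk the hierarchy and report every stored document by its key.
        found = []
        for folder, _, files in os.walk(self.root):
            for name in files:
                rel = os.path.relpath(os.path.join(folder, name), self.root)
                found.append(rel.replace(os.sep, '/'))
        return sorted(found)


store = FileStore(tempfile.mkdtemp())
# A relationship between two notes is just a reference to the other key.
store.put('notes/python/adaptation.txt', 'See also: notes/python/protocols.txt')
store.put('notes/python/protocols.txt', 'Protocols and interfaces')
all_keys = store.keys()
```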

Of course, the same age-old advice applies to file-based storage: it's not suited for every task. But it's better than many academics would like to admit, and much safer than the old "I've seen a file system corrupted" guys would fear. For many things, it's already the best bet. The best part? It's free, and comes installed with the OS.