Fri, 22 May 2009

Yardbird

Ladies and Gentlemen a One Mister Charles Parker, Jr.

An IRC channel I consider my "home" channel is coming up on its tenth
anniversary in a few months, and its founders have begun to reflect on
how far we've really come in a decade. By far the biggest
disappointment is that our beloved and snarky bot still runs a rickety
hacked-up version of Kevin Lenzo's 1990s classic, Infobot.

We've looked into replacements in the past, but they always seemed like
mere incremental improvements. They'd provide some degree of
reliability and a slightly saner codebase, but they generally add little
that appeals to us. Often they're written in Perl, like Infobot, which
is something we're trying to move away from (for no reason beyond the
fact that nobody in the channel is comfortable maintaining Perl code any
more).

So when I finally found a bot that was reasonably more fun than
Infobot, I leapt at the chance to put it through its paces despite its
being written in Perl. For various other reasons, it was not fit for
purpose, although looking at the code gave me an inspiration.

How Not To Write A Bot

The bot in question was written with POE, which appears to be Perl's
answer to Twisted Python. POE gives you lots of library functions and
objects that allow you to do things asynchronously by firing off events
and registering callbacks for when something finishes. This is somewhat
important, so that you don't say "Bot, do this five-minute thing" and
have it fall off the network because it hadn't returned to the protocol
code for five minutes.

But this bot was written in a very hasty JFDI sort of scripting style.
I'm told that the real marvel was how quickly it was
brought up and running, and I don't think it's fair to judge the authors
based solely on this work. However, there was something of a common
antipattern throughout the command recognition code:

Dispatchers

While tracing through a printout of this thing in an attempt to even
figure out what its features were, I thought "Gosh, wouldn't it be
great if this had a dispatch mechanism where you could associate regexes
with functions in some kind of data structure, along with some kind of
application data for context?" And of course that immediately reminded
me of...
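As a rough illustration of the dispatch idea, here is a minimal sketch in
plain Python using only the re module. Everything in it (ROUTES,
dispatch, the three handlers, and the ctx dict) is a hypothetical name
made up for this example, not code from the bot in question. Django's
URL dispatcher, which comes up later, works on the same principle.

```python
import re

# Hypothetical handlers: each receives a context dict plus the regex captures.
def greet(ctx, name):
    return "%s: hello, %s!" % (ctx["nick"], name)

def recall(ctx, term):
    return ctx["facts"].get(term, "No idea what %s is." % term)

def learn(ctx, term, answer):
    ctx["facts"][term] = answer
    return "OK, %s is %s." % (term, answer)

# The dispatch table: (compiled regex, handler) pairs, tried in order.
# Order matters: "what is X" must be tried before the general "X is Y".
ROUTES = [
    (re.compile(r"^hello (?P<name>\w+)$"), greet),
    (re.compile(r"^what is (?P<term>\w+)\??$"), recall),
    (re.compile(r"^(?P<term>\w+) is (?P<answer>.+)$"), learn),
]

def dispatch(message, ctx):
    """Call the first handler whose pattern matches; None if nothing does."""
    for pattern, handler in ROUTES:
        match = pattern.match(message)
        if match:
            return handler(ctx, **match.groupdict())
    return None
```

Adding a command then means adding one line to ROUTES rather than
threading another branch through the recognition code.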

I filed this little bit of inspiration away, and started worrying about
how I'd go about cloning this system. I'd use Twisted Python, of
course, but the asynchronous database library is on the same level as
POE's, and I prefer a good ORM. Man, wouldn't it be great if my bot
could use the Django ORM?

Asynchronicity

Of course, the Django ORM isn't built with a callback-based API, so
while your code does a query, that's all your program can do. This
prompted me to wonder why that doesn't become a problem for Django Web
apps. Surely they receive hundreds or thousands of requests per
second, but concurrency never becomes a problem that exposes itself to
the app programmer.

The answer is that while Django does not have any support for
asynchronous programming, the very model in which it operates assumes
that it's being called from a Web server such as Apache. Apache has its
own forking or threading model that it uses to handle lots of requests
simultaneously, and CGI and WSGI each provide a well-defined
interface for passing connection information into a program and getting
a response back out.

An IRC Apache

One morning on the Underground, I began to reason that what I needed was
a sort of "IRC Bot Apache" that would handle incoming IRC events of
various sorts (PRIVMSG, ACTION, etc.), and then pass them along
with some connection information to my Django code. I'd dispatch these
messages through the regex-based patterns() system, then call view code
that uses ORM objects to perform queries.

The obvious choice for implementing something like this is Twisted
Python, which shows its age but remains the de facto Python library for
coding state machines. I'm only passingly familiar with the system
(and the documentation is filled with distracting Java-esque Software
Engineering babble for some bizarre reason), but I was able to localize
the actual Twisted-using code to one function, which at the very least
makes it simple to hand off to experts to tell me if I'm doing anything
stupid.

Yardbird

Digging through the Twisted documentation I found their example LogBot
and loosely based my bot on that pattern, subclassing irc.IRCClient and
replacing the privmsg method along the following lines:

The inlineCallbacks decorator essentially catches any yield of a
Twisted deferred object and schedules the next delve into the generator
using standard Twisted deferred-execution mechanisms. So now yield
really behaves like a scheduler yield, and you can let some more
critical IRC-parsing code run between your own calls.
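To make the mechanism concrete, here is a minimal stdlib sketch of the
same idea, with concurrent.futures.Future standing in for a Twisted
deferred. This is an analogue, not Twisted's implementation: the
inline_futures decorator, slow_lookup, and handle are hypothetical
names, and unlike Twisted (which resumes the generator in the reactor
thread) this version resumes it on whichever thread completes the
future.

```python
from concurrent.futures import Future, ThreadPoolExecutor

def inline_futures(gen_func):
    """Drive a generator: each yielded Future resumes it when it completes."""
    def wrapper(*args, **kwargs):
        gen = gen_func(*args, **kwargs)
        final = Future()  # resolves with the generator's return value

        def step(value=None, error=None):
            try:
                if error is not None:
                    pending = gen.throw(error)   # re-raise inside the generator
                else:
                    pending = gen.send(value)    # hand back the last result
            except StopIteration as stop:
                final.set_result(stop.value)     # generator finished normally
            except Exception as exc:
                final.set_exception(exc)         # generator raised
            else:
                # Never block here: resume only once the yielded Future is done.
                pending.add_done_callback(resume)

        def resume(done):
            error = done.exception()
            if error is not None:
                step(error=error)
            else:
                step(done.result())

        step()
        return final
    return wrapper

pool = ThreadPoolExecutor(max_workers=2)

def slow_lookup(key):
    # Stand-in for some slow, blocking piece of work.
    return key.upper()

@inline_futures
def handle(message):
    answer = yield pool.submit(slow_lookup, message)
    return "reply: " + answer
```

Calling handle("bird") returns immediately with a Future; the text after
the yield only runs once slow_lookup has finished in the pool.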

Next we build a Django urlresolver object so we can dispatch regexes
to handler functions, using some Django settings info to determine
path info:

Then we build a request dictionary. In normal Django this would be
an HttpRequest object, containing all sorts of information about the web
server and the remote client and the HTTP request itself. Since this is
a quick-and-dirty example, I've reduced this to a dict for simplicity.
I also passed in the settings namespace just to be lazy (so I can keep
things like nickname in there):

Now we actually use our resolver to test the incoming message against
all our patterns in privmsg.py. It returns the appropriate function,
along with all of the anonymous and named matches that were generated
by the winning regular expression. Note that we have to prepend a / to
our message to appease the URL-centric resolver:

callback, args, kwargs = resolver.resolve('/' + request['msg'])

Finally we get to the deferred execution magic! We have a function, a
request object, and some arguments made from textual analysis of the
message. We use the threads.deferToThread function to run the view in
a completely separate thread, and yield the deferred object it returns
up to our inlineCallbacks decorator to be scheduled:

Our view function then runs in the background, taking as long as it
likes while our bot concerns itself with answering PING replies and
dispatching further events to the resolver.
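The effect is easy to demonstrate without Twisted. The sketch below is
a stdlib analogue that swaps in asyncio's run_in_executor for
threads.deferToThread; slow_view, keepalive, and the shape of the
request and response dicts are assumptions made up for the example. A
blocking "view" runs in a worker thread while the event loop keeps
servicing a stand-in for PING handling.

```python
import asyncio
import time

def slow_view(request):
    """A blocking handler, standing in for a view that runs ORM queries."""
    time.sleep(0.2)
    return {"method": "NOTICE", "text": "done with " + request["msg"]}

async def keepalive(ticks):
    """Stand-in for protocol housekeeping such as answering PINGs."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)

async def main():
    loop = asyncio.get_running_loop()
    ticks = []
    pinger = asyncio.create_task(keepalive(ticks))
    # Run the blocking view in the default thread pool; the loop stays free.
    response = await loop.run_in_executor(None, slow_view, {"msg": "hello"})
    pinger.cancel()
    return response, len(ticks)
```

Running asyncio.run(main()) returns the view's response along with the
number of keepalive ticks that fired while the view was sleeping, which
is the whole point: the protocol side never stalls.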

We're confident that the Django code is reasonably thread-safe, as it
has to handle concurrency under a variety of Web server models (such as
Apache's Worker MPM or a traditional Prefork model). Once the function
returns a value, the worker thread is released and execution comes back
to this method again, chucking the returned value into our response
object.

We're almost done, but we still need to actually do something with this
information! In ordinary HTTP Django this would be an HttpResponse
object, containing all sorts of information on what template to render
and what dictionary to pass in as an extra context namespace. This is a
bit overkill for this example, so I've simplified it to another dict:

The various RFCs for IRC all state rather loudly that automated bots
are meant to speak using NOTICE but always ignore NOTICEs from
other sources. This is meant to prevent feedback loops flooding a
channel. Also note that since this was the final statement of my
inlineCallbacks function, I called defer.returnValue to spit back
the result of the notice call. I'm not convinced that it was at all
necessary, but I believe it's harmless boilerplate in the worst case.

Whoa, is that all?

My current version is obviously not exactly like this. I've been doing
some rather wild and thrashing testing and debugging, and there's some
mess and refactorings. The above is intended as a demonstration of the
technology only while I do my explorations in the yardbird bazaar tree.

For a start, my current version applies an errback function to log
exceptions in the view function thread, and I've refactored the above
code into a separate function that both the ACTION and PRIVMSG
functions can use. The principle is still the same, though.

What's With The Name?

Because Django is named after Django Reinhardt, and was split from a CMS
project named Ellington (as in, Duke), it's become traditional to name
Django projects after Jazz greats. For example, there's a popular
Django e-commerce system named Satchmo. I dug around and couldn't find
any named after Charlie Parker, or his nickname 'Bird'. I just decided
to play it safe and use the rarer long form "Yardbird".

Where do I get Your Version?

My spazzy tree, complete with README.txt files for the Apache indexing
and a hackish attempt at implementing some of the Infobot
functionality, is up at http://zork.net/~nick/yardbird/. It's also a
Bazaar tree, so you can just:

bzr branch http://zork.net/~nick/yardbird/

Right now I'm trying to figure out what I should really do for the
IrcRequest and IrcResponse objects, and how to properly package all this
up like a proper professional project.

Your code sucks!

Soz.

Sat, 14 Mar 2009

Your favorite ORM sucks

Over the past year or so I have been learning and enjoying Django, which is a large set of Python modules that are impressively nice for creating dynamic Web sites. It's the latest generation in a set of tools that have been written to solve this very problem, going back all the way to the original CGI libraries of the mid-1990s.

You can get more details on the Django Web site, but the basic pieces you interact with are these:

Object-Relational Mapper:

This saves you from having to type in COBOL-esque SQL queries and manually connect the results to the data structures in your code.

Template Language:

This saves you from having to type in the punctuation-heavy HTML or XHTML to render your pages, and lets you focus more on the content.

URL Dispatcher:

This saves you from having to build trees of scripts or put argument-based flow control spaghetti in your code, and lets you restructure the layout of your URLs in a nice clean manner.

Form Handler:

This ties the previous three together to safely simplify one of the more dangerous aspects of programming: input validation.

There are more subsystems (such as the libraries of HTTP handling functions available to you in your “view” code), but those four seem like the legs of the table to me.

Feelings of Inadequacy

As I've been working with this, I've run into a number of conversations on IRC that seem to go something like this:

Really? I find it wonderful to have all that coordination among the pieces.

nerd:

Django's ORM sucks and is too Django-specific! SQLAlchemy forever! Behold The Power!

Okay, so what can you do? After enough of these conversations, I had this sense that I was kind of using the My First Sony of Web development systems. And the Loosely Coupled, Tightly Integrated motto has been pointed out to have the effect that while you can swap a new ORM into your Django app, few people ever use the Django ORM in anything else.

Eventually it came time to do some work on a non-Web application and I needed a database. With all these conversations with Riot Nrrds like the above, I figured I'd save myself a lot of trouble by going straight to SQLAlchemy.

The Reason We Have ORMs

There's a reason we have these “Object-Relational Mapper” things, and it's because of a problem known as the Object-Relational Impedance Mismatch. Basically, the formal mathematical model for databases used to ensure that they stay intact follows a system of tables with rows and columns and references to other tables, while data structures in most programming languages we use today manipulate data in nested tree-like structures. It's rather like the difference between a spreadsheet and an XML document, or between a ledger and a family tree diagram.

For a long time people just suffered through it: building SQL queries in strings and shoving them over the wire to databases, then manually parsing the results and populating the local data structures, all the while hoping nothing got mistranslated in the process. But in the argument over why we keep relational databases around in this object-oriented day and age anyway, people noticed that there was a formal and automatic mapping you could perform between the two.
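For a feel of what that suffering looks like, here is a small sketch
using the stdlib sqlite3 module. The channel and factoid tables, their
columns, and the load_channels function are all invented for the
example. The query comes back as flat rows, and it is up to
hand-written code to fold them into the nested structure the rest of
the program wants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE channel (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE factoid (id INTEGER PRIMARY KEY,
                          channel_id INTEGER REFERENCES channel(id),
                          term TEXT, answer TEXT);
    INSERT INTO channel VALUES (1, '#home');
    INSERT INTO factoid VALUES (1, 1, 'perl', 'scary');
    INSERT INTO factoid VALUES (2, 1, 'bird', 'Charlie Parker');
""")

def load_channels(conn):
    """Hand-map flat joined rows into nested dicts: channel -> term -> answer."""
    rows = conn.execute(
        "SELECT c.name, f.term, f.answer FROM channel c "
        "JOIN factoid f ON f.channel_id = c.id ORDER BY f.id")
    channels = {}
    for name, term, answer in rows:
        # Every new table or column means more of this glue to keep in sync.
        channels.setdefault(name, {})[term] = answer
    return channels
```

Two tables and one relationship already need a join, a loop, and a
naming convention that lives only in the programmer's head.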

So fundamentally the ORM is there to allow you to manipulate all your data in one single format, save you the trouble of re-inventing the necessary mapping techniques between the two, and cut down the complexity of your code thereby reducing your exposure to bugs. It's a fantastic thing, since relational databases are still important for their speed and reliability, and every reasonably-expressive programming language since LISP (which is from the 1950s, after all!) has had an implicit bias toward hierarchical representation of data.

Behold The Power!

Having fiddled with Django's ORM, and looked into Ruby's ActiveRecord system and read a bit about the history of Java's Hibernate ORM, I decided to see what was so hot about all the other Python ORMs out there.

I quickly learned that my sample of three had been somewhat biased toward a particular state of mind: namely those who want the impedance mismatch to actually be solved for them. The majority of ORMs used by the Riot Nrrd set instead seem to be written by people who enjoy the power of SQL and demand lots of advanced SQL features. Nothing that would actually make your job as a programmer easier ranks anywhere on their priority list. One of the high-end Python ORMs even brags in its Features list about how it can't do schema generation!

The result of all this talking out of school was that I had a project using SQLAlchemy and it was a horrible bureaucratic mess. I had session object setup and tear-down all over, manual coupling between table objects and actual useful objects for some reason, and it barely worked at all. I was forking processes off that needed to do a bit of database work, and my transactions and sessions kept stepping on each other. Hell, even the simple single-process jobs were hard to get right!

I'm sure this is where the Riot Nrrd contingent would step in and scream about how incompetent I must be. Why, a truly skilled SQL craftsman would be able to foresee all the intricate concurrency and relational integrity performance-critical session collision transaction issues that would arise, and work around them!

Back to the Tinker-Toys

Frustrated and eyeing a looming deadline warily, I decided to see if maybe I could just stick to what I knew and come up with a better solution. Hey, maybe I'd bite the bullet and set up the whole thing as a Web app inside the much-derided all-singing all-dancing framework (a term that to me typically refers to an unfinished project, rather than a useful set of libraries).

James Bennett is one of those hackers you wish there were more of on the Internet. He's knowledgeable and proficient while remaining reasonable and courteous. He reminds me of an old Jesuit brother I once knew who never seemed to open his mouth unless he was saying something that would help someone out.

Anyway, Bennett wrote an entry of his own on how to do standalone Django scripts, and that was in 2007! In 2007, Django was still in version 0.96, and there was a lot of isolating of components to go before the 1.0 release. And damn if the settings.configure() trick isn't simple and straightforward!

…and then my database code lives in myapp/models.py and is something like ⅓ the length of the corresponding SQLAlchemy horror. It also gains a little input validation magic from some application-level data types (such as “IP Address”) that assert constraints not present in the database itself.

The Return of Session Management

Ah, but there comes a wrinkle! You just knew there'd be a wrinkle, didn't you? Remember how I was forking off asynchronous handler processes? So here's where that sinking feeling of inadequacy returned. I mean, Django's implicit session handling was causing me a headache, which sounds like precisely the sort of thing the Riot Nrrds were kvetching about!

Django's DB access model is very conservative: it won't chat with the database basically until the query you're building is at a stage where it needs information from the DB itself to proceed any further. It's a very lazy approach and I love it to bits. The session handling is all done behind the scenes, and any DB-bound “model” object will try to use a pre-existing connection if possible, but will open a new session if needed.

So I rolled up my sleeves and prepared to throw manual and bureaucratic session-management code all over my app, and ended up with the following:

That's it! Two lines to just drop the current database connection before forking, and one of them is an import! Then the parent and the child each get a new session the next time they perform a query or update operation on any Django model object. I'm still picking my jaw up off the floor over how simple this was.
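The pattern itself (drop the connection, fork, let each side reconnect
lazily) can be sketched with nothing but the standard library. What
follows is an analogue built on sqlite3 and os.fork, so it is
Unix-only, and it is not Django's API: LazyDB, fork_worker, and demo are
hypothetical names for the example.

```python
import os
import sqlite3
import tempfile

class LazyDB:
    """Opens its connection on first use and reopens it after close()."""
    def __init__(self, path):
        self.path = path
        self._conn = None

    def execute(self, sql, params=()):
        if self._conn is None:           # connect lazily
            self._conn = sqlite3.connect(self.path)
        with self._conn:                 # commit on success, roll back on error
            return self._conn.execute(sql, params).fetchall()

    def close(self):
        if self._conn is not None:
            self._conn.close()
            self._conn = None

def fork_worker(db, job):
    db.close()                  # drop the shared connection before forking
    pid = os.fork()
    if pid == 0:                # child: its first query opens a fresh connection
        try:
            db.execute("INSERT INTO log (who) VALUES (?)", (job,))
        finally:
            os._exit(0)         # never fall back into the parent's code path
    os.waitpid(pid, 0)          # parent: wait, then reconnect lazily as well
    return db.execute("SELECT who FROM log ORDER BY rowid")

def demo():
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    db = LazyDB(path)
    db.execute("CREATE TABLE log (who TEXT)")
    db.execute("INSERT INTO log (who) VALUES (?)", ("parent",))
    return fork_worker(db, "child")
```

Because neither process inherits a live connection, parent and child
never end up sharing one session across the fork.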

Django For The Win

So I got to do a massive bit of coding with the delete key, and I can now be confident about the correctness of all my database code. Furthermore, I did it without dragging in any baggage from the rest of the Django set (unless you count the brief use of the django.conf.settings module, which is a fair point).

So where does all this hatred for Django come from in the Riot Nrrd set? I suspect it is partly a quick look at early (pre-0.96) versions that were still in a “we're working on separating our framework from the Ellington CMS” state, and largely a reaction against the horrible “type a bit of python code into this browser input box” model of early Zope releases. I think that Zope in the late 90s left a lot of programmers with a bad taste in their mouths.

But in the end none of the complaints about Django held up, and SQLAlchemy simply wasn't fit for purpose. I ended up feeling kind of cheated, like I'd lost a week because I listened to some uninformed Real Programmer posturing that did nothing but make trouble. It could have been that I just hit a bunch of DBAs who use Python occasionally instead of people like me who are Python programmers who occasionally need a database, but I know that's not really the case.

Okay Maybe Only 99.9%

Right now my only complaint about Django is that the #django channel on Freenode contains an op by the nickname of Magus who is the absolute opposite of James Bennett. Every time Magus answers someone's question, he does so in the most condescending way possible. Someone noticed this and set up a Twitter feed to log each time he uses “obvious” or “of course”, but in February he changed his nick to avoid it. I'm sure this guy is a useful contributor to Django, but answering even naïve questions like that is simply not helpful.