This is the last in a series of three posts (1, 2) discussing issues with Python's urlparse module. Here, I intend to provide a solution.

In the last post, I was talking about parser combinators and parsec in particular, mentioning pyparsing towards the end. The angel-app being a Python application, parsec, while cool, is of no immediate use. pyparsing, on the other hand, provides parsec-like functionality for Python. Consider this excerpt from the RFC 3986-compliant URI parser that I'm about to present in this post (please ignore as usual the blog's spurious formatting):

Anyhow, what I mean to say is this: We have a validating URI parser now. Apart from the bugs that are still to be expected for a piece of code at this early stage, it should be RFC 3986 compliant. You can get either the Python package, or a tarball of the darcs repository (unfortunately my zope account chokes on the "_darcs" directory filename, so I'm still looking for a good way to host the darcs repository).

In a previous post, I described issues with parsing and validating URLs with the functionality provided by Python's stdlib. I will just restate that, clearly, all messages exchanged by angel-app nodes must be validated in order for it to work properly. What to do? First of all, I was of course not the first person to notice the module's shortcomings. However, I was surprised at the answers that popped up: It seems like no one was interested in actually coming up with a validating parser (perhaps even just for a subset of the complete URI syntax). Instead, people focussed on fixing specific cases where the parser would fail -- in essence adding new features, rather than putting the whole system on a solid basis. Suggestions go so far as to propose a new URI parsing module. However, the proposed new module is again based on the premise that the input represents a valid URI; the behavior in the case of an invalid input is again left undefined. WTF? Have these people never looked beyond string.split() and regexes?

Dudes, writing a VALIDATING PARSER is NOT THAT HARD, if you have a reasonable grammar and good libs. Why do people keep pretending that it is? Sure, you might be afraid of having to fire up lex, yacc and antlr, and for good reason. But with sufficiently dynamic languages, that's usually unnecessary, if you have a parser combinator library handy.

The key idea behind parser combinators is that you write your parser in a bottom-up fashion, in just the same way that you would define your grammar. You write a parser for a small part of the grammar, then combine these partial parsers to form a complex whole. The canonical example in this context is Haskell's parsec library. Let's start out with a simple restricted URI grammar:
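A minimal sketch of the combinator idea in plain Python. This is not the grammar from the original post: the toy grammar in the comments, and every function name below, are made up for illustration, and only the `scheme` rule is taken from RFC 3986. Each small parser takes a string and a position and returns the new position, or None on failure; the combinators `seq` and `many` glue them together.

```python
import string

# A parser here is a function (text, pos) -> new position, or None on failure.

def char_in(allowed):
    """Match exactly one character from the string `allowed`."""
    def parse(text, pos):
        if pos < len(text) and text[pos] in allowed:
            return pos + 1
        return None
    return parse

def literal(s):
    """Match the exact string `s`."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def seq(*parsers):
    """Match all parsers one after the other; fail as soon as one fails."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def many(p, at_least=0):
    """Match `p` greedily, requiring at least `at_least` repetitions.
    `p` must consume input on success, otherwise this would loop forever."""
    def parse(text, pos):
        count = 0
        while True:
            nxt = p(text, pos)
            if nxt is None:
                break
            pos, count = nxt, count + 1
        return pos if count >= at_least else None
    return parse

ALPHA = string.ascii_letters
DIGIT = string.digits

# Toy grammar (hypothetical, far smaller than RFC 3986):
#   toy-uri = scheme "://" host *( "/" segment )
#   scheme  = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
#   host    = 1*( ALPHA / DIGIT / "-" / "." )
#   segment = *( ALPHA / DIGIT / "-" / "." / "_" / "~" )
scheme = seq(char_in(ALPHA), many(char_in(ALPHA + DIGIT + "+-.")))
host = many(char_in(ALPHA + DIGIT + "-."), at_least=1)
segment = many(char_in(ALPHA + DIGIT + "-._~"))
toy_uri = seq(scheme, literal("://"), host, many(seq(literal("/"), segment)))

def is_valid(text):
    """Accept only if the parser consumes the whole input."""
    return toy_uri(text, 0) == len(text)
```

The Python definitions mirror the grammar rules line by line, which is the whole point. Validation falls out for free: `is_valid` either consumes the entire string or it doesn't.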

Note how the complete URI grammar specification in the RFC is barely a page long. So yeah, implementing this grammar is a significant amount of work (of course you could always choose to support just a well-defined subset), but if you have a good parser combinator library, it's just a few hours of mechanically transforming the ABNF into your parser grammar. You can even watch the Simpsons while doing it (I did). In the case of Network.URI, this boils down to a line count of 1278, with about half of the lines being comments or empty lines. Not only that, but given the complete grammar specification, it's super easy to formulate a modified grammar.

As it turns out, Python has a library quite like parsec: it's called pyparsing, and I'll bore you with it in my next (and last) post on this topic.

The moral is obvious. You can't trust code that you did not totally create yourself. (Especially code from companies that employ people like me.) No amount of source-level verification or scrutiny will protect you from using untrusted code. In demonstrating the possibility of this kind of attack, I picked on the C compiler. I could have picked on any program-handling program such as an assembler, a loader, or even hardware microcode. As the level of program gets lower, these bugs will be harder and harder to detect. A well installed microcode bug will be almost impossible to detect.

The task seemed simple enough. We had been passing around links between clones in a URL-like format of the type ${host}:${port}/${path}, with a small custom parser (an ugly hack) for parsing and unparsing these things. As we adapted the code to support IPv6 it turned out that in many cases (i.e. unless the nodename field was configured), raw IPv6 addresses would be passed around, and the parser would of course choke on that. Fair enough, I thought, time to use the established standards and

import urlparse

Now this is supposed to split the URI into parts corresponding to scheme, host, path etc. like so
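A minimal sketch of that split, assuming Python 3, where the same functionality lives in urllib.parse (in Python 2 the import is `from urlparse import urlparse`). The URL is a made-up example.

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

# Made-up example URL, chosen so that every component is populated.
parts = urlparse("http://example.org:8080/some/path?key=value#top")

print(parts.scheme)    # http
print(parts.netloc)    # example.org:8080
print(parts.path)      # /some/path
print(parts.query)     # key=value
print(parts.fragment)  # top

# Convenience attributes derived from netloc:
print(parts.hostname)  # example.org
print(parts.port)      # 8080
```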

(Why do blogs always _INSIST_ on fucking up source code? But we're kind of on topic, so maybe this fits). Anyhow, we have a fancy caching scheme, but the parser itself consists of a bunch of if and uri.split() statements. Talk about premature optimization. More than that, one should think that language implementors know a thing or two about parsers...

Consider: the parser is written in such a way that the result is predictable if and only if the input string represents a valid URL. But how do you find out if a string is indeed a URL? The answer is easy: you use a parser. In other words, the urlparse module is in most cases useless, because unless you have sufficient control over the input (unlikely for networking apps), the parse result is essentially undefined.
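To make this concrete, here is a small sketch using Python 3's urllib.parse.urlsplit (the successor of the urlparse module). The input is a made-up string that is not a valid URI, since RFC 3986 does not allow raw spaces in a host or path. The assumption, which holds for the CPython releases I'm aware of, is that the call returns a result instead of rejecting the input.

```python
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit

# Made-up input: not a valid URI, because of the raw spaces.
bogus = "http://exam ple.org/some path"

# Assumption: no exception is raised here; the string is simply cut
# into pieces at the usual delimiters.
result = urlsplit(bogus)

print(repr(result.scheme))
print(repr(result.netloc))
print(repr(result.path))
```

Nothing in the return value tells the caller that the input was garbage, so any validation has to happen somewhere else.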

However the urlparse module is not only "useless", it is in fact dangerous, since by using it for untrusted input, the behaviour of your app is by implication also essentially undefined (how do you handle an undefined result?). Now consider the following quick google code search. I don't suppose that any of the following names rings a bell with you: Zope, Plone, twisted, Turbogears, mailman, django, chandler, bittorrent. Surely all of these software packages have carefully reviewed all of their uses of urlparse, and properly identify and handle all cases where an arbitrary result may be returned... Script kiddies, REJOICE!

We're highly pleased with the progress we have been making lately: The next release of the ANGEL APPLICATION is to be expected for one of the coming weekends (obviously, it's ready when it's ready, we're largely Debian nerds after all). The obligatory screenie (looks haven't changed much, tho'):

Major changes include:

a completely revamped security model: we have abandoned our previously mixed pull/push model in favor of a purely pull model. This greatly simplifies the code, and increases security by disallowing any (with one tiny, optional, exception) modification of data on the clients by remote agents. However, this required

NAT traversal support. This we implemented by adding optional support for NAT traversal via teredo/miredo. This in turn required

(optional) support for IPv6 in the twisted matrix library, our primary infrastructure library. The extension is available as a (limited, but self-contained) add-on module from our subversion repository.

To support transparent addressing in the face of a schizophrenic internet infrastructure, agent.POL has implemented a dynamic DNS service that supports IPv6 (note e.g. the clone located at vincent.dyn.kraeutler.net, IPv6 required). He's currently offering that as a free service on majimoto.net. We plan to integrate it more tightly into the angel-app as time and resources permit.

A revamped configuration subsystem.

Improved GUI support.

An extensive code cleanup, resulting in a reasonably clean object model and a rather thorough unit test harness, while actually reducing the size of the code base.

It turns out that adding IPv6 support to the twisted library is rather straightforward. Originally, I just hacked up a few changes to get my prototype running with teredo, resulting in a few lines of changes to the twisted networking code (patch available). It turns out that the resulting code seems fully backwards-compatible with IPv4. Unintended, but highly welcome ;-) YAY (IPv6 required)!

Anyhow -- if you're a hacker, give teredo a try. Turn your laptop into a server in a few minutes. It's a sexy piece of technology which I think will greatly change the way we think about and work with the internet.

One thing to keep in mind is security: teredo provides you with a globally visible IP address, meaning you're directly addressable worldwide. NATs and many in-between firewalls are tunnelled through. If you're using a Mac, add something like the following ipfw firewall ruleset (thanks POL) to your miredo startup command (see /etc/miredo.conf) to protect you from unsolicited and possibly dangerous traffic:

The Stanford Humanities Lab and affiliates created this Machinima clip about their (at least) double-layered project called "The Dante Hotel" that conserved and experimented with a real-life hotel room appropriated in 1972, originally by Lynn Hershman Leeson and Eleanor Coppola. More here.