Non-blocking parsing

Last month we saw one way to produce decent error messages while parsing. I also mentioned that parsing with derivatives is a non-blocking parsing technique. What does that actually mean?

Let’s look at a counterexample first. PetitParser assumes that it will parse a complete whatever-it-is-you’re-parsing: a complete Smalltalk method, a complete SIP packet, a complete HTML document. If you attach a PetitParser to a socket and give it half a SIP packet, it will wait for more input, blocking the execution of that thread. (Really, in most Smalltalks that means blocking a Process, which represents a green (preemptively multitasked) thread.)

If you’re writing a network server and you use a blocking parser, you’ve just added a denial of service vector: an attacker can open many connections, send each one an incomplete message, and so tie up more and more threads while giving you a headache.

In contrast, a non-blocking parser will read what it can from its input, and return. In the case of parsing with derivatives, it will return a parser that contains parses for the input received so far. So let’s try it out!

Note that each get pulls precisely one character off the underlying stream. In the context of a network server, we can simply read whatever’s in our socket buffer, and the parser will process what it can and return. No denial of service, and it means you can employ standard asynchronous IO in a thread that services many sockets as and when they receive data.
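To make the idea concrete outside Smalltalk, here is a rough Python sketch of feeding input to a derivative-based matcher one chunk at a time. It is only a recognizer for regular languages (it answers "could the input so far be complete?" rather than building parses), it does no simplification of the intermediate parsers, and every name in it (`Lang`, `derive`, `feed`, `literal`) is made up for illustration rather than taken from PetitParser or Xtreams.

```python
from dataclasses import dataclass


class Lang:
    """Base class for the tiny language algebra used in this sketch."""


@dataclass(frozen=True)
class Empty(Lang):
    """Matches nothing at all (a dead parser)."""


@dataclass(frozen=True)
class Eps(Lang):
    """Matches only the empty string."""


@dataclass(frozen=True)
class Char(Lang):
    c: str


@dataclass(frozen=True)
class Seq(Lang):
    first: Lang
    second: Lang


@dataclass(frozen=True)
class Alt(Lang):
    left: Lang
    right: Lang


@dataclass(frozen=True)
class Star(Lang):
    inner: Lang


def nullable(p):
    """True if p accepts the empty string, i.e. the input so far is a full match."""
    if isinstance(p, (Eps, Star)):
        return True
    if isinstance(p, (Empty, Char)):
        return False
    if isinstance(p, Seq):
        return nullable(p.first) and nullable(p.second)
    if isinstance(p, Alt):
        return nullable(p.left) or nullable(p.right)
    raise TypeError(p)


def derive(p, ch):
    """Return the parser for what may follow after p has consumed ch."""
    if isinstance(p, (Empty, Eps)):
        return Empty()
    if isinstance(p, Char):
        return Eps() if p.c == ch else Empty()
    if isinstance(p, Alt):
        return Alt(derive(p.left, ch), derive(p.right, ch))
    if isinstance(p, Star):
        return Seq(derive(p.inner, ch), p)
    if isinstance(p, Seq):
        head = Seq(derive(p.first, ch), p.second)
        # If the first part can be empty, ch may also belong to the second part.
        return Alt(head, derive(p.second, ch)) if nullable(p.first) else head
    raise TypeError(p)


def feed(p, chunk):
    """Consume whatever input is available right now and return at once."""
    for ch in chunk:
        p = derive(p, ch)
    return p


def literal(text):
    """Hypothetical helper: a parser matching exactly the given text."""
    p = Eps()
    for ch in reversed(text):
        p = Seq(Char(ch), p)
    return p


packet = literal("PING\n")        # stand-in for a real packet grammar
state = feed(packet, "PI")        # half a packet arrives; we get a parser back
half_done = nullable(state)       # False: not a complete packet yet
state = feed(state, "NG\n")       # the rest arrives later
all_done = nullable(state)        # True: the whole packet has been seen
```

The point is that `feed` never waits: the returned parser is the saved state for that connection, so a single thread can keep one such value per socket and advance whichever one has data.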

Under the hood, the implementation’s quite simple: #parsing: returns an XTTransformReadStream. This stream lets us read an arbitrary number of elements from a source stream, process them in any fashion we like, and then write an arbitrary number of elements to a destination stream (from which downstream streams will consume). Here, we read one element from the input, calculate the derivative of the current parser with respect to that element, record the resulting parser as the new current parser, and push it to the output. We do have to fiddle a bit with the output, sending it contentsSpecies: Array, so that Xtreams knows how to store the interim parsers.
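The shape of that transform can be sketched in Python with a generator. This is an analogy and not the Xtreams API: the `parsing` function below merely borrows the name of #parsing:, and `Literal` is a hypothetical stand-in parser that matches one fixed string, included only so the example runs on its own.

```python
from typing import Iterable, Iterator, Optional


class Literal:
    """Hypothetical stand-in parser: matches one fixed string."""

    def __init__(self, remaining: Optional[str]):
        # None means the input has already diverged and can never match.
        self.remaining = remaining

    def derive(self, ch: str) -> "Literal":
        """Return the parser for the rest of the string after consuming ch."""
        if self.remaining and self.remaining[0] == ch:
            return Literal(self.remaining[1:])
        return Literal(None)

    def is_complete(self) -> bool:
        return self.remaining == ""


def parsing(parser, source: Iterable[str]) -> Iterator:
    """Lazily yield the interim parser after each element of source."""
    for element in source:
        parser = parser.derive(element)  # derivative with respect to this element
        yield parser                     # push the new current parser downstream


source = iter("OKX")
interim = parsing(Literal("OK"), source)
after_o = next(interim)   # pulls exactly one element ("O") from source
after_k = next(interim)   # pulls exactly one more ("K")
```

Because the generator is lazy, each `next` consumes one element from the source and hands back one parser, which mirrors the one-in, one-out behaviour described above; the "X" is still sitting unread in `source`.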

One last note for those not familiar with Smalltalk syntax: the bit with the semicolon is a cascade. Messages separated by semicolons are all sent, in sequence, to the receiver of the first message. So