Friday, January 14, 2005

Do you ever have an idea, then somebody else has the same idea, but they screw it up? Well, I just had that happen to me. I stumbled on this article on Fast Infoset today. Fast Infoset is a specification for a binary version of XML. I had this same idea a couple of years ago, and the execution is amazingly similar, but, of course, they screwed it up.

Essentially, everybody is finally realizing that while XML is the first widely accepted data markup format, it's a pig. It's verbose and redundant, which makes it store poorly, transmit slowly, and parse, well, like a pig. Don't get me wrong, the original idea for XML was actually fine, but everybody has taken it way over the top and things like SOAP are just an abomination.

Well, one way to help XML while still retaining XML semantics is to make a binary version of it. While a compression program like gzip can reduce the overall transfer and storage size of an XML document, an XML parser still has to deal with the XML textual format on either end of a transfer. The textual format forces the decoder to examine each and every byte to determine its significance in the document, even when a decoder doesn't understand vast expanses of the XML schema that is being parsed. And that textual XML format actually expands binary data carried in an XML document by forcing it into BASE64 format, which does a 3-goes-to-4 encoding (every 3 octets of input become 4 characters of output).
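You can see the BASE64 tax directly with a couple of lines of Python (the 300-byte payload is just an arbitrary illustration):

```python
import base64

raw = bytes(300)                   # 300 arbitrary binary octets
encoded = base64.b64encode(raw)

print(len(raw), len(encoded))      # 300 400 -- every 3 octets become 4 characters
```

That's a 33% size penalty before any markup overhead is even counted.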

So, my idea, and that of Fast Infoset, is to notice that most XML documents are extremely redundant with tag and attribute information. First, you have all those < and > symbols surrounding every tag. That's four bytes of redundant info for every opening and closing tag pair. Then you have the tag name itself, which gets repeated at least twice, once in the opening tag and once in the closing tag. Then, you have the fact that many XML documents are recursive tree structures where nodes at any given level of the hierarchy share a lot of the same type of info. For instance, an XML document storing a list of books would use TITLE, AUTHOR, and ISBN tags for each book. Add in namespace prefixes and attributes and you have a lot of redundancy, even before you start talking about SOAP. In short, the information density of your average XML document has Claude Shannon spinning in his grave.
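As a crude illustration of that redundancy, count how much of a small document is markup rather than data (a toy example, obviously):

```python
# Rough measure of markup overhead in a tiny XML document
doc = ("<BOOKS>"
       "<BOOK><TITLE>Dune</TITLE><AUTHOR>Frank Herbert</AUTHOR>"
       "<ISBN>0441172717</ISBN></BOOK>"
       "</BOOKS>")
content = "Dune" + "Frank Herbert" + "0441172717"
markup = len(doc) - len(content)
print(markup, len(doc))   # 73 of the 100 bytes are markup, not data
```

Nearly three quarters of the bytes carry no payload at all, and the ratio barely improves as you add more BOOK records, since every record repeats the same tags.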

Luckily, it's pretty easy to compress this information right out of an XML document. You simply create a tag/attribute hash table and assign each unique tag or attribute in the document a unique number. Rather than writing TITLE everywhere, you would instead use the number 0; 1 for AUTHOR; 2 for ISBN; and so on. So, rather than "<AUTHOR>" (8 bytes), we would simply have 0x0001 or something (2 - 4 bytes). By eliminating the end tag and encoding the content inside the AUTHOR tag with a 4-byte length, we have a net savings of 8 + 9 - 2 - 4 = 11 bytes every time we would otherwise use an AUTHOR tag. Do the same thing for all the other tags in your document and it adds up pretty quickly. Finally, by encoding binary data as binary octets, rather than BASE64, we eliminate the "ASCII tax" imposed by a textual format on binary data.
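Here's a minimal sketch of that scheme in Python -- a toy tag table and framing I made up for illustration, not Fast Infoset's (or BCX's) actual wire format:

```python
import struct

def encode(node, table=None):
    # Toy sketch of the tag-table idea: 2-byte tag codes, 4-byte lengths.
    # A real format would also need a flag to tell text content apart
    # from child elements so a decoder can round-trip the document.
    if table is None:
        table = {}
    tag, content = node
    code = table.setdefault(tag, len(table))      # first use assigns the next number
    out = bytearray(struct.pack(">H", code))      # 2 bytes instead of "<TAG>"
    if isinstance(content, bytes):
        out += struct.pack(">I", len(content))    # length prefix replaces "</TAG>"
        out += content                            # raw octets, no BASE64 tax
    else:
        out += struct.pack(">I", len(content))    # here: number of children
        for child in content:
            out += encode(child, table)
    return bytes(out)

book = ("BOOK", [("TITLE", b"Paradigms of AI Programming"),
                 ("AUTHOR", b"Peter Norvig"),
                 ("ISBN", b"1558601910")])
print(len(encode(book)))   # 73 bytes vs. 107 for the equivalent textual XML
```

Each `<AUTHOR>...</AUTHOR>` pair costs 17 bytes of markup in text form but only 6 bytes (code plus length) here, which is exactly the 11-byte savings worked out above.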

The interesting thing is that we still fundamentally have an XML document. This means that you can run this new format through a SAX-like or DOM-like decoder and produce exactly the same data that an XML-based application expects, and the application is none the wiser. Everything is just smaller and faster. Because the binary format parser doesn't have to actually look at the characters that make up all these tags and try to match them with other strings, it can whip through a document at light speed.
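A sketch of what that looks like from the application's side: a hypothetical binary decoder delivers the same SAX events a text parser would, and the handler can't tell the difference. The event list here stands in for the decoder's output after tag codes are looked back up in the dictionary:

```python
import xml.sax

class BookHandler(xml.sax.ContentHandler):
    # An ordinary SAX handler; it neither knows nor cares what parsed the bytes.
    def __init__(self):
        self.titles = []
        self._in_title = False
    def startElement(self, name, attrs):
        self._in_title = (name == "TITLE")
    def characters(self, text):
        if self._in_title:
            self.titles.append(text)
    def endElement(self, name):
        self._in_title = False

# Events as a hypothetical binary decoder might deliver them:
events = [("start", "BOOK"), ("start", "TITLE"), ("chars", "Dune"),
          ("end", "TITLE"), ("end", "BOOK")]

handler = BookHandler()
for kind, value in events:
    if kind == "start":
        handler.startElement(value, {})
    elif kind == "chars":
        handler.characters(value)
    else:
        handler.endElement(value)

print(handler.titles)   # ['Dune']
```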

So, that was my theory. I even had a catchy name for it, BCX: Binary Coded XML.

Along comes Fast Infoset. Great minds think alike. Obviously, the need is there. Only they screwed it up. Rather than just keeping things simple, they decided to encode the whole thing in ASN.1 with packed encoding rules (ASN.1/PER). Now, for those of you who don't know, ASN.1 is about the most complex, ugly data format ever designed. And it's no wonder: it was created by an international committee (ISO) as part of the Open Systems Interconnection protocols (OSI; anybody remember FTAM?). The various encoding rules used to actually serialize ASN.1 (and there are several, which is part of the problem) typically do a lot of bit-twiddling, which slows down encoders and decoders. It also makes them buggy. Anybody remember some of the recent buffer overruns attributed to ASN.1 endecs?

Well, it looks like Fast Infoset is being standardized in ISO (in fact, jointly with ISO/IEC JTC 1 and ITU-T, which is about as ugly as it gets in the international standards committee world), so they probably had to use ASN.1 or people would wonder why they weren't supporting their own standards. ASN.1 is used by a couple of protocols in common use today, including SNMP, X.509, and LDAP. That said, most IETF-originated protocols (you know, the ones that move all your web, ftp, email, etc., traffic around) use either straightforward text encodings or far simpler binary encodings.

Fast Infoset does manage to get pretty good compression (20% - 80% reduction, depending on document size and content). Throughput for some documents is 2 - 3 times what standard XML delivers. Overall, these are good numbers.

But, it could have been even faster. Frankly, I'm partial to Sun's old XDR format, used by ONC-RPC and NFS, which was very fast, if not quite as tightly packed as some other formats. I was also recently reading about Joe Armstrong's Universal Binary Format (UBF), created to give Erlang a streamlined wire protocol.

In short, marrying XML and ASN.1/PER is like tying two stones together and hoping they will float.

How does WBXML fit into this picture? I don't know the details, but I know it's used, for example, by some cell phones to transfer SyncML data, and there are libraries that can read and write WBXML directly without converting it to XML.

You'll be amused to know that ASN.1 has another set of encoding rules, called XER.

That's right -- ASN.1 encoded in XML.

Anyway, ASN.1/PER isn't that bad, actually. What sucks about it is the terrible syntax of the format specifications themselves. They would have been easier to handle if written as S-expressions, or even XML.

I'd like a Lisp parser generator that can handle the ASN.1 grammar. Zebu doesn't cut it. An ANTLR clone, maybe?

The thing to note is that XML was successful because it is similar to HTML. HTML was successful because it was simple and human-readable. (These are simplifications, but I think both were prerequisites for the success of HTML and XML.)

A binary format is not simple and is not human-readable. I think that XML often gets used for the wrong applications. Just because you can pass messages between threads using a simple, human-readable format doesn't mean that you should.

I think there are two solutions to the problem of XML's verbosity and redundancy:

1. Standard, tuned, application-specific protocols for tasks that XML is misapplied to. For example, I think we need a simple, lightweight, efficient inter-process communication protocol that is standardised, widely supported, and possibly integrated into new languages and VMs. I'm not talking about a new designed-by-huge-committee version of CORBA.

2. Where XML *must* be used, there should be standardised, platform-level support for interchanging XML documents. For example, if I am transmitting XML over a network, the system should be able to know that this is XML data (e.g. the programmer could tell it), and should be able to decide for itself whether to compress the XML (because trading compression time against network speed and latency is likely to be a net win for this particular document). When the XML arrives at the other end, the system should be able to inflate the data in a sensible way. E.g. perhaps the application could use the compressed XML directly; perhaps the user is viewing it with Notepad, so it must be uncompressed. The system should handle these cases, and the raw XML should be human-readable when it is needed. The compression/decompression process should be transparent to both programmer and user.
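A minimal sketch of that transparent layer, using gzip from the Python standard library; the flag byte, size threshold, and function names are made up for illustration:

```python
import gzip

xml_doc = b"<BOOKS>" + b"<BOOK><TITLE>A Title</TITLE></BOOK>" * 200 + b"</BOOKS>"

def send(payload, is_xml=False, min_size=1024):
    # Hypothetical transport hook: the programmer flags XML payloads,
    # and the system compresses the ones large enough to be worth it.
    if is_xml and len(payload) > min_size:
        return b"\x01" + gzip.compress(payload)   # 1-byte flag: compressed
    return b"\x00" + payload

def receive(wire):
    # The receiving side inflates transparently; the application only
    # ever sees the raw, human-readable XML.
    flag, body = wire[:1], wire[1:]
    return gzip.decompress(body) if flag == b"\x01" else body

wire = send(xml_doc, is_xml=True)
assert receive(wire) == xml_doc    # round-trips exactly
assert len(wire) < len(xml_doc)    # redundant markup compresses well
```

Highly repetitive markup like this shrinks dramatically under gzip, which is exactly why the decision should depend on the document, not on a global policy.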

"The thing to note is that XML was successful because it is similar to HTML."

I think it's too early to say that XML has been successful, and certainly too early to put it in the past tense! For now, a large number of people are using XML in some capacity in new projects, mostly web pages, but don't forget that the vast majority of the web is still (and will continue to be for some time to come) in various versions of HTML. I think some of the people using it for non-web things will stop doing so - for most of them, the disadvantages and changes necessitated by XML outweigh the perceived advantages. My take is that the only place XML will have a lasting impact like HTML's is on the web.

The overrun bugs are bugs in implementations of the Packed Encoding Rules (PER); they are not due to any flaws in the specification. I don't know much about PER. I do know that the performance issues with the Basic Encoding Rules (BER) were pretty bad in the early 90s. Improved algorithms changed this in the mid 90s.

The article you refer to was written at an early stage of the development of the Fast Infoset standard (X.891). The current draft of the standard does not use PER at all. The standard uses ASN.1 and ECN as the formal notation for specifying data structures, but the actual encodings are "customized" and heavily optimized for speed, ease of implementation, and compactness.

One can implement the Fast Infoset standard without possessing a deep knowledge of ASN.1, and can certainly implement it without knowing ECN at all, because the standard contains an annex that describes the encodings in full detail. PER is not used, as I said.

I am sorry if you have been misled by an outdated article. As soon as the standard is approved, it will be made available for free on the ITU-T website.

(Note that I am not affiliated with Sun. My company (OSS Nokalva) and Sun are members of the ISO/IEC/ITU-T standards committee that is developing the Fast Infoset standard and the Fast Web Services standard.)

There is another binary XML format with essentially the same ideas as yours. It's being pushed by a GIS (Geographic Information Systems) company called CubeWerx, and they've submitted it to some GIS consortium thingy.

Gzip doesn't make things parse faster. In fact, it slows things down still further, since after uncompressing you still have to parse the original XML. Gzip simply compresses the textual content. A binary encoding of XML can make the parser faster and make the binary representation more dense.

Please. IETF protocols use all sorts of encodings, from text-ish encodings that have given implementors huge headaches (RFC 821/822, anyone?), to ad-hoc binary encodings (SSHv2), to XDR (NFS), to ASN.1/BER/DER (Kerberos V, PKIX, and so on), to XML and things built atop XML (BEEP), and what not. There have been serious implementation bugs with most if not all of these. PER is, IMO, a fairly simple encoding. What makes ASN.1 difficult to deal with is its syntax, which makes writing compilers for it fairly painful, thus the dearth of open source compilers for it; the encodings are not all that special (think of XDR as PER-like, plus 4-octet alignment and minus a few features).

All of these things are bad re-inventions of the S-expression, in a way. So having re-invented the wheel for the umpteenth time with XML, why re-invent the wheel again when the time came to make an efficient encoding of XML? Why not pick an existing spec?

That ASN.1 and XML can map onto each other (XER = ASN.1 -> XML, Fast InfoSet = XML -> ASN.1/PER) is an indictment of our propensity to re-invent the wheel, out of ignorance or worse. A lot of thought and experience went into PER, so why should the authors of Fast InfoSet throw that out the window?

Finally, just because Fast InfoSet uses PER as the encoding and ASN.1 for its semantic mapping model doesn't necessarily mean that implementations must internally deal with ASN.1 syntax.

Probably not one for the ASN.1 purists out there, but if you are interested in a simple S-expression-based encoding tool, check out packedobjects.com.

Very much a work in progress, but:

- is meant to fit somewhere in between hand encoding and more formal tools
- based on unaligned PER
- very simple API
- runs on many Unix platforms, including embedded Linux
- Scheme based, so can run from an interpreter
- free