This post was written at 5:30 AM. I ran into this while doing research for another post, and I couldn't really let it go.

XML, as a text-based format, is really wasteful of space. But that wasn't what made it lose its shine. That happened when it became so complex that it stopped being human readable. For example, I give you:

<?xml version="1.0" encoding="UTF-8" ?>
<SOAP-ENV:Envelope
    xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/1999/XMLSchema"
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <ns1:getEmployeeDetailsResponse
        xmlns:ns1="urn:MySoapServices"
        SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
      <return xsi:type="ns1:EmployeeContactDetail">
        <employeeName xsi:type="xsd:string">Bill Posters</employeeName>
        <phoneNumber xsi:type="xsd:string">+1-212-7370194</phoneNumber>
        <tempPhoneNumber
            xmlns:ns2="http://schemas.xmlsoap.org/soap/encoding/"
            xsi:type="ns2:Array"
            ns2:arrayType="ns1:TemporaryPhoneNumber[3]">
          <item xsi:type="ns1:TemporaryPhoneNumber">
            <startDate xsi:type="xsd:int">37060</startDate>
            <endDate xsi:type="xsd:int">37064</endDate>
            <phoneNumber xsi:type="xsd:string">+1-515-2887505</phoneNumber>
          </item>
          <item xsi:type="ns1:TemporaryPhoneNumber">
            <startDate xsi:type="xsd:int">37074</startDate>
            <endDate xsi:type="xsd:int">37078</endDate>
            <phoneNumber xsi:type="xsd:string">+1-516-2890033</phoneNumber>
          </item>
          <item xsi:type="ns1:TemporaryPhoneNumber">
            <startDate xsi:type="xsd:int">37088</startDate>
            <endDate xsi:type="xsd:int">37092</endDate>
            <phoneNumber xsi:type="xsd:string">+1-212-7376609</phoneNumber>
          </item>
        </tempPhoneNumber>
      </return>
    </ns1:getEmployeeDetailsResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

After XML was thrown out of the company of respectable folks, we had JSON show up to entertain us. It is smaller and more concise than XML, and so far it has resisted the efforts to turn it into some sort of uber-complex enterprisey tool.

But today I ran into quite a few efforts to do strange things to JSON. I am talking about things like JSON DB (a compressed JSON format, not an actual JSON database), JSONH, json.hpack, and friends. All of those attempt to reduce the size of JSON documents.

Let us take an example. The following is a JSON document representing one of RavenDB's builds:

It reduced the document size to 2.93KB! Awesome, nearly half of the size was gone. Except that this actually generates an utterly unreadable mess. I mean, can you look at this and figure out what the hell is going on?

I thought not. At this point, we might as well use a binary format. I happen to have a zip tool at my disposal, so I checked what would happen if I ran the document through that. The end result was a file that was 1.42KB. And I had no more loss of readability than I have with the JSONH stuff.

To be frank, I just don't get efforts like this. JSON is a text-based, human-readable format. If you lose the human-readable portion of the format, you might as well drop directly to binary. It is likely to be more efficient, and you don't lose anything by it.

And if you want to compress your data, it is probably better to use an actual compression tool. HTTP compression, for example, is practically free, since just about all servers and clients can consume it now. Any tool that you use should be able to see through it. And it is likely to produce much better results on your JSON documents than a clever format like this.

Oh, I forgot to mention that when I counted the characters I ended up with roughly 4,500 and 4,000. The only way to get to the 3,000 characters of JSONH is to ignore unnecessary spaces, which is a 'compression' you also get for free with a regular JSON document.
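That free whitespace 'compression' is just serializing without indentation; any JSON serializer does it. A small illustration (the document here is an invented example, not the post's build document):

```javascript
// The same object, serialized with and without indentation.
const doc = {
  buildName: 'RavenDB-Unstable-960',
  tests: [{ name: 'a', passed: true }, { name: 'b', passed: false }],
};

const pretty = JSON.stringify(doc, null, 2); // indented, human readable
const compact = JSON.stringify(doc);         // no unnecessary spaces
console.log(pretty.length + ' -> ' + compact.length);
```

Both forms parse back to the same object, so dropping the whitespace costs nothing semantically.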

For starters, you are comparing a strongly-typed set of data with a loosely-typed one.

Then let's consider why that particular XML is so wordy - it's got data types on every item! Why? You can declare a schema up front and the element data types will still be strongly typed.

I ran into the same issue with supposed compaction when dealing with geological data in XML. There were concerns about a format which recorded lab results for assay samples. I spent a day refining a schema to get a 600KB sample file down to about half its size. Then I compared a zip of the original with a zip of my reduced file - 39KB vs 36KB!

(Note that data which is mainly lots of different floating point numbers doesn't zip as well as some other plain text so this zip ratio is, if anything, on the high side.)

It feels like the backlash against XML was due to the enterprise mess it had become in some instances. I find "simple" XML to be quite readable, and built-in tool support is better than for JSON on almost all platforms. Doc size is marginally larger than JSON, and compression makes them even closer. Throwing out XML because of horrible SOAP formats is like throwing out the JVM because of Struts. So I find the choice between JSON and XML arbitrary. If you care about size you choose neither, and if you can't make your doc readable in both XML and JSON you shouldn't have chosen a text format.

Google's Protocol Buffers came to mind...
By the way, YAML is human readable too (4.2KB). Choosing between JSON and YAML is just a matter of applicability to the actual solution (where in many cases JSON wins).

I found a very specific use case for JSONH that worked very well. If your browser app is sending/receiving payloads that contain arrays of homogeneous JSON objects, then JSONH compresses things very well. Compression gets better as the number of properties in the homogeneous JSON object increases or the number of objects in the arrays gets really large. This is because JSONH takes the property names out of the arrays and moves them into a schema-like set of properties in an outer JSON object. Thus it removes the property name duplication in the arrays. The 14 on line 2 is the number of properties in the JSON array, followed by the 14 property names. It is true, I would never use JSONH for storage because it makes it much more difficult for a human to read.
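The transformation the comment describes can be sketched in a few lines. This is a minimal illustration of the idea - hoist the shared property names into a header of [count, ...names], then append the values - not the actual JSONH library:

```javascript
// Pack an array of homogeneous objects as [count, ...names, ...values],
// so each property name appears once instead of once per object.
function pack(arr) {
  const names = Object.keys(arr[0]);
  const out = [names.length, ...names];
  for (const obj of arr) {
    for (const name of names) out.push(obj[name]);
  }
  return out;
}

// Rebuild the original array of objects from the packed form.
function unpack(packed) {
  const count = packed[0];
  const names = packed.slice(1, 1 + count);
  const result = [];
  for (let i = 1 + count; i + count <= packed.length; i += count) {
    const obj = {};
    names.forEach((name, j) => { obj[name] = packed[i + j]; });
    result.push(obj);
  }
  return result;
}

console.log(JSON.stringify(pack([{ x: 1, y: 2 }, { x: 3, y: 4 }])));
// -> [2,"x","y",1,2,3,4]
```

With many objects and many properties, the per-object name repetition dominates the document, which is why the savings grow with both dimensions - exactly the behavior the comment reports.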

In my specific use case, JSONH was 3 times faster than a client-side dictionary compression and compressed the JSON very well (70-74% in some cases). A big win when dealing with browser traffic. At a later date, this application will move to streaming the JSON in smaller increments, and then the JSONH will very likely go away. JSONH has delayed some of that pain for now.

Ayende, yes, gzip compression solves the problem in the browser, but only for responses from the server back to the browser. There is no corresponding gzip option for a GET/POST with a large request. I would absolutely love it if browsers recognized Content-Encoding: gzip on requests and would gzip them automatically. There are of course other options for large requests: plugins (sketchy at best), websockets (promising, maybe someday, but still no compression), writing to a file and transferring it (not exactly kosher or well supported). For us, JSONH was a nice option because it compressed our AJAX request payload well but was still valid JSON.

One more thing. I mentioned before that we tried several different JavaScript compression libraries. All of them work quite well in a Node.js scenario, but kind of fall flat in the browser. I think this has to do mainly with trying to emulate 8-byte binary reads/writes in the browser. Certain browsers (uhm - IE9 and below) really struggle with the code and turn out to be very slow: 5x slower in IE 8 & 9, 2.3x slower in IE10; FF20+ and Chrome 18+ do pretty well with client-side compression, about ~1.3x slower than JSONH in the browser. Once you have a compressed 8-byte emulation buffer, what do you do with it, though? Send it via AJAX as what encoding, exactly? The gzip encoding was not recognized as valid gzip by our server (Python Tornado, in our case).

I tried Content-Encoding: gzip, but the server still didn't accept it from the browser. We could send a gzipped file upload and the server would recognize it. We also sent a gzipped request from another server app, so we know it was not the web server. I never had a chance to compare the browser-generated request and the server-app-generated request in Fiddler. One day in my spare time (hah, did I really say that?) I need to go back and compare the two.

'But yes, it is not easy to do in the browser.' -- that was the kicker for this project. You can compress requests, but browser support for this feature is sadly lacking.