Moo-able Type for Cowtowncoder.com

Tuesday, January 20, 2009

To continue with the thesis of "exactly 3 methods to process structured
data formats (including Json)", let's have a look at the first alleged
method, "Iterating over Event Streams" (for reading; and "Writing to an
Event Stream" for writing). I must have already written a bit
about this approach, given that it is the approach that Jackson
has used from the very beginning. But, as the Romans put it: "Repetitio est
mater studiorum". So let's have (yet another) look at how Jackson
allows applications to process Json content via a Stream-of-Events (SoE?)
abstraction.

1. Reading from Stream-of-Events

Since Stream-of-Events is just a logical abstraction, not a concrete
thing, the first thing to decide is how to expose it. There are multiple
possibilities; and here too there are 3 commonly used alternatives:

As an iterable stream of Event Objects. This is the approach taken by
the Stax Event API. Benefits include simplicity of access, and object
encapsulation, which allows holding onto Event objects during
processing.

As callbacks that denote Events as they happen, passing all data as
callback arguments. This is the approach the SAX API uses. It is highly
performant and type-safe (each callback method, one per event type,
can have distinct arguments) but may be cumbersome to use from the
application's perspective.

As a logical cursor that allows accessing concrete data regarding one
event at a time. This is the approach taken by the Stax Cursor API. The
main benefit over the event objects approach is performance (similar
to that of the callback approach): no additional objects are constructed
by the framework, and the application only has to create objects if it
needs any. And the main benefit over the callback approach is simplicity
of access by the application: no need to register callback handlers,
no "Hollywood principle" ("don't call us, we'll call you"), just simple
iteration over events using the cursor.

Jackson uses the third approach, exposing a logical cursor as a
"JsonParser" object. This choice was made because it combines
convenience and efficiency (the other choices would offer one but not
both). The entity used as the cursor is named "parser" (instead of
something like "reader") to closely align with the Json specification;
the same principle is followed by the rest of the API (so a structured set
of key/value fields is called an "Object", and a sequence of values an
"Array" -- alternate names might make sense, but it seemed like a good
idea to try to be compatible with the data format specification first!).

To iterate the stream, the application advances the cursor by calling
"JsonParser.nextToken()" (Jackson prefers the term "token" over "event").
And to access data and properties of the token the cursor points to, it
calls one of the accessors, which refer to properties of the currently
pointed-to token. This design was inspired by the Stax API (which is used
for processing XML content), but modified to better reflect specific
features of Json.
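To see the cursor in isolation, here is a minimal sketch of iterating over all tokens of a small document. (This is an illustration, not a listing from the post; package names are from the Jackson 1.x line current at the time of writing -- in today's Jackson 2.x the same classes live under com.fasterxml.jackson.core, and createJsonParser is simply createParser.)

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonToken;

public class TokenDemo {
    // Collect the sequence of tokens the cursor visits, in document order
    public static List<JsonToken> tokens(String json) throws Exception {
        JsonParser jp = new JsonFactory().createJsonParser(new StringReader(json));
        List<JsonToken> result = new ArrayList<JsonToken>();
        JsonToken t;
        while ((t = jp.nextToken()) != null) { // null signals end of input
            result.add(t);
        }
        jp.close();
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Prints START_OBJECT, FIELD_NAME, VALUE_STRING, ... in order
        for (JsonToken t : tokens("{\"name\":\"Bob\",\"ids\":[1,2]}")) {
            System.out.println(t);
        }
    }
}
```

Note that the cursor never materializes the document: each call to nextToken() simply advances past one token, and accessors like getText() only see the current one.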

So the basic idea is pretty simple. But to give a better idea of the
details, let's make up an example. This one is based on the
Json-based data format described at http://apiwiki.twitter.com/Search+API+Documentation
(and uses the first record entry of the sample document too), but with
some simplifications (omitting fields, renaming).

{
"id":1125687077,
"text":"@stroughtonsmith You need to add a \"Favourites\" tab to TC/iPhone. Like what TwitterFon did. I can't WAIT for your Twitter App!! :) Any ETA?",
"fromUserId":855523,
"toUserId":815309,
"languageCode":"en"
}

And to contain data parsed from this Json content, let's use a container
Bean like this:
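One possible sketch of such a Bean, together with a hand-written method that reads one entry from a JsonParser, might look like this (a reconstruction along the lines the post describes, not the original listing; field names follow the sample document above, and package names are from the Jackson 1.x line current at the time of writing):

```java
import java.io.IOException;

import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonToken;

public class TwitterEntry {
    long id;
    String text;
    int fromUserId, toUserId;
    String languageCode;

    // Reads one entry, advancing the parser past the matching END_OBJECT
    public static TwitterEntry read(JsonParser jp) throws IOException {
        if (jp.nextToken() != JsonToken.START_OBJECT) {
            throw new IOException("Expected content to start with an Object");
        }
        TwitterEntry result = new TwitterEntry();
        while (jp.nextToken() != JsonToken.END_OBJECT) {
            String field = jp.getCurrentName();
            jp.nextToken(); // advance from FIELD_NAME to the value token
            if ("id".equals(field)) {
                result.id = jp.getLongValue();
            } else if ("text".equals(field)) {
                result.text = jp.getText();
            } else if ("fromUserId".equals(field)) {
                result.fromUserId = jp.getIntValue();
            } else if ("toUserId".equals(field)) {
                result.toUserId = jp.getIntValue();
            } else if ("languageCode".equals(field)) {
                result.languageCode = jp.getText();
            } else {
                throw new IOException("Unrecognized field '" + field + "'");
            }
        }
        return result;
    }
}
```

A parser to feed it would be obtained from a JsonFactory, e.g. `TwitterEntry entry = TwitterEntry.read(new JsonFactory().createJsonParser(new StringReader(json)));`.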

Ok, now that's quite a bit of code for a relatively simple operation. On
the plus side, it is simple to follow: even if you have never worked with
Jackson or the Json format (or maybe even Java) it should be easy to grasp
what is going on and modify the code as necessary. So basically it is
"monkey code" -- easy to read, write, and modify, but tedious, boring and
in its own way error-prone (because it is boring). Another and
perhaps more important benefit is that this is actually very fast: there
is very little overhead, and it does run fast if you bother to benchmark
it. And finally, processing is fully streaming: the parser (and generator
too) only keeps track of the data that the logical cursor currently
points to (and just a little bit of context information for nesting,
input line numbers and such).

The example above hints at a possible use case for "raw" streaming
access to Json: places where performance really matters. Another case
may be where the structure of the content is highly irregular, so that
more automated approaches would not work (why this is the case will
become clearer in follow-up articles; for now I just make the claim), or
where there is a high impedance mismatch between the structure of the
data and that of the objects.

2. Writing to Stream-of-Events

Ok, so reading content using Stream-of-Events is a simple but laborious
process. It should be no surprise that writing content is about the
same, albeit with maybe a little less unnecessary work. Given
that we now have a Bean, constructed from Json content, we might as well
try writing it back (after, perhaps, modifying it in between). So
here's a method for writing a Bean as Json:
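Such a method, sketched along the lines the post describes (again a reconstruction, not the original listing; the nested Bean here is a minimal stand-in for the container Bean discussed above, and package names are Jackson 1.x era):

```java
import java.io.IOException;
import java.io.Writer;

import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonGenerator;

public class TwitterEntryWriter {
    // Minimal stand-in for the container Bean used on the reading side
    static class TwitterEntry {
        long id;
        String text;
        int fromUserId, toUserId;
        String languageCode;
    }

    // Writes one entry as a Json Object, field by field
    public static void write(Writer out, TwitterEntry entry) throws IOException {
        JsonGenerator jg = new JsonFactory().createJsonGenerator(out);
        jg.writeStartObject();
        jg.writeNumberField("id", entry.id);
        jg.writeStringField("text", entry.text);
        jg.writeNumberField("fromUserId", entry.fromUserId);
        jg.writeNumberField("toUserId", entry.toUserId);
        jg.writeStringField("languageCode", entry.languageCode);
        jg.writeEndObject();
        jg.close(); // flushes buffered output and closes the Writer
    }
}
```

Note how the generator mirrors the parser: where reading consumed one token per call, writing emits one token per call, and the generator tracks just enough nesting context to add the separating commas and colons for you.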

So as can be seen from the above, the basic Stream-of-Events approach is
quite a primitive way to process Json content. This results in both
benefits (very fast; fully streaming [no need to build or keep an object
hierarchy in memory]; easy to see exactly what is going on) and drawbacks
(verbose, repetitive code).

But regardless of whether you will ever use this API, it is good to at
least be aware of how it works, because this is what the other interfaces
build on: data mapping and tree building both internally use the raw
streaming API to read and write Json content.

And next: let's have a look at a more refined method to process Json:
Data Binding... stay tuned!