Sunday, January 6, 2013

In my professional life I'm working on a server-side, appengine-based system whose next iteration needs to be really good at dealing with schema-less data; JSON objects, in practical terms.
To that end I've thrown together a simple document database layer to sit on top of appengine's ndb, in python.

gaedocstore

gaedocstore is a lightweight document database implementation that sits on top of ndb in google appengine.

Introduction

If you are using appengine for your platform, but you need to store arbitrary (data-defined) entities, rather than entities based on a pre-defined schema, then gaedocstore can help.

gaedocstore takes arbitrary JSON object structures, and stores them to a single ndb datastore object called GDSDocument.

In ndb, JSON can simply be stored in a JSON property. Unfortunately that is a blob, and so unindexed. This library stores the bulk of the document in first class expando properties, which are indexed, and only resorts to JSON blobs where it can't be helped (and where you are unlikely to want to search anyway).

gaedocstore also provides a method for denormalised linking of objects; that is, inserting one document into another based on a reference key, and keeping the inserted, denormalised copy up to date as the source document changes. Amongst other uses, this allows you to provide performant REST APIs in which objects are decorated with related information, without the penalty of secondary lookups.

Simple Put

When JSON is stored to the document store, it is converted to a GDSDocument object (an Expando model subclass) as follows:

Say we are storing an object called Input.

Input must be a dictionary.

Input must include a key at minimum. If no key is provided, the put is rejected.

If the key already exists for a GDSDocument, then that object is updated using the new JSON.

With an update, you can indicate "Replace" or "Update" (default is Replace). Replace entirely replaces the existing entity. "Update" merges the entity with the existing stored entity, preferentially including information from the new JSON.

If the key doesn't already exist, then a new GDSDocument is created for that key.

The top level dict is mapped to the GDSDocument (which is an expando).

The GDSDocument property structure is built recursively to match the JSON object structure.

Simple values become simple property values.

Arrays of simple values become a repeated GenericProperty. ie: you can search on the contents.

Arrays which include dicts or arrays become JSON in a GDSJson object, which just holds "json", a JsonProperty (nothing inside is indexed, or searchable).

Dictionaries become another GDSDocument.

So nested dictionary fields are fully indexed and searchable, including where their values are lists of simple types, but anything inside a complex array is not.
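The mapping rules above can be sketched in plain Python. This is only an illustration of the decision made for each value, not gaedocstore's code: it runs without appengine, and the helper name describe_storage and the sample person fields are hypothetical.

```python
import json

# JSON scalar types that can be stored as a simple, indexed property.
SIMPLE_TYPES = (str, int, float, bool, type(None))


def is_simple(value):
    """True for JSON scalars (strings, numbers, booleans, null)."""
    return isinstance(value, SIMPLE_TYPES)


def describe_storage(value):
    """Describe how a JSON value would be stored under the rules above.

    A dict is returned as a dict of descriptions, standing in for a
    nested GDSDocument; everything else is returned as a label.
    """
    if isinstance(value, dict):
        return {name: describe_storage(child) for name, child in value.items()}
    if isinstance(value, list):
        if all(is_simple(item) for item in value):
            return "repeated GenericProperty (indexed)"
        return "GDSJson blob (not indexed)"
    return "simple property (indexed)"


if __name__ == "__main__":
    # Hypothetical sample document; only the key "897654" comes from the text.
    person = {
        "key": "897654",
        "type": "Person",
        "name": "Fred",
        "tags": ["friend", "customer"],
        "address": {"city": "Stuffville", "phones": ["555-0100"]},
        "history": [{"year": 2012}, [1, 2]],
    }
    print(json.dumps(describe_storage(person), indent=2))
```

In this sketch "tags" and "address.phones" come out as searchable repeated properties, while everything inside "history" ends up in an unindexed blob.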

Storing a dictionary with the key "897654" creates a new person. If a GDSDocument with key "897654" already exists, then the put overwrites it. If you'd like to instead merge over the top of an existing GDSDocument, you can use aReplace = False, eg:

lperson = GDSDocument.ConstructFromDict(lperson, aReplace = False)
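The difference between the two modes can be sketched on plain dictionaries. This is a minimal illustration rather than the library's implementation: merge_document is a hypothetical helper, and recursing into nested dictionaries is an assumption, since the text only says that new information wins during a merge.

```python
import copy


def merge_document(stored, incoming, replace=True):
    """Return the document that would result from a put.

    replace=True  : the incoming JSON entirely replaces the stored one.
    replace=False : the incoming JSON is merged over the stored one,
                    with incoming values winning on conflict.
    """
    if replace:
        return copy.deepcopy(incoming)
    merged = copy.deepcopy(stored)
    for name, value in incoming.items():
        if isinstance(value, dict) and isinstance(merged.get(name), dict):
            # Assumption: nested dictionaries are merged field by field.
            merged[name] = merge_document(merged[name], value, replace=False)
        else:
            merged[name] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    # Hypothetical stored and incoming versions of the same person.
    stored = {"key": "897654", "name": "Fred", "address": {"city": "A", "zipcode": "1"}}
    incoming = {"key": "897654", "address": {"city": "B"}, "age": 40}
    print(merge_document(stored, incoming))                 # replace: name is lost
    print(merge_document(stored, incoming, replace=False))  # merge: name is kept
```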

Simple Get

All GDSDocument objects have a top level key. A normal ndb get (eg: calling get() on the ndb Key) is used to get objects by their key.

Querying

Normal ndb querying can be used on the GDSDocument entities. It is recommended that different types of data (eg Person, Address) are denoted using a top level attribute "type". This is only a recommended convention however, and is in no way required.

You can query on properties in the GDSDocument, ie: properties from the original JSON.

Querying based on properties in nested dictionaries is fully supported.

Denormalized Object Linking also supports pybOTL transform templates. gaedocstore can take a list of "name", "transform" pairs. When a reference key appears in a document being stored, like

{
...
"something": { key: XXX },
...
}

then gaedocstore loads the object that the key references. If the object is found, gaedocstore looks for a matching name in its list of transforms. If it finds one, it applies that transform to the loaded object, and puts the output into the stored GDSDocument. If no transform is found, then the entire object is put into the stored GDSDocument as described above.

eg:

Say we have the transform "address" as follows:

ltransform = {
"fulladdr": "{{.addr1}}, {{.city}} {{.zipcode}}"
}

You can store this transform against the name "address" for gaedocstore to find as follows:

GDSDocument.StorebOTLTransform("address", ltransform)

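Then when Person above is stored, it'll have its address placed inline in transformed form, ie: as the output of the "address" transform (a fulladdr string) rather than as the entire address object.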

And if the object is recreated in the future, then that linked data will be reinstated as expected.

Similarly, if an object is saved with a link, but the linked object can't be found, "link_missing": True will be included in the stored object.
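The link handling described in this section can be sketched in plain Python. Several things here are assumptions for illustration: the STORE and TRANSFORMS dictionaries stand in for the datastore and the transform list, transforms are matched on the property name, the key is kept alongside the inlined data, and the tiny {{.field}} substitution is a stand-in for bOTL, which is more capable. The address and person values are made up.

```python
import re

STORE = {}       # stands in for the datastore: key -> stored dict
TRANSFORMS = {}  # transform name -> template dict

# Matches placeholders of the form {{.fieldname}}.
PLACEHOLDER = re.compile(r"\{\{\.(\w+)\}\}")


def apply_transform(template, source):
    """Fill each {{.field}} placeholder from source (flat fields only)."""
    def fill(text):
        return PLACEHOLDER.sub(lambda m: str(source.get(m.group(1), "")), text)
    return {name: fill(text) for name, text in template.items()}


def resolve_links(document):
    """Return a copy of document with each {"key": ...} link inlined."""
    resolved = {}
    for name, value in document.items():
        if isinstance(value, dict) and set(value) == {"key"}:
            target = STORE.get(value["key"])
            if target is None:
                # Linked object not found: flag the link instead.
                resolved[name] = {"key": value["key"], "link_missing": True}
            elif name in TRANSFORMS:
                inlined = apply_transform(TRANSFORMS[name], target)
                inlined["key"] = value["key"]
                resolved[name] = inlined
            else:
                # No transform: inline the entire linked object.
                resolved[name] = dict(target)
        else:
            resolved[name] = value
    return resolved


if __name__ == "__main__":
    STORE["1234"] = {"key": "1234", "type": "Address", "addr1": "1 Thing Street",
                     "city": "Stuffville", "zipcode": "54321"}
    TRANSFORMS["address"] = {"fulladdr": "{{.addr1}}, {{.city}} {{.zipcode}}"}
    person = {"key": "897654", "type": "Person", "name": "Fred",
              "address": {"key": "1234"}, "employer": {"key": "9999"}}
    print(resolve_links(person))
```

The sketch only covers the moment a document is stored; keeping the inlined copy up to date when the source address later changes is not shown.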

Updating denormalized linked data back to parents

The current version does not support this, but in a future version we may support the ability to change the denormalized information, and have it flow back to the original object. eg: you could change addr1 in address inside person, and it would fix the source address. Note this won't work when transforms are being used (you would need inverse transforms).

Storing deltas

I've had a feature request from a friend, to have a mode that stores a version history of all changes to objects. I think it's a great idea. I'd like a strongly parsimonious feel for the library as a whole: it should just feel like "ndb with benefits".