Thursday, April 29, 2010

So I have been playing with various database systems, many pertaining to the NoSQL category. Here are some thoughts:

Mongo is cool. I definitely like it. However, it differs from Cache in one important way: JSON objects are native to JavaScript. To everything else, they are just a text format (that Python can understand easily, true) not necessarily any more efficient than XML. Cache objects are more or less portable across languages and the impedance mismatch between the consumer and the database is definitely much less significant than in the case of Mongo;

Mongo is fast, and easy enough to understand for perhaps a dozen or two 'collections'. I am not sure how well it would support (or perform) with a 3000-table schema, which is not at all unlikely in an enterprise application. While the proliferation of tables is a perverse effect of relational normalization, the fact is that the relational model is easy to understand. Complex text representations of object hierarchies, which Mogo really allows for, might quickly spin out of control (assuming that the schema is kept under control by restricting access to the database through the front end, and object collections to not degenerate to the point of being simple bit buckets);

so Mongo might be best appropriate in an environment with a few deep entities with loose connections: e.g. 12-25 'tables' with million+ rows, especially for client apps that can read JSON (or derivatives: such as Python's collection objects) more or less natively.

...or how many buzzwords can you get in one title. Here is a shortish piece of code that pulls data from Twitter and inserts it into Mongo. Other than the shortness of the code (given what it accomplishes!), what is remarkable is the ease of use of the data that is passed around, with a minimum amount of marshalling: Twitter can return data in JSON which is the native Mongo format and Python can use with a minimum of tweaking (mostly to reduce the response from Twitter).

The similarities aren't surprising perhaps; it is the differences that trouble me (in this case, Mongo's looseness - lack of structure); although SQLite was the first one to go down that path, by not enforcing strict data typing, and now Mongo doesn't even enforce schemas. A discussion on Mongo database design principles here.

For now I have a couple of other questions:

is there a reporting tool that binds to JSON/Mongo natively?

how do you update an existing JSON entry? just one tuple, not the entire record; some notes:

var p = db.coll.findOne();

p.member (notation supported, p is an object already and there is no need to eval() it; originally, say p{member:"y"} ) = "x" and now p is disconnected from the collection, but db.coll.save(p)does update it in place

What is cool is that you can save JS objects (declared using the JS object notation):

function pobj(param){this.p1=param;}

var newObj = new pobj("test");

db.coll.save(newObj);

db.coll.find(); returns { "_id" : ObjectId("4bd722a6eb29000000007ac4"), "p1" : "test" }. You can even 'serialize' objects' methods, and then call the method for the objects deserialized using findOne. All of this might be JS-specific candy, I am curious how this ports over to other language drivers.

So you can view Mongo as a (JS) object-oriented database, with nothing in the way of SQL facilities though; a tuple serialization mechanism; a key-value pair list; a 'document'/hierarchical database using JSON as the document format (as opposed to xDB's XML), all of which are correct.

Another question: when you have an embedded object, var ptrUser = {name : "Mr Iwata", address : { city : "Tokyo }}, how do you search by the inner object properties? db.coll.find({address:{ city : "criteria" }} does not seem to work.RTfM

Also, if you store objects with different structures in one collection, they can be inspected:

Sunday, April 11, 2010

There is perhaps a more Pyhton-specific way of storing the data to be loaded into the Mongo database: a list of dictionaries. In this case, a dictionary is defined as {'name':__name__, 'service':__service__, 'web':__web__}.To add an element to the holding list (say, NHS): NHS.append({'name':'Wigan General', 'service':5, 'web':None}). Then, a function can be defined which will return the index of the list containing the element matching its parameter; i.e.:

>>> def idx(ky, val):

for item in NHS:

if item[ky] == val:

return NHS.index(item)

Usage:

>>> print idx('name', 'Wigan General')will yield Wigan's index in the list. I'm quite curious how fast this is with several thousand records! But Python's ability to easily make sense of a complex data structure is impressive.

Another way of searching, using list comprehensions:

>>> def idx2(ky, val):

lstIdx = [item[ky] == val for item in NHS]

return lstIdx.index(True)

It would also be interesting to know if the bytecode generated by Python is different between the two.

Saturday, April 10, 2010

Using Python, NHS open data (NHS Choices), and Mongo: for example, getting the name and the web sites of all the providers in the Wigan area (why Wigan? No idea, just that their football team seems to be pretty bad recently defeated Arsenal, and the name stuck with me).

Start the database: go to the bin subdirectory of the install directory, and type mongod –dbpath .\

The response will have a 3-item dataset, the service type, the provider name, and the web site (if one exists).

Mongo is a bit different in that the 'server' does not create a database physically until something is written to that database, so from the console client (launch, in \bin\: mongo) you can connect to a database that does not exist yet (use NHS in this case will create the NHS database - in effect, it will create files named NHS in the current directory).

Yet another non-relational database. There definitely seems to be a trend in that direction, a market corner that the RDBMS's do not address convincingly. I would guess that for a general purpose system, which might have power users/report writers, a full-fledged database makes sense, but for a turnkey system where speed is important and there is no need to expose too much of the internals to the users, this type of offering is useful.

Hence, streams are supersized binary (%Binary) or character (%String) types. The difference with other databases systems is that streams in Cache can be stored in external files or database global variables.

The following code:

Class CM.Patient Extends %Persistent{

Property PatientName As %String;

Property PatientData As %FileCharacterStream(LOCATION = "D:/");

}

Executed in Terminal:

> DO %^CD

> S p = ##class(CM.Patient).%New()

> DO p.PatientData.Write("Test")

Will create a temporary file on drive D: where the "Test" string will be written to; when the Terminal session ends, the file will be deleted as the object is garbage-collected. In the order for the stream to persist, you have to %Save() the object.