Rationale

A few years ago, server-side JavaScript was unimaginable. Today, at the beginning of 2012, more and more businesses rely on the high performance, low development costs, short time to market, and explosively growing library ecosystem of the Node.js platform. Node.js is not an exception, however, but rather a confirmation of the rule that JavaScript is the most potent environment for software evolution available today. Other notable JavaScript ecosystems with explosive growth are Firefox Extensions, OS X Dashboard Widgets, Chrome Extensions, and of course the client side of the web, with millions of libraries, frameworks and applications.

However, one very important area where this kind of explosive evolution is desperately needed, but where it is not happening, is database development. We only have a handful of projects to choose from, and even fewer architectural models. Problems like clustering, interfacing, query languages, persistence strategies, etc. are currently mostly the domain of lower-level languages. Intrinsic datastores for Node.js is an attempt to support this portion of the Node.js ecosystem.

What we inevitably see during the evolution of almost any database product is its extension with some form of a secondary language (the query language being the primary one). This comes either in the form of stored procedures (e.g. T-SQL, PL/SQL, etc.) or a scripting language (e.g. Lua in Redis).

So the idea here is to bring datastore functionality and scripting into the same process, just as dedicated databases do with stored procedures, but the other way around: instead of bringing scripting to the database, bring the database to the scripting environment:
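As a rough sketch of the idea (the names here are illustrative, not an actual API of this project), an in-process datastore is just a module the application requires and calls directly, with no wire protocol in between:

```javascript
// Hypothetical sketch: the datastore lives in the same process as the
// application, so a "query" is an ordinary function call -- no sockets,
// no serialization, no wire protocol.
function createStore() {
  const data = new Map();
  return {
    set(key, value) { data.set(key, value); },
    get(key) { return data.get(key); },
    del(key) { return data.delete(key); },
  };
}

// Application code and "database" code share the same VM:
const db = createStore();
db.set('user:1', { name: 'Ada' });
const user = db.get('user:1');   // direct memory access, not a round trip
```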

Advantages, when building standalone database servers:

Utilization of the Node.js platform and its ecosystem to evolve database products.

Advantages, when using this approach to join the application and the database layer:

The OS will not have to handle the extra TCP/IP or IPC traffic that out-of-process databases incur.

In-VM

nStore - "uses a safe append-only data format for quick inserts, updates, and deletes. Also an index of all documents and their exact location on the disk is stored in memory for fast reads of any document"

node-tiny - "largely inspired by nStore, however, its goal was to implement real querying which goes easy on the memory"

Scratchpad

Is v8 good for in-memory data storage? Data would be a first-class citizen, and a lot of wheel-reinventing could be avoided. v8 translates JS directly into machine code; how can this best be leveraged?

A simple test on Node v0.7.4 on a Mac revealed that on my 2.8 GHz dual-core machine, about 40M objects is where v8 starts to choke. Given the high level of optimization already done in v8, it is probably safe to assume that any significant improvement of v8's GC on the current architecture is not possible, at least not without adding significant memory overhead or significantly rewriting the current implementation. New possibilities are on the horizon in the form of GPU-supported GC, albeit patent-encumbered (see here and here), but we have already seen far worse situations where patent-free solutions were developed working around the existing patents, e.g. WebM vs. H.264, and many others. At the present time (2/2012), however, it is best not to consider node/v8 a good store for a large number of objects.
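The rough shape of such a test (a hypothetical reconstruction, since the original code is not preserved here) is to allocate many small objects in a loop and watch heap usage and GC pauses grow:

```javascript
// Hypothetical reconstruction of the allocation test: fill the heap with
// many small objects and observe memory via process.memoryUsage().
// Scale `count` toward 4e7 (and raise --max-old-space-size) to stress GC.
function allocate(count) {
  const objects = new Array(count);
  for (let i = 0; i < count; i++) {
    objects[i] = { id: i, payload: 'x' };   // one small object per slot
  }
  return objects;
}

const held = allocate(1e6);   // keep a reference so GC cannot collect them
const mb = process.memoryUsage().heapUsed / 1024 / 1024;
console.log('heapUsed after ' + held.length + ' objects: ' + mb.toFixed(1) + ' MB');
```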

Plans

decide on a good set of primitive data structures and operations that would allow modeling the most common DB use cases, including pub-sub

binary safe keys and data

map, ordered-map, deque

key timeouts

event emitter

atomic ops? transactions? (plan ahead for the concurrent impl?)

provide drop-in replaceable implementations with varying tradeoffs:

all data in memory (with or without secondary storage) - fastest, RAM-limited

all keys in memory - still pretty fast, not-so RAM-limited

keys and data brought into memory on demand

provide further variations:

single-process - fast, but multiple cores and multiple Nodes cannot work with the same data; clustering must be applied

shared-memory implementation - some overhead and latency, but higher total throughput beyond a certain number of cores (atomic ops and an async API become necessary at this point)
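To illustrate the atomic-ops requirement of the shared-memory variant: in today's JavaScript this shape exists natively as SharedArrayBuffer plus Atomics (a counter several worker threads could share; shown single-threaded here for brevity):

```javascript
// A counter in shared memory, updated with an atomic read-modify-write.
// With worker threads, each thread would receive the same SharedArrayBuffer
// and increments would remain race-free.
const sab = new SharedArrayBuffer(4);
const counter = new Int32Array(sab);

function increment() {
  return Atomics.add(counter, 0, 1);   // returns the previous value
}

increment();
increment();
```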