Arbitrarily-Structured Data in Relational Databases

The goal is not to build an effective schemaless database on top of a relational database, but rather to accommodate rapidly-evolving relational schemas and reduce the difficulty of migrating forward.

Hypothesis

In an evolving relational (SQL) database schema, we store two kinds of data: data we query against and data we only display. There is often a subset of display data that will not be queried in the foreseeable future, and this is the data whose structure changes most often.

Solution

Store query data and display data separately such that display data is less strictly-structured and thus more easily evolved.

Imagine a standardized table structure where each table has the following columns: id, time_created, time_updated, _data, and additional “index columns”.

The _data column contains a dictionary of arbitrary data serialized into JSON (or could be zlib-compressed Pickle if it were Python-specific). Index columns are columns which you query against.
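As a minimal sketch of this layout (the table name, the email index column, and sqlite3 as the backing store are all assumptions for illustration):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Standardized layout: id, timestamps, a serialized _data blob,
# and one "index column" (email) extracted for querying.
conn.execute("""
    CREATE TABLE account (
        id           INTEGER PRIMARY KEY,
        time_created TEXT DEFAULT CURRENT_TIMESTAMP,
        time_updated TEXT DEFAULT CURRENT_TIMESTAMP,
        _data        TEXT NOT NULL,
        email        TEXT
    )
""")
conn.execute("CREATE INDEX idx_account_email ON account (email)")

# Display data goes into _data as JSON; email is duplicated into
# its index column because we query against it.
data = {"email": "a@example.com", "display_name": "Alice", "theme": "dark"}
conn.execute(
    "INSERT INTO account (_data, email) VALUES (?, ?)",
    (json.dumps(data), data["email"]),
)

# Query by the index column, then deserialize _data for display.
row = conn.execute(
    "SELECT _data FROM account WHERE email = ?", ("a@example.com",)
).fetchone()
loaded = json.loads(row[0])
print(loaded["display_name"])  # -> Alice
```

Note that the display-only fields (display_name, theme) never touch the schema; only email needed a real column.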

Example

A typical user table might have email, password_hash, and password_salt as index columns (defined, say, in an SQLAlchemy declarative model) and serve queries such as:

-- Load the user object from the current session (where we store the user_id)
SELECT * FROM user WHERE id = :user_id;
-- Check the given password against the email address, for login
SELECT password_hash, password_salt FROM user WHERE email = :user_email;
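A hedged sketch of the corresponding SQLAlchemy declarative model (column types, lengths, and the timestamp defaults are assumptions; only the column names come from the queries above):

```python
import json
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String, Text, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class User(Base):
    __tablename__ = "user"

    # Standardized columns present on every table.
    id = Column(Integer, primary_key=True)
    time_created = Column(DateTime, default=datetime.utcnow)
    time_updated = Column(DateTime, default=datetime.utcnow,
                          onupdate=datetime.utcnow)
    _data = Column(Text, nullable=False, default="{}")  # serialized JSON

    # Index columns, matching the queries above.
    email = Column(String(255), unique=True, index=True)
    password_hash = Column(String(128))
    password_salt = Column(String(32))


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(User(
        email="a@example.com",
        password_hash="h", password_salt="s",
        _data=json.dumps({"display_name": "Alice"}),
    ))
    session.commit()
    found = session.query(User).filter_by(email="a@example.com").one()
    found_hash = found.password_hash
```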

And perhaps we would build a framework on top of SQLAlchemy which would let us access columns as user.display_name or user.email regardless of whether each is an extracted index column or an element buried in _data.
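One hedged sketch of that access pattern, as a plain wrapper rather than actual SQLAlchemy integration (the Record class and its fields are assumptions): real columns are checked first, with a fallback into the parsed _data dictionary.

```python
import json


class Record:
    """Expose indexed columns and _data entries as uniform attributes."""

    def __init__(self, columns, data_json):
        # Real (indexed) columns always supersede _data entries.
        self._columns = dict(columns)
        self._data = json.loads(data_json)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name in self._columns:
            return self._columns[name]
        if name in self._data:
            return self._data[name]
        raise AttributeError(name)


user = Record(
    columns={"id": 1, "email": "a@example.com"},
    data_json=json.dumps({"display_name": "Alice",
                          "email": "stale@example.com"}),
)
print(user.email)         # indexed column wins: a@example.com
print(user.display_name)  # falls back to _data: Alice
```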

Process: Adding an index

Build a table with just a free-structure _data field.

Determine the queries you need, then extract the relevant properties into indexed columns:

ALTER TABLE to add the column

Run full-scan query to populate new column with data

Add relevant index onto said column

Deprecate the property from _data (optional; we could simply assume that proper columns always supersede _data attributes)
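The steps above can be sketched end to end (sqlite3 and the age column are assumptions for illustration):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# 1. Start with a free-structure table: just id and _data.
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, _data TEXT NOT NULL)")
for name, age in [("Alice", 34), ("Bob", 27)]:
    conn.execute("INSERT INTO user (_data) VALUES (?)",
                 (json.dumps({"display_name": name, "age": age}),))

# 2. We now need to query on age: ALTER TABLE to add the column...
conn.execute("ALTER TABLE user ADD COLUMN age INTEGER")

# 3. ...run a full-scan query to populate it from _data...
for row_id, data_json in conn.execute("SELECT id, _data FROM user").fetchall():
    age = json.loads(data_json).get("age")
    conn.execute("UPDATE user SET age = ? WHERE id = ?", (age, row_id))

# 4. ...and add the index.
conn.execute("CREATE INDEX idx_user_age ON user (age)")

over_30 = conn.execute(
    "SELECT COUNT(*) FROM user WHERE age > 30").fetchone()[0]
print(over_30)  # -> 1
```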

Concerns

Adding an index may affect database performance during the process, due to data locality: the new column and its index are built on the live table. With FriendFeed’s approach, the indices are stored in their own tables, which can even be sharded across databases, so this removes any performance concerns during schema transitions. On the other hand, FriendFeed’s approach reduces data locality, which increases the number of joins required to fetch the desired data, and also reduces the semantic meaning of the tables, making queries more complex.

Performing unexpected demographic analysis on large datasets would be much slower if the relevant fields (such as age or gender) are stored in _data, since that requires a full table scan instead of an in-database aggregate query.
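The contrast can be sketched as follows (sqlite3 and the gender field are assumptions): with an index column the database aggregates for us, while a field buried in _data forces every row through the application.

```python
import json
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user (id INTEGER PRIMARY KEY, _data TEXT, gender TEXT)")
for g in ["m", "f", "f"]:
    conn.execute("INSERT INTO user (_data, gender) VALUES (?, ?)",
                 (json.dumps({"gender": g}), g))

# Index column: a single in-database aggregate query.
in_db = dict(conn.execute(
    "SELECT gender, COUNT(*) FROM user GROUP BY gender"))

# Buried in _data: full scan, fetching and parsing every row.
scanned = Counter(json.loads(d)["gender"]
                  for (d,) in conn.execute("SELECT _data FROM user"))

# Both give in_db == {"f": 2, "m": 1}, but the scan moves
# every row's JSON across the wire and through the parser.
```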

Parsing and storing the _data dictionary has performance implications, too. cPickle is best for parsing performance, and zlib-compressed cPickle is perhaps the best trade-off between data size and parsing performance, but neither is portable beyond Python. zlib-compressed JSON reduces data size and is portable across languages, but increases parsing time. This could also be done natively in PostgreSQL using HSTORE (via jek).
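The size trade-off is easy to measure with the standard library (the sample payload is an assumption, deliberately repetitive so it compresses well; in Python 3, pickle replaces cPickle and already uses the C implementation):

```python
import json
import pickle
import zlib

# A display-data dictionary with some repetitive text.
data = {"display_name": "Alice", "bio": "hello world " * 200, "theme": "dark"}

raw_json = json.dumps(data).encode("utf-8")
raw_pickle = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

sizes = {
    "json": len(raw_json),
    "json+zlib": len(zlib.compress(raw_json)),
    "pickle": len(raw_pickle),
    "pickle+zlib": len(zlib.compress(raw_pickle)),
}
print(sizes)

# The round trip still works after compression.
assert json.loads(zlib.decompress(zlib.compress(raw_json))) == data
```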