When Mike Stonebraker and I discussed RDF yesterday, he quickly turned to suggesting fast ways of implementing it over an RDBMS. Then, quite characteristically, he sent over a paper that allegedly covered them, but was actually about closely related schemes. Edit: The paper has a new, stable URL. Hat tip to Daniel Abadi.

All minor confusion aside, here’s the story. At its core, an RDF database is one huge three-column table storing subject-property-object triples. In the naive implementation, any query that touches more than one property has to join this table to itself, once for each additional property. Materialized views are a good start, but they only take you so far.
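To make the self-join problem concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `triples`, the sample books, and the property names are all made up for illustration; nothing here comes from the paper.

```python
import sqlite3

# Hypothetical sample data: a tiny triple store held in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("book1", "type", "Book"),
        ("book1", "author", "Codd"),
        ("book1", "year", "1970"),
        ("book2", "type", "Book"),
        ("book2", "author", "Date"),
        ("book2", "year", "1975"),
    ],
)

# "In what year was each book by Codd published?"
# Three properties are involved, so the table appears three times.
triple_rows = conn.execute(
    """
    SELECT t.subject, y.object
    FROM triples AS t
    JOIN triples AS a ON a.subject = t.subject
    JOIN triples AS y ON y.subject = t.subject
    WHERE t.property = 'type'   AND t.object = 'Book'
      AND a.property = 'author' AND a.object = 'Codd'
      AND y.property = 'year'
    """
).fetchall()

print(triple_rows)
```

Every extra property in the question adds another copy of the full triples table to the join, which is why the naive layout gets expensive quickly.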

Subsequent implementation ideas exploit the fact that the set of possible properties is often small and known. Thus, in principle you can set up a bunch of two-column subject-object tables, one for each property. This is one idea discussed in the paper. Or you can selectively group the properties together into fewer wider tables. This is another idea discussed in the paper. Or you can create one super-wide table, with one column for each possible property. That’s the idea Mike actually advocated to me on the phone.
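The first of those ideas, one two-column table per property, might look roughly like the sketch below. It again uses `sqlite3`, and the `prop_` table-name prefix and the sample data are my own assumptions, not anything taken from the paper.

```python
import sqlite3

# Hypothetical sample triples (subject, property, object).
triples = [
    ("book1", "type", "Book"),
    ("book1", "author", "Codd"),
    ("book1", "year", "1970"),
    ("book2", "type", "Book"),
    ("book2", "author", "Date"),
    ("book2", "year", "1975"),
]

conn = sqlite3.connect(":memory:")

# One subject-object table per distinct property.
# The property set is small and known, so it can drive the schema.
properties = sorted({prop for _, prop, _ in triples})
for prop in properties:
    conn.execute(f'CREATE TABLE "prop_{prop}" (subject TEXT, object TEXT)')
for subject, prop, obj in triples:
    conn.execute(f'INSERT INTO "prop_{prop}" VALUES (?, ?)', (subject, obj))

# The same kind of question now joins three small tables
# instead of three copies of one huge table.
partitioned_rows = conn.execute(
    """
    SELECT a.subject, y.object
    FROM prop_type AS t
    JOIN prop_author AS a ON a.subject = t.subject
    JOIN prop_year AS y ON y.subject = t.subject
    WHERE t.object = 'Book' AND a.object = 'Codd'
    """
).fetchall()

print(partitioned_rows)
```

The joins are still there, but each one touches only the rows for a single property, and the `property = '...'` filters disappear because the table name already carries that information.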

This last table is obviously very sparse; even though it has a large number of columns, each row will contain exactly three non-null values. But it so happens that a column store such as Vertica is beautifully suited for such a schema. Nulls get compressed to nothingness, automagically, for “free”. There’s a built-in sort on every column, making self-joins very fast. The “extra” columns are generally irrelevant to performance when you’re only really interested in three at a time. And adding a new column to the schema after the fact isn’t hard at all.
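For the shape of the wide-table schema, here is a rough sketch, again in `sqlite3`. Two loud caveats: SQLite is a row store, so this shows only the schema and none of the column-store compression or sorting benefits; and I am assuming one row per subject, with the table name `wide` and all the data invented for the example.

```python
import sqlite3

# Hypothetical sample triples; each subject uses only some properties.
triples = [
    ("book1", "type", "Book"),
    ("book1", "author", "Codd"),
    ("person1", "type", "Person"),
    ("person1", "name", "Ted"),
]

conn = sqlite3.connect(":memory:")

# One column per known property, plus the subject column.
properties = sorted({prop for _, prop, _ in triples})
columns = ", ".join(f'"{prop}" TEXT' for prop in properties)
conn.execute(f"CREATE TABLE wide (subject TEXT PRIMARY KEY, {columns})")

# Assumption: one row per subject; unused properties stay NULL.
for subject, prop, obj in triples:
    conn.execute("INSERT OR IGNORE INTO wide (subject) VALUES (?)", (subject,))
    conn.execute(f'UPDATE wide SET "{prop}" = ? WHERE subject = ?', (obj, subject))

# A property that shows up later becomes a new column.
conn.execute('ALTER TABLE wide ADD COLUMN "year" TEXT')
conn.execute('UPDATE wide SET "year" = ? WHERE subject = ?', ("1970", "book1"))

# Queries name only the columns they care about; no self-join needed.
wide_rows = conn.execute(
    'SELECT subject, "author", "year" FROM wide WHERE "type" = ?', ("Book",)
).fetchall()

print(wide_rows)
```

In a row store, every one of those NULLs still has to be represented in each row, which is exactly the cost a column store avoids.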

Perhaps not coincidentally, two of the paper’s four authors seem to be involved with Vertica.