Time is confusing. Sometimes, it's so confusing that our thinking about it changes... over time. Which makes
it triply confusing, or something. In this post, I want to talk about two aspects of temporal data, both
common, useful and often misunderstood:

Homogeneous vs Inhomogeneous

Temporal vs Bitemporal

Homogeneous Temporal Data

The first thing to stress is that we're talking about
homogeneous data, because "homogenous" means something completely different.

In a homogeneous, temporal data set, one column will contain a known sequence of
discrete time values, and the other
columns will contain whatever information is associated with those times.

You would
query for an exact match in time, e.g., in SQL

select*wheretime=3

and expect to find data exactly where you asked for it.
Such a data set might contain a daily closing stock price, or the maximum temperature observed in each
hour of the day -- anything, as long as the time values are discrete and known, and you expect to find
data at every one of them.

Inhomogeneous Temporal Data

By contrast, inhomogeneous temporal data holds observations that occur at arbitrary times.
You typically query for information as of a point in time, expecting to get back the most
recent observation that isn't after the query time:

In SQL, the query is a bit awkward

select*wheretime<=3.25orderbytimedesclimit1

but would be reasonably efficient in many databases, as long as time was indexed properly.
There's an an O(log N) to find the first acceptable time value, and then we just return it. (Some
databases will pathologically insist on realizing the entire set of previous times, in which case
you'll have to do something cleverer.)

Such a data set might contain posted limit orders for a given stock, or someone's home address indexed by
the time they moved into it. The time values are generally not known in advance, and you certainly don't
expect to find data at every possible time.

Bitemporal Data

Bitemporal data is inhomogeneous data where we care not just about the time to which the data pertains, but
also about the time when we started thinking that did.

True story. Earlier this year, I decided to write down the recent history of my residences:

Last week, a friend reminded me that I was briefly the Prime Minister of England, so I had to revise the
list. Then, last night while I was falling asleep, I suddenly remembered that I had moved in with the
Browns on Hallowe'en, and that their address was actually 32. At this point, it seemed like a good idea to
record what I had thought and when, if only so I could reconstruct which lies I had told on
which credit card applications:

In the spirit of academic rigor, I labeled my columns with the standard convention of TV for
the time to which the value pertains and TT for the transaction time when it was
entered into the ledger.

Purely functional queries

Enough about me; let's switch to some fictional data. The TV-axis below denotes the time at which
a measurement was taken. The value of the measurement is recorded as vertical height, and the time
it was recorded is on the TT-axis.

As of TT=10, we knew about points at TV=0.6, 1.5, 2.5, 4.4 and 5.6; at TT=12, we learned about a new
measurement at TV=3.5; at TT=17, we revised the measurement at TV=5.6 to a slightly higher value.

Now, imagine a query for TV=4, TT=10. It's not particularly graceful to cram this into SQL, but
if we did, the query would be something like:

select*whereTV<=4andTT<=10orderbyTVdesc,TTdesclimit1

Note that this won't always be efficient, for reasons we'll discuss below. Graphically,
the blue arrows below start at the point of the query and end at the resolved value for queries
as of (TV=4, TT=10) and (TV=6, TT=10):

If we executed the same queries but with TT=20, the resolutions would be different, as shown in
red:

Now, the most recent value visible as of TV=4 is from 3.5 rather than 2.5; similarly, the query at
TV=6 continues to see the point from 5.6, but now it has been "overwritten" to a different value.

Of course, nothing was really overwritten: if we repeated the query at TT=10, we'd still get back the
blue results. If we can promise that

transactions occur at strictly increasing TTs and

we never query for TT in the future beyond the next transaction TT

which is most easily achieved by

using the physical clock time for TT in transactions and

never querying for TT in the future,

then a query for any given (TT,TV) pair will always return the same result, and querying
is a purely functional operation.

Asymmetry

Here's a 2D view from below of a similar dataset: TV is on the x-axis;
TT is on the y-axis, and we're recording the color of a lamp, which can be green, blue, red or splotchy
purplish.
Here, the region below the diagonal line is forbidden by causality, since we can't possibly record something
before we observe it.

As of TT=3, there are two observations recorded, green at (TV=1.9, TT=2.9) and blue at (TV=2.3, TT=2.7)
A query for (TV=3, TT=3) will find the blue point at (TV=2.3, TT=2.7).

Why not green at (1.9, 2.9), which
is further from TV but closer to TT? Because
TT and TV are not precisely symmetric. We are asking the question, "at TT=3.0, what did
we think the color of the lamp was at TV=3.0?".
The alternate question, "what is the most recently recorded information as of TT=3.0
concerning the lamp's status at some time at or before TV=3.0?" is poseable, but not usually what we want.

At 3.3, we we spoke to a passing
samurai
who told us that, notwithstanding
previousreports,
the lamp was actually red.
Finally, at 4.25, a local
woodcutter
suggested that we split (get it?) the difference and call it splotchy purple.

Based on this revision history, queries for TV=3 as of TT=3, 4 and 5, respectively, would
return blue, red and purple.

Efficiency

As noted above, a single-sided SQL range query for both TV and TT is likely to be quite wasteful.
Typically, the optimizer will

Pick one of the two temporal columns, say TV for concreteness, but it could be either.

Do an O(log N) search to find the most recent acceptable TV value.

Scan down through TV, assembling a realization of all M rows matching the TT and TV constraints, taking
O(N) time and O(M) space.

Sort them by TT, taking O(M log M) time.

Take the first and discard the rest of the realization.

This is unacceptable for all but the tiniest data sets.

For many use cases, we can do better. Commonly, we can assume:

TT >= TV - i.e. we are storing observations, not predictions. (As in the examples above.)

Revisions are infrequent.

Typically, then, every datapoint will have just one observation, with TT slightly later than its TV
This can be then be nicely represented in a
multi-column index, which essentially
stores a sorted structure of concatenations of TV and TT. For the address example, it would look like

The latter is a private, persistent, immutable object, and nothing anybody else does, now or in
the future, can affect the results of queries we make against it, as long as its in scope. Of course,
if we call d/db sometime later, its return value will "contains" any transactions executed
in the interim. We can, if we want, be precise about exactly what interim means:

(def db (d/as-of (d/db conn) tx-time))

All dbs thus obtained with a particular tx-time will contain exactly the same data.
By implication, anything placed into Datomic is immutable. An existent value can't change, but a new
one can be added.

For consistency, transaction time is strictly increasing across the entire system;
it is exactly what we're looking for in a TT. TV, however, we'll have to roll ourselves.
As a concrete example, let's implement a simple bitemporal store. It could hold things like
stock prices for random tickers at random times of the day,
colors of lamps, addresses - anything observed at discrete times, then recorded and
possibly revised later. The code, such as it is, is on
github;
what I show here will be abbreviated for clarity.

Stripped of some boilerplate (i.e. this won't actually run), the schema defines four attributes:

[{:db/ident:bitemp/k:db/valueType:db.type/string:db/doc"Non-temporal part of the key"}{:db/ident:bitemp/tv:db/valueType:db.type/instant:db/doc"Time for which data is relevant":db/indextrue}{:db/ident:bitemp/index:db/valueType:db.type/string:db/doc"This is the mysterious and exciting part.":db/unique:db.unique/identity}{:db/ident:bitemp/value:db/valueType:db.type/string:db/doc"The value"}]

The k field is the non-temporal identifier of whatever we're measuring. It could be a stock
ticker, the name of the person whose address we're tracking, the location of colorful lamp, etc.

We explicitly keep tv; its instant type is basically a java data, which is basically
a number of milliseconds since the great epoch. Datomic will take care of TT.

Per the comments, :bitemp/index attribute is mysterious and exciting, so ignore what it does for the
moment.

This finds a database as of tt, looks in it for an entity (as inserted above) with a
specific k and tv, binds its ``vto?v``` and returns it.

Generally, we don't know exactly what TV to look for, which is where index comes in.
Datomic has an index-range function, which, for an indexed attribute, returns a lazy sequence
starting at the first entry greater than or equal to a specified value and increasing.
We want to devise entries
for this index such that this function will quickly locate that an exact match for k and
the first tvless than or equal to a target. The comparison value of our entries must
therefore decrease with increasing tv. We construct the field as follows:

The index will look like "e757fd4fedc4fe825bb81b1b466a0947-^@^@(#&%($"
Before the hyphen, we have
the md5 hash of the non-temporal part of the key. After the hyphen, we have TV in milliseconds,
subtracted from 10^14, and crammed into 8 characters.

For an exact k and a target tv, I
can construct the string and take the head of the sequence returned by index-range:

The (or (and e (.startsWith es kh)... business takes care of the possibilities that
index-range returned false, or that there were no appropriate tvs, so it fell through
to the next k.

When we call get-at, the TT part of the query is handled by Datomic's as-of, while the
TV part uses our fancy composite index.

Magic, part II (or the lack thereof)

It's all very well and good to say that Datomic is "handling" the TT query, but how exactly
is it doing that? Datomic's persistent data structures are clever, but under the covers they must be
susceptible to the complexities discussed above in the Efficiency section.
Datomic being closed source, it's hard to find out exactly how it executes queries, but some
information has been gleaned from
forums and lectures. It seems that

To answer as-of query for moment T, current, in-memory and
history parts get merged, and then all data with timestamp after
moment T is ignored. Note that as-of queries do not require
older versions of current index, they use most recent current
index and filter it by time, deducing the previous view of the
database.

So we're implicitly making the assumption proposed above, of infrequent revisions. If revisions for the same
query parameters are rampant, then Datomic will scan over the set of all versions and "filter it by time",
which will, as in the SQL case, be dreadful.

To test this implementation realistically, we'll want to insert a lot of entries, and to do so in reasonable
batch sizes. A single transaction can of course encompass multiple operations:

Since I can't think of any more funny addresses, the test will involve random data.
In a single transaction, we'll insert things like nKeys different keys, of the form Thing${kKey};
we'll do this over nTv explicitly different times; and we'll repeat the exercise nTt times,
implicitly picking up increasing transaction times, which we then return. The values will be of
the form Thing${kKey}v${jTv}t${iTt} so we can easily verify later if they're correct.

Note that we've made explicit choices about how easy to make this for Datomic.
Remember that the index is going to be stored in an order that is first increasing in kKey
and then decreasing in jTv. For every (kKey,jTv), there will be a block of differing
transaction times. So our insertions are in a raster pattern:

Depending on our planned access patterns, we might want to restructure our index key someday, so that
temporally proximate insertions are spatially proximate as well. For example, if we were storing stock
prices that tended to arrive in increasing TV, we might put that part of the key first. Our decision
will reflect both our access patterns (which we may be able to know in advance) and Datomic's caching
strategy (which is a little less clear and, unfortunately, not directly observable).

For batches of 20 or so keys, on the free, local storage variant of Datomic, insertions stabilize at around
750-1000 microseconds. Random reads (note the assertion that we're finding the values we looked for)

stabilize at around 500-750 microseconds, even for data volumes in excess of JVM heap size.

In conclusion, Datomic is a very promising option for storing inhomogeneous bitemporal data, but we
must always think carefully about how are access will actually be implemented, to avoid cache thrashing.
The presentation of the database as a persistent data structure, with scrolling access to previous states,
naturally captures the TT dimension, while TV needs to be implemented schematically; the
assymetrical relationship between the two times means we can't do it the other way around.