Time and Time Again: Managing Time in Relational Databases, Part 9: Completing Version Pattern 5

Subject: Time and Time Again: Managing Time in Relational Databases, Part 9: Completing Version Pattern 5 Sun 13 Apr - 20:06

We noted in Part 8 that Version Pattern 5 is equivalent to Version Pattern 3, but with the single episode constraint {SC 3-4} removed. However, in that same installment, we also pointed out a difference in the first semantic constraint for each pattern. Constraint {SC 3-1} states that "if a version for an object is followed by another version for that same object, the end date of the former version must be one tick of the clock prior to the begin date for the latter version." But for multi-episode objects, clearly, there may be any number of clock ticks between the logical delete version for one episode, and the initial version of the next episode. We accommodated this change by specifying that constraint {SC 5-1} has the same wording as {SC 3-1}, except for replacing "if a version for an object ..." with "if a non-terminal version for an object ..."

On examining constraints {SC 3-5} and {SC 3-6}, we found no differences. These constraints apply, without modification, to Pattern 5. This brings us to the last two constraints for Pattern 3. Do they also apply to Pattern 5? Except for the minor rewording of {SC 3-1} and dropping {SC 3-4}, are the semantic constraints for Pattern 5 exactly the same as those for Pattern 3?

Chart of Version Patterns Before continuing, we insert here the current chart of version patterns. References back to earlier patterns are happening pretty frequently now, and we hope this will help the reader recall what those prior version patterns were like.

The Importance of Semantic Constraints Before we proceed, let's remind ourselves of the importance of explicitly listing the semantic constraints for each pattern. There are two reasons for doing so. The first is that, together with the schemas for each pattern - the columns used to implement the pattern - these constraints are the formal and fully explicit definition of each pattern.

The scenarios show how each pattern works. The diagrams provide useful visual representations of what is going on. But the constraints and the schemas are the patterns.

The second reason for explicitly listing semantic constraints is that we will have fully and correctly implemented a version pattern if and only if a) we include the versioning columns defined for the pattern in all tables we wish to version; b) we enforce each of these pattern-specific constraints; and c) for all version patterns, we implement what I will call "temporal referential integrity."

Entity integrity constraints - the uniqueness of primary keys - are still enforced by the DBMS. But, as we shall see later on, we cannot use the DBMS to enforce referential integrity. When standards committees stop squabbling and produce a temporal data management standard, and then when vendors implement that standard, it is to be hoped that "temporal referential integrity" will be enforced by the DBMS. But because we are still in the "do it yourself" period, as far as the evolution of temporal database management is concerned, that's what we will have to do. We will have to implement temporal referential integrity ourselves, and also the specific constraints that define each version pattern. That means, in most cases, that we will have to write code. In general, that code will be written as before-insert triggers.

But this is getting ahead of ourselves. For now, let's wrap up Version Pattern 5 by looking at its last two semantic constraints.

Semantic Constraints {SC 3-7 and 3-8}Constraints 5 and 6 apply to initial versions of objects. For Pattern 3, these are the versions that implement the insert of a new object. For Pattern 5, these are the versions that implement the insert of a new episode of an object. Inserting a new object is thus a special case of inserting a new episode of an object. It is the special case of inserting an initial episode of an object.

Constraint 5 applies to the effectivity begin date, and constraint 6 to the logical delete date. Constraints 7 and 8 apply to the same pair of dates, but implement versioned updates rather than versioned inserts.

{SC 3-7} states the constraint on the effectivity begin date. It says of any noninitial, non-logical-delete version of an object, that "its ver-beg-dt must be at least one clock tick past the ver-beg-dt of the immediately prior version for the policy. If it is not, then the two versions will overlap for at least one clock tick. But that is invalid because there cannot be two versions of the truth at any given point in time."

This constraint also applies to Pattern 5, and for exactly the same reason. However, in reviewing Pattern 3, we have discovered that constraint 7 is a corollary of constraint 1. Constraint 1 states that "If a (non-terminal) version for an object is followed by another version for that same object, the end date of the former version must be one tick of the clock prior to the begin date for the latter version." This is the same thing as saying that "... the begin date of the latter version must be one tick of the clock later than the end date for the former version." If it must be one tick of the clock later, then it follows that it "must be at least one tick of the clock" later.

So constraint 7, first of all, should have said "exactly" instead of "at least." But in either form, it is not a distinct constraint because it follows from constraint 1. If code implemented constraint 1, there would be no need for additional code to implement constraint 7.

{SC 3-8} states the constraint on the logical delete date. It says, of any non-initial, non-logical-delete version of an object, that "obj-end-dt must be null. If it is not, the policy is being deleted; and, by assumption, this is not a delete transaction."

Version Pattern 5 has the same semantic constraint. It says that the object end date for a version has a value if and only if its corresponding object is being logically deleted. Note that obj-end-dt is a business date, not a date reflecting activity against the DBMS. (In general, when we characterize a column as a "business" or "logical" or "semantic" column, we mean the same thing. More precisely (since these terms emphasize different things, and thus are not truly synonyms), every column described in these articles is either all three of these things, or none of them.)

Infrastructure is a Hard SellMore than one company has turned down the opportunity to implement versioned history as an enterprise resource, as a data management layer on top of all production databases, and on which a view layer can be built to hide its complexities. The attitude within IT seems to be that they need to get on with the job of building or re-building new OLTP systems, warehouses, data marts or ODSs (Operational Data Stores). If versioned history is required, they will just add a date to the primary key and be done with it. It works like this.

Versioned history is usually introduced during a project to build some kind of data warehouse, some kind of database for which there is a requirement to keep some amount of versioned history (as defined in Part 1). The next time a database is being created or modified and versioned history again becomes a requirement, it is implemented for that database. If a data mart is being built or modified, one based on the dimensional model, and there is a requirement to provide not only dimensions as they are currently, but also dimensions as they were at some point in the past, then work is done and, eventually, the result is achieved.

This approach works, more or less. As time becomes increasingly important in databases, it will be used more and more frequently. But it is the wrong approach, and results in a set of databases with (a) hampered semantic interoperability; (b) little or no flexibility to accommodate additional semantic temporal requirements; (c) high maintenance and enhancement costs; (d) a higher proportion of serious (but often difficult to detect) semantic errors in the data; and (e) a database whose time management functions must be understood and manipulated by both end user and IT-developer query writers.

Let's consider each of these flaws in the ad hoc approach, one at a time.

Hampered semantic interoperability. An ad hoc, one-database-at-a-time approach might be represented by a set of databases, each of which uses one of these version patterns. Obviously, federated queries that range across two or more such databases will have to reconcile the different implementations of versioned history. As we have seen, just by considering these first five version patterns, that reconciliation is likely to be difficult and error-prone.Restricted flexibility. Suppose one of these version patterns has been implemented in a database, and that later on, the business comes back and asks for additional versioned history functionality. This is equivalent to asking for a new Version Pattern. If this involves nothing more than rewriting insert triggers, the changes will be relatively easy to implement. If it involves adding non-primary key columns to a versioned table, the changes will still be relatively easy to implement.But the change from Pattern 2 to Pattern 3, and another change that we will consider later on, both involve changes to the primary key of the versioned tables. These changes are still easy to implement, compared to changes in primary keys for non-versioned tables. The reason, as we will see, is that referential integrity for versioned tables is not implemented by storing primary key values in foreign key columns. This means that changes to the primary keys of versioned tables affect those tables only and do not "ripple out" to affect all foreign key-related tables.

Nonetheless, the costs of incrementally adding temporal functionality are not negligible. Whenever the schemas are changed, for example, all code that touches those schemas must be altered. Although, as described in Part 1, only view-creating and physical update code will directly access versioned tables, this can still be a significant burden when changes must be made. Far better to implement an "ultimate" versioned history solution, and then "turn on" specific functionality as and when it is requested.

High maintenance and enhancement costs. High maintenance costs result from the need to maintain different version-implementing codebases. High enhancement costs are just a corollary of the previous point, restricted flexibility.The main source of high enhancement costs is the cost of adding a date column to the primary key of each table being converted to a version table, together with the (erroneous) belief that foreign keys to the table being converted must also be changed. With primary key expansion, and the belief that foreign keys must be maintained when a table is converted to a versioned tables, foreign keys in all directly related objects must also be expanded. Thus, the cost of change is no longer simply the cost of adding a date to a primary key, in a single table. Furthermore, when foreign keys to the original table are themselves part of the primary keys in dependent tables, the need to modify foreign keys ripples out to the tables dependent on those tables. This ripple effect can continue indefinitely, and require changes to any number of tables. Another problem is that complex "trickle" processes must be developed to replicate changes in versions and to replicate terminating events to dependent objects.

Risk of errors. This is a special case of the previous point, high maintenance and enhancement costs.Risk of Misinterpretation of Query Results. In addition, there is the specific risk of errors in interpretation. Because of the complexity of versioned history, it is quite easy to write queries that return apparently valid results, but that are actually answers to a slightly different question than the one the query author had in mind. This risk of misinterpreting query results exists for all queries, of course. But we have found that it is an especially high risk when temporal queries are involved.The Biggest Risk: Internal Machinery Exposed to Query Writers with the Ad Hoc Approach Temporal management functions are infrastructure. It's just that standards committees can't agree on how to implement them, and so DBMS vendors hold back. This does not mean that temporal management functions should be developed on the fly, one database at a time. It means that they are infrastructure we must currently develop ourselves.

Although not necessarily so, every ad hoc implementation of versioned history that we have seen exposes the temporal management infrastructure to both developers and business users of the databases. It also asks modelers and developers to design and build perhaps the most complex semantic patterns that databases are asked to manage. And the ability to do this well, let alone the ability to do it correctly, is found only with the most advanced modelers and developers an IT department has. For those IT managers who think that modeling and database development are commodity skills, the management of temporal data is the strongest disproof possible.

Because temporal data management is infrastructure, it should be encapsulated within a data access layer that hides its complexity from modelers, developers and business users alike, as explained in Part 1. Just as designers and developers don't have to implement balanced-tree access methods, but make use of them every time they design and build a relational database, so too should they be able to use temporal data management without the internal machinery being exposed to them.

Infrastructure projects are always a hard sell. By definition, they don't satisfy any particular need of a specific business unit. They are paradigmatically enterprise work, work whose value is realized indirectly by the specific applications and databases that serve specific business users and purposes. The management of time with today's relational DBMS products is a "do it yourself" effort. For the reasons just articulated, it is work whose considerable cost should be born by the enterprise.

One final note. We have noticed that when infrastructure work is proposed, and objections are raised to it, at some point someone will say that what we are proposing is a "build it and they will come" approach. And, of course, everyone knows that that is a bad idea. But with infrastructure work, the only alternative to "Build it and they will come" is "get in, patch it, and get out as quickly and cheaply as possible."

We believe that when an enterprise solution to temporal database management is built, they will come. It is not the business that will come. But the business will make demands on developers that require temporal data management, and those developers will come.

Earlier, in the numbered paragraph entitled "High Maintenance and Enhancement Costs," we suggested that foreign keys don't work with versioned tables the same way they work with "normal" unversioned ones. It is important to be very clear about this. A common misunderstanding of versioning is that when a new version of an object is created, all foreign keys that pointed to the original version must be cascade-updated to point to the new one. Indeed, were it true that foreign keys work the same way in versioned and un-versioned tables, the problems would get even more complicated.

So next time, in Part 10, we will stop and examine the use of foreign keys when versioned tables are involved. We will see that DBMS-enforced referential integrity cannot be used. A new way of using foreign keys and enforcing referential integrity is required. And once again, because temporal database management is on hold as far as DBMS vendors are concerned, we will have to take a "do it yourself" approach.

Time and Time Again: Managing Time in Relational Databases, Part 9: Completing Version Pattern 5