The best invention in my Business Intelligence career is the adoption of the concept of having a LoadDate.

This should be a post to give credit to Dan Linstedt and thank him for inventing Data Vault and introducing LoadDate.

When I encountered the hub-and-spoke concept of Data Vault I thought, why on earth do I need a LoadDate? Combining in a Satellite the HubHashKey with the LoadDate to form the Primary Key makes sense. But why would I need it in a Hub or a Link? Just to know when a Hub entry was first introduced to the data warehouse? The usage is universal and far reaching!

What is the LoadDate? The definition is:

„The LoadDate is the date and time of insert for the record“

It is the moment when a record is first inserted into a Data Warehouse. It is not the extract date from the source or the current datetime in the source system. It is really the INSERT datetime in my Data Warehouse! The first time a record exists.

How to ingest it:

Save the value in a variable and insert it with the data set into the load table

Set a default constrain like SYSDATETIME() on the LoadDate column

What about when the insert takes a long time with a lot of data? Do I get a LoadDate for every second it loads? This would be very precise. In my world one LoadDate is one batch. One batch could be the load of one file. The next file has a new LoadDate. When loading data in a loop of something, each loop has a new LoadDate.

When the record has been inserted into the load table, the LoadDate stays with the record. Where ever it goes! Like a tracking marker.

The LoadDate is used in:

Load

Data Store

Hub

Link

Satellite

Dimension

Facts

I use it everywhere!

In a Data Store (aka Persistent Staging Area) I use it to order the data and batch-load the data to the target. Together with the SequenceNo it forms also the Primary Key.

In Link and Hub I know when the first appearance of a business key happened.

In Raw Vault Satellites I use it to load data incrementally from the Data Store.

In Business Vault Satellites where multiple source tables are joined together, which LoadDate succeeds? If history is important, I use it to create mini-Point-in-Time tables. If history is not important, I use the most recent record and calculate the MAX value of all LoadDates.

How can I use it in Dimensions & Facts? The main question in those objects is, when do I need to UPDATE columns with new data? When I was researching the best way, I used first to calculate a ChangeHash. Very time consuming! After detecting the formula above, I thought, why not use the LoadDate for change detection? If for a given Business Key the LoadDate changes, chances are high that the record needs an UPDATE.

With this approach I’m much faster now. And I’m also faster in preparing the data in a stage table as I can use the LoadDate for the delta loading layer too.

Are there any other similar concepts? In SQL Server one could use rowversion (altered from timestamp). It is a system-wide incremental number saved as a binary. But copying data into the next table will create a new rowversion except you extract the value and save it as a normal value. In my view not easy to handle and not readable. How do you know, when record 1234 was inserted?

Dan Linstedt: Thanks for introducing LoadDate to the Business Intelligence Community and forcing users to use it correctly. It is of tremendous help! I use it in batch loading, delta loading and change detection (only in Dimensions & Facts). I couldn’t do without it anymore. Thanks a lot!

One Comment

[…] If you are going to model your data lake or file repository to build data vault objects, this layer is a must have. You’ll want to preserve, when the data first appeared in your data lake. The definition of the load_date can be found here. […]