Virtual versioning for directories (database)

This article describes news easy algorithm for data versioning in databases. It is similar to Log trigger, but it is more universal and reliable. If this method has already been invented, I have not found it in the public domain.

How it all began?

In one project, we were tasked with managing and distributing many directories, sometimes over unstable communication channels and with a chance to roll back time on servers. I'll be glad to share my experience in this area with you.For technical and political reasons, technical and political reasons we didn't use an existing software program. There are no analyses of existing systems in this article. The project was made using language C# and MS SQL Server.

We used architecture "Master-Slave". One Master system contained a reference sample of directories, and enabled our client to edit the directories using the GUI or file import. Slave systems pulled the Master system for updates periodically. Data exchange was performed with WebAPI and JSON, because clients ranged from Windows Server to Android. There were about 10,000 Slave systems. They pulled the Master system about once per hour. It's not a resource-intensive process; nonetheless we also used file-caches and a system of patches to run the Master system on very weak servers.

We had different directories:

several related tables, each containing anywhere from 10000 to 1000000 records;

several versions of a directory can be active (tariffs for past, present and future); every version has own dates of action;

the Master system had to store all versions for 5 years; Slave systems can store versions for a year;

several orders of directories their have own dates of action;

a new version of a directory can be prepared in a few weeks by several people performing intermediate tests. Uncompleted versions must not

be available for Slave systems;

a new version can contain from one to one million new orders.

A special kind of directory is the Registry (e.g. a registry of cash machines):

a registry consists of one table, but it can relate to other dictionaries;

only the latest version is active;

usually a new version of a registry contains only one new order.

We had developed common method of storage and distribution for directories an registries. It's lightweight, and you can use it even if your data structures are easier.

We developed a common method of storage and distribution for directories and registries. It's lightweight, and you can use it even if your data structures are simpler.

Problems

The first question: what is version number? There are only a few variants: date and time of the directory's editing and an integer number (auto-increment). "DateTime" is a very popular variant, but it has some drawbacks:

the server time can be rolled back;

the computer clock times for the master and slave systems can be different. Therefore, it's necessary to work out the logic of the directory's transfer very carefully.

Accordingly, we chose an integer number for the versioning process, as a more reliable variant.

The second question: how to store several versions in a database with quick access and without data duplication? There are more variants:

Storing a full copy for every version, but this is redundant - two versions can differ only by one record;

Storing one full last copy and all changes from previous versions. This method saves memory, but complicates access to previous versions;

A hybrid method: storing several full copies and changes;

The method called "virtual versioning" described below.

Solution

A table with information about versions:

Column

Description

versionNumber

A number of a version, integer, auto-increment, no missingvalues.

useFrom

An optional field. Start date and time for the version.

status

An optional field. Status of the version: e.g. editing, approving, testing, distributing - any statuses for your business process.

...

User name, date-time of editing and etc.

A table with data:

Column

Description

ID

Unique identifier of a record, according to your business process.

...

Fields with data.

addInVersion

The number of the version, where the record was added.

deleteInVersion

The number of the version, where the record was deleted.

This structure can be used in both the master and the slave systems. Master and slave systems store a different number of directory versions. A record in a “Data” table belonged to several versions; that is why the method was called, "virtual versioning".

Filling directory

When using this method, you must create a new “version” in the table "Versions" if you want to edit the directory. After that you can edit your "Data” tables:

If you are adding a new record in a directory, you must set the field “addInVersion” as "current version number", and the field “deleteInVersion” as NULL;

If you are deleting a record from the directory, you must set the field “deleteInVersion” as "current version number". Don't delete the record from the table;

If you are editing a record, you must set the field “deleteInVersion” as "current version number" for the existing record, and create a new record with new values and the field “addInVersion = current version number" (i.e. the editing operation is replaced by two operations: deleting and adding in one transaction).

For example:

Name

Sex

addInVersion

deleteInVersion

Kate

female

1

NULL

Tom

male

1

3

Lisa

female

1

2

Tom

female

3

NULL

The record "Kate" was created in version 1 and it exists in all other versions.

The record "Lisa" was created in version 1 and it was deleted in version 2. (i.e. in version 2 this record doesn't exist anymore).

The record "Tom" was created in version 1 and then it was edited in version 3 (in ver. 3 it is "female" already).

One record in the “data” table belonged to several versions: from “addInVersion” (inclusively) to “deleteInVersion-1”. The unique identifier for every record in the data table is “ID + addInVersion” (composite key).

If a new version of a directory is created over a long period (of time) with a lot of editing of the same record, then the algorithm becomes as follows:

Create a new record in the table "Versions", “versionNumber = N”, “status = ‘Editing’”. Previous version must have this status: “status = distributing";

Edit “data” table (or tables) according to the following rules:

deleting records:

if “addInVersion” not equal N, then set for the record "deleteInVersion" as N;

if "addInVersion" equal N, then delete the record from the table;

undo-delete records:

if "deleteInVersion" equal N, then set "deleteInVersion" as NULL;

if "deleteInVersion" not equal N, then operation isn't possible, but you can create a new, similar record;

adding records:

add new record in the “data” table, set the fields “addInVersion” as N and "deleteInVersion" as NULL;

editing records:

if "addInVersion" equal N, then change the record's fields without changing "addInVersion" and "deleteInVersion";

if "addInVersion" not equal N, then set "deleteInVersion" as N, and create a new record with "addInVersion = N" and "deleteInVersion = NULL" (standard editing);

If editing is finished, set in the table "Versions" field status as "distributing";

Slave systems receive only versions with the status "distributing".

You can always safely remove the version with status "editing": simply delete all records with “addInVersion = N” and set “deleteInVersion” as “NULL” for all records, where “deleteInVersion” set as “N”. The drawback of this method is that the history of interim changes isn’t saved, but you can use an additional journal of user actions instead. The advantage of the method is that interim changes won't be delivered to the slave systems.

Several related tables

If your directory consists of several tables and you want shared versioning for them, add the fields “addInVersion” and “deleteInVersion” to all of the “data” tables and then designate a new table as the controlling "Versions" table. Example:

Table Users:

Name

SexID

addInVersion

deleteInVersion

Kate

1

1

NULL

Tom

2

1

3

Lisa

1

1

2

Tom

3

3

NULL

Table Sex:

SexID

Sex

addInVersion

deleteInVersion

1

female

1

NULL

2

male

1

NULL

3

other

3

NULL

In this example, two tables were edited in version “3”. Data consistency won't be broken on thea slave system side.

IMPORTANT: the field “SexID” in the table “Sex” isn't unique - it's not “Primary Key”. In this case the “Primary Key” is a composite: “SexID + addInVersion”. In the table “Users”, the field “SexID” isn't the “Foreign Key”, (i.e. DBMS doesn't control referential integrity for data with versioning). It's impermissible to use the composite foreign key “SexID + addInVersion” in the table “Users”: if a record in the table “Sex” is edited, the composite foreign key in the table “Users” will continue to relate to the record from an older / previous version. A surrogate key isn't the solution either. When creating new version, your software must control referential integrity.

The process of adding and editing records remains the same as described above: when editing link-fields, it's necessary to check that the existing related record contains “deleteInVersion = NULL” (i.e. confirm that it was not deleted from the directory earlier).

Deleting records: when a record is marked as deleted (i.e. setting “deleteInVersion” as "Version number"), it's necessary to check other records which could refer to this record. An example with an error:

Table Users:

Name

SexID

addInVersion

deleteInVersion

Kate

1

1

NULL

Tom

2

1

3

Lisa

1

1

2

Tom

3

3

NULL

Table Sex:

SexID

Sex

addInVersion

deleteInVersion

1

female

1

NULL

2

male

1

NULL

3

other

3

4

Version 4 was created; the record containing "other" was deleted. The DBMS won’t return an error because the field “SexID” continues to relate to “SexID=3”, but your program will crash.

An example without error:

Table Users:

Name

SexID

addInVersion

deleteInVersion

Kate

1

1

NULL

Tom

2

1

3

Lisa

1

1

2

Tom

3

3

4

Table Sex:

SexID

Sex

addInVersion

deleteInVersion

1

female

1

NULL

2

male

1

NULL

3

other

3

4

The record "Tom" was deleted together with the record "other". Optional method: the record "Tom" could be edited in version 4 by assigning a new value to the field “SexID”. But apparently Tom did not have time to decide. :)

If you have several independent tables in your project and you want to use versioning for them, you can add several tables with version numbers, each of which is related to one or several data tables.

There are pitfalls with related tables: We had some problems with related tables: for example, we decided to edit the table Sex, and this process takes a long time, as it requires coordination with many people. Create Version 4; set “Status” as "approving"; edit the table “Sex”; wait for completion. Suddenly comes a command to edit table “Users”. Version 4 was / has been created already; just make an edit in the table “Users” (“addInVersion = 4”). But slave systems can't receive these changes until full version 4 is approved, and that can take a very long time.

Solutions:

halt or accelerate workflow;

roll back all changes in the table “Sex”, make changes in table “Users”, close version 4 (“status = distributing”), create version 5 and make changes in table “Sex” again;

divide versioning for the tables, but the (there is a risk that the) directory could be damaged on the slave-server side.

Cleaning of Outdated Versions

Outdated versions can be quickly deleted from the master and slave databases.

Delete a record from the table "Versions", where the version number is minimal (for example, “M”);

Delete records from data tables, where “deleteInVersion = M”;

Change records in data tables, where “addInVersion = M”: set the field “addInVersion” as “NULL”, if you want to preserve referential integrity in the database. Another solution is to leave the records unchanged.

You can delete only the oldest version (i.e. it is strongly discouraged to delete version 2, if version 1 exists). The retention period of versions can be different for the master and slave systems. In our project the master system stored directories for about 5 years; the slave systems, for about a year. You can also store only the latest version on the slave system, if necessary.

Selecting Data from Directory

All requests are quite simple and don't require complex calculations. First, select the version number you want to work on from the table "Versions". It can depend on the current date and time, or it can be set to force, according to your specific business process.

Search records in a directory with version X:

SELECT Name, Sex FROM Users WHERE Name = 'Kate' AND (addInVersion <= X OR addInVersion is NULL) AND (deleteInVersion > X OR deleteInVersion is NULL)

Request all records from a directory with version X:

SELECT Name, Sex FROM Users WHERE (addInVersion <= X OR addInVersion is NULL) AND (deleteInVersion > X OR deleteInVersion is
NULL)

Request all records from a directory with version X or earlier:

SELECT * FROM Users WHERE deleteInVersion >= X OR deleteInVersion is NULL

You can use this request for a slave system initialization:

If you need to create a diff-file for versions between X and Y (Y = X + 1), then make two requests:

1) Request all records which must be added:

SELECT Name, Sex, addInVersion FROM Users WHERE addInVersion=Y

2) Request all records which must be marked as deleted:

SELECT * FROM Users WHERE deleteInVersion = Y

Now you can save this data in a file (for example, JSON).

Data Transfer from Master to Slave Systems

You can use any method for data transfer. We used the following:

After a new version was created, diff-file (JSON) began preparing and uploading on a network storage (web);

The slave system sends its own version to the master system;

The master system checks whether (or not) there is a new version for the slave system;

If a new version exists (one or several), the master system sends a link to the diff-file in the response;