Sunday, April 15, 2018

A New Understanding of Keys Part 1: Primary Key Formal Mandate and Pragmatic Selection

Note: This the first of three re-writes of older posts to bring them in line with McGoveran's formalization and interpretation[1] of Codd's true RDM. They are short extracts from a completely rewritten paper #4 in the PRACTICAL DATABASE FOUNDATIONS series[2] that provides a new perspective on relational keys, distinct from the conventional wisdom of the last five decades.

"The Internet is full of dogmatic commandments for choosing and using keys in relational databases. At times it verges on a holy war: should you use natural or artificial keys? Auto-incrementing integers, UUIDs? After wading through sixty-four articles, skimming sections in five books, and asking questions on IRC and StackOverflow I think I’ve put the pieces together and have a recommendation to harmonize the various camps. Many arguments about keys boil down to false dichotomies and failures to acknowledge other points of view."

--Joe Nelson, begriffs.com

As a relational feature, keys can only be properly understood within the formal foundation of the RDM, which is simple set theory (SST) expressible in first order predicate logic (FOPL), adapted and applied to database management. Yet that is precisely what is ignored and dismissed in the industry -- including by the authors of SQL. Dogma and holy war are products of ignorance. What Nelson did "piece together" from "sixty-four articles, five books and asking questions" is conventional wisdom, which cannot produce understanding because it has been off for decades.

I have been using the proceeds from
my monthly blog @AllAnalytics to maintain DBDebunk and keep it free.
Unfortunately, AllAnalytics has been discontinued. I appeal to my readers,
particularly regular ones: If you deem this site worthy of continuing, please
support its upkeep. A regular monthly contribution will ensure this unique
material unavailable anywhere else will continue to be free. A generous
reader has offered to match all contributions, so let's take advantage of his
generosity. Purchasing my papers and books will also help. Thank you.

To work around Blogger limitations, the labels are mostly
abbreviations or acronyms of the terms listed on the FUNDAMENTALS
page. For detailed instructions on how to understand and use the
labels in conjunction with the FUNDAMENTALS
page, see the ABOUT
page. The 2017 and 2016 posts, including earlier posts rewritten in 2017
are relabeled. As other older posts are rewritten, they will also be
relabeled, but in the meantime, use Blogger search for them.

Following the discontinuation of AllAnalytics, the links to my
columns there no longer work. I moved the 2017 columns to dbdebunk and,
time permitting, may gradually move all of them. Within the columns, only
the links to sources external to AllAnalytics work.

Primary Keys

Except for foreign keys (FK), in his original papers Codd introduced only one kind of relational keys -- primary keys (PK)[3]:

"Normally, one [attribute] (or combination of [attributes]) of a given relation has values which uniquely identify each element (n-tuple) of that relation. Such a[n attribute] (or combination) is called a primary key. In the example above, part number would be a primary key, while part color would not be."

(slightly edited to reflect current knowledge). Codd's writings are dense, and there are several important aspects in this paragraph, as well as in his other writings, that have been missed or misunderstood.

Here I will just point them out (for an in-depth discussion of keys see[2]).

Formal Mandate

"Are PKs mandatory or optional?" is one of the most frequent questions in database practice. Because Codd mandated a PK for every relation without giving any formal motivation, the focus was on "arbitrary" PK selection, which led Date to question the mandate: a PK is practically a good idea, but lacks a theoretical basis, so it's candidate keys (CK) that are essential[4], which has been triggering the question ever since.

PKs do have practical implications -- both users and the DBMS do a lot more work and use a lot more storage to translate among multiple CKs -- but they are mandatory for a theoretical reason: FOPL formally requires a PK, not just a bunch of CKs, for every relation. In fact, the RDM requires not only that one CK be designated PK, but places several formal requirements on it. For example, a careful reading of Codd's paragraph above, in conjunction with a later 1979 paper[5], reveals that FOPL also requires that PKs:

Represent names, not properties; and,

Are irreducible in current parlance.

Pragmatic Selection

The formal PK mandate requires that one name CK be selected PK, but the following Codd statement triggered uncertainty about PK necessity

"Whenever a relation has two or more non-redundant primary keys, one of them is arbitrarily selected and called THE primary key of that relation."

Poor use of language. By "arbitrary" Codd meant that when there are more than one CK, all of which satisfy the formal PK requirements, the choice is pragmatic (i.e., driven by two practical considerations outside relational theory, and sometimes databases-specific tradeoffs are necessary[3].

Note: "Non-redundant" is irreducible in current parlance.If there isn't any simple name CK, a surrogate key (SK) should be generated that

Has a 1:1 value relationship with some CK that represents a name, or set of properties used in the real world to identify entities[3]; and,

Is managed by the DBMS transparent to users;

Keys representing names used to identify entities in the real world are more familiar to users than SKs, SKs provide, which is why property CKs must also be enforced to ensure meaningful logical database access.

6 comments:

When you say: "If there isn't any simple name CK, a surrogate key (SK) should be generated that has a 1:1 value relationship with some CK that represents a name, or set of properties used in the real world to identify entities" - does that not contradict the requirement that PK represent names, and not properties? How does representing a set of properties satisfy the requirement that the PK"Represent names, not properties" and "irreducible"? If the set of properties is namable, then shouldn't they just be your PK?Another question on that is, if a relation is composed of unique tuples, how is it that the full tuple is not itself a CK?

It would seem to me that the reasons for using SK over a composite PK is implementation-oriented:- PKs composed of multiple fields that have varying degrees of distinctness can result in poor indexes and poor statistics- A surrogate key, even where there is a perfectly valid CKs available, can aid in composing queries without the user having to investigate what fields compose the PK- In the cases of exceptionally large systems, use of SKs can avoid hitting limits on maximum column counts, etc.

Are these valid reasons to favor using SKs over valid CKs in general? Are there other reasons that you might want to avoid using a CK in favor of an SK that are not related to lack of identifying (name) fields?

Thanks for your great work on all of these articles, I really appreciate it.

You need to read the paper to which the article refers -- there is a subtle theoretical reason for the requirement that PKs be names, not properties. But properties can be AKs. So if you don't have a simple name CK and you need to generate a SK, you need to ensure that it has 1:1 value relationship with a guaranteed identifier, which can be the AK. Otherwise the SK cannot ensure entity integrity.

That is the case in industry practice (see Part 3), but it should ABSOLUTELY NOT BE! Keys are an element of the logical model and not of physical implementation. You may have to bstardize your model because current systems cannot handle correct models, but you should be conscious of this and not blame the model for the poor implementations.

Pls note that generated SKs must be managed by the DBMS transparently to users. They are not the SKs that are generated with SQL systems.

I would phrase it otherwise. The 1:1 cardinality between ck and *some* unique attribute is what makes the primary key either a surrogate (computer generated) or a natural key. There is no such thing as a natural key. Any natural key is priorly a surrogate key.

You need to read the full paper. Natural key is an industry, not relational term.

FOPL requires a PK to be a name, not a property. If there is a simple CK, it is designated PK. If not, a SK is generated, BUT (1) it must be managed exclusively and transparently by the DBMS and (2) it must have a 1:1 with some non-simple CK.