Tuesday, April 19, 2016

First Normal Form in Theory and Practice Part 3

Note: This
is a 11/23/17 revision of Part 3 of a three-part series that replaced all
of my previous posts on the subject (pages of which redirect here), in
order to further tighten integration with the formalization and
interpretation [1] of McGoveran's formalization and interpretation [1] of Codd's true RDM.

Database Design Consistent with Intended Data Use

The Information Principle (IP) -- Rule 0 of the RDM [2] -- mandates all information in a relational database be represented in relations, explicitly and in exactly one way -- as values of attributes defined on domains. Database relations must be in 5NF -- defined in terms of attribute dependencies [3] -- which requires that they are in 1NF -- defined in terms of simple domains and value atomicity [4]. 1NF ensures that FOPL is sufficient for data sub-languages and they are declarative and decidable [5] and physical independence (PI) [6], 5NF guarantees logical validity, logical independence (LI) and semantic correctness [7].

No 1NF means no 5NF and no relations. Some information -- namely, attribute components meaningful to users -- is represented implicitly and, thus, hidden from the DBMS, in violation of the IP and the RDM. Application access to that information requires a SOPL based data language, with loss of the relational advantages.

One implication is that database designers should make sure that all object properties of interest to users are represented by attributes defined on simple domains and are not implicit in components thereof. Another implication is that it is impossible to ascertain whether a table is a R-table (i.e., if it displays a relation -- which must be in 1NF) by its sheer inspection. Going back to the table in Part 1,

(1) it has uniquely named columns and assuming that (2) ID represents the PK (3) no information is encoded in the order of rows and columns and (4) No values are missing, it is a R-table if and only if (5) NAME and ADDRESS are defined on simple domains and their components are not meaningful to users.

are often given as examples of non-R-tables on 1NF grounds. But while the PHONE1, PHONE2 design is poor, depending on how the domains of NAME and ADDRESS are defined, it may not violate 1NF. If they are both simple and have no components meaningful to users, the relation is in 1NF, otherwise it isn't. It's actually the missing values (and possibly the lack of a PK) that disqualify the table. Although Codd supported missing value "marks" -- not to be confused with SQL NULLs -- that is now considered a mistake (for a relational solution to missing data, see [8]).

Enforcing 1NF

1NF has very limited meaning outside true RDBMSs. In SQL -- erroneously accepted in the industry as a relational data sub-language (it is neither relational, nor just a data sub-language), 1NF is a heuristic rule that is easy to treat inconsistently.

Note: Primitive domains represent in the database abstract properties that are not observed directly, but are inferred from observed object properties. For example, people have names and addresses, from which a primitive domain VARCHAR(n), or more structured versions can be inferred, where the maximum n is vendor-defined. Second, there is no CREATE DOMAIN SIMPLE option in SQL. Moreover, if the designer defines, for example, an attribute on a SDT, SQL has built-in functions that can be used to subvert atomicity (e.g., YEAR() or MONTH()applicable to the primitive SDT DATE).

Note: According to McGoveran, such functions have two possible uses: (1) as a shorthand for a disjunction of values and (2) extraction of internal components to treat them as attributes in their own right.Both are useful, but we are sloppy about distinguishing between them. If (1) is intended, then it is consistent with 1NF.

"Consider the restriction name LIKE "Jo%". If, for example, the permissible values of the NAME domain that begin with Jo are John Smith and Joseph Jones, this restriction can be understood as just shorthand for name = "John Smith" OR name = "Joseph Jones". This interpretation is reasonable even when we don't explicitly know all the permissible values of the NAME domain. If (2) is intended, expressions get more verbose. For example, consider the comparison name = "John Smith". If both first and last name are independently meaningful and 1NF is enforced, we have to convert NAME to (FIRST, LAST) and the comparison becomes first = "John" AND last = "Smith".

When 1NF is not enforced, if a FOPL-based data sub-language allows application access to components, it is tantamount to design changes that can have broad and unintended effects. If, for example, NAME should suddenly be treated in some relation as (FIRST LAST), NAME should be split into two attributes. But What does that mean for the functional dependency (FD) between the primary key (PK) and NAME?What if it turns out that the FD was really LAST and the PK?What if NAME were referenced by a foreign key (FK) in some relation representing employee children?" --David McGoveran

Single-valued Attributes?

C. J. Date argues that value atomicity is not precisely definable and that a relation is in 1NF if it has single-valued attributes (SVA) (i.e., every attribute has a single domain value in every tuple). Thus, the table,

is purportedly not a R-table by Date's definition because PHONE is defined on a multi-valued attribute. To be a R-table, it would have to be defined on a SVA -- a simple RVD, with sets as values treated as atomic (e.g., {303-456-9933,303-456-9944}) by the data language.

Note: In which case PHONE values would not be union-compatible and the relation would not be amenable to normalization.

But "single-valuedness" is not an inherent property of the data, as is implicit in Date's position. Regardless of the corresponding domain's data type defined by the designer (e.g., RVD values that are sets of phone numbers, or a VARCHAR(n) domain's values that are concatenations of phone numbers, or values that are commalists of phone numbers), whether or not PHONE is a SVA is a function of whether it is consistent with its intended use (i.e., whether it has meaningful components) and if it does, it is not single-valued. That's how Codd's 1NF is defined.