Every table in a relational database requires a primary key - a unique identity for each occurrence (row). This primary key is obtained from one or more fields (columns) that exist within the table's structure, and it is the value contained within the primary key field(s) that is used to identify each individual occurrence.

For each primary key the database will build an index. This contains the primary key values and a pointer to the corresponding data record. When searching for entries via a primary key value it is quicker to scan the index than the main data table. The performance benefit from using an index is balanced by the amount of disk space required to hold it and the extra processing required to maintain it.

An important fact to note about primary keys is that in most databases once created the value cannot be changed. If a change is required the only option is to create a new occurrence with the new primary key, then delete the original occurrence with the old primary key.

I would think twice about giving an ordinary user the ability to change primary keys through an online transaction as this could lead to performance problems if it involved large volumes of data on related tables.

As an example of a primary key consider a table called COUNTRY which contains items of data such as NAME, CURRENCY and LANGUAGE. It would be possible to use NAME as the primary key, but it is usual to use a shortened code rather than a long textual description.

primary key

COUNTRY_ID

COUNTRY_NAME

CURRENCY

LANGUAGE

UK

United Kingdom

Sterling

English

USA

United States of America

US Dollar

English

FR

France

French Franc

French

For objects such as countries it is commonplace to use the internationally recognised standard country codes as the primary key.

A candidate key is a means of providing an additional unique key into a table. Candidate keys are entirely optional, so a table may contain none, one or several of them.

For example, in a payroll system the EMPLOYEE table may contain columns for PAYROLL_NO and NATIONAL_INSURANCE_NO, both of which are unique to each employee. If both of these were required as lookups into the table only one could be defined as the primary key, so the other would have to be defined as a candidate key.

Like primary keys each candidate key will require additional resources for indexing purposes.

The advantage of a candidate key over a primary key is that its value may be changed, although not to one that already exists.

Where two database entities are involved in a one-to-many relationship a column in the 'many/child' entity is linked to the primary key of the 'one/parent' entity. This column in the 'many/child' entity is referred to as a FOREIGN KEY, as shown is the following example:

primary key

foreign key

COMPANY_ID

COMPANY_NAME

COUNTRY_ID

MS

Microsoft

USA

IBM

International Business Machines

USA

ICL

International Computers Ltd

UK

A foreign key must exist between two tables before those tables can be the subjects of a JOIN in a SELECT statement.

You should not attempt to use a field as a foreign key where the corresponding field on the foreign table is not unique as this would violate the one-to-many principle. This could result in a child record having multiple parents. Although it is possible to link a foreign key to a candidate key this is not recommended as it is possible to change the value of a candidate key. As this does not automatically update the corresponding foreign key values in all child rows you would effectively break the relationship with those children.

This is a key for which the possible values have an obvious meaning to the user or the data. For example, a semantic primary key for a COUNTRY entity might contain the value 'USA' for the occurrence describing the United States of America.

This is a key for which the possible values have no obvious meaning to the user or the data. Technical primary keys let you enforce the uniqueness property within a table while making it easy to generate values invisibly and automatically. A typical method of generating values is to take the next number from a named sequence which is maintained by the database engine using code similar to the following:

SELECT <seq_name>.NEXTVAL FROM DUAL

In a database which does not have sequence numbers you could use something similar to the following:

SELECT max(id) FROM <tablename>

Remember that the result of this statement will get bigger and bigger as time goes on, so the field which holds the value must be large enough so as not to overflow in the foreseeable future. I have seen an interface definition of 'N10' used on more than one occasion for this purpose.

In some cases the columns within a table do not readily provide a unique value that can be used as a primary key. Take the example of a PERSON table which contains the following columns:

first_name, last_name, initials, sex, date_of_birth, address.

Which column, or combination of columns, could be guaranteed to provide a unique reference? Remember that a primary key, once assigned, cannot be changed. Once you exclude those columns that may be changed during a person's lifetime not much is left, certainly not enough to guarantee uniqueness.

In such a case as this the solution would be to create a new column (e.g. PERSON_ID) for use as either a semantic or technical key. If the value needed to be remembered for later use, such as a login id to the main computer system, then a semantic key may be more appropriate - a person is more liable to remember a short text string such as 'AJM' or 'TONY27' than a random 10 digit number.

It is possible to have a primary key that is comprised of more than 1 column. This is known as a COMPOUND or CONCATENATED key. If the number of columns required is too great, or the combined length is rather large, it is usual practice to insert a technical key. However, if you still need to create a compound candidate key to help prevent duplicate values then an additional technical key is superfluous.

Where a table has a semantic key it is not unknown for the user to want to change the value of that key. As explained previously this is just not possible with a primary key. Although the occurrence can be deleted and recreated with the new value this is no good if there are large numbers of occurrences on other tables which use the original value as a foreign key. The overhead of deleting and recreating all those occurrences could be enormous.

The solution here is to make the user-defined value into a candidate key and insert a new column for use as a technical primary key. Provided that all related occurrences are linked to the unchanging primary key then no relationship will be broken if the candidate key were to be changed.

There are some people who think that technical keys are such a good idea that they insist on putting them on every table without further thought and regardless of the consequences. Although there are situations where the use of technical keys can provide positive benefits, there are also situations where they may actually produce negligible or even negative results. The following sections identify several instances where I think that the inclusion of technical keys is ever-so-slightly questionable.

Where an entity already has a recognisable semantic key which is to be used as a foreign key in related entities there is no advantage in adding a technical key. Take the following example:

primary key

candidate key

COUNTRY_ID

COUNTRY_CODE

COUNTRY_NAME

1

UK

United Kingdom

2

USA

United States of America

3

FR

France

Where users are likely to be familiar with the semantic key values and have the ability to enter them directly they are unlikely to be as familiar with a random number. Everybody knows that a COUNTRY_CODE of 'USA' refers to the United States of America, but who can say the same for the number 2?

A relationship between two entities is defined within the application model as a one-to-many, where the number of occurrences of the 'many' entity may be zero, one or more. If a particular implementation requires that there never be more than one occurrence of the 'many' entity there is no apparent way that this restraint can be defined in the database schema. Take the following example:

'one' entity

primary key

PKEY_ONE

DATA

104

104 stuff

208

208 stuff

335

335 stuff

'many' entity

primary key

foreign key

PKEY_MANY

PKEY_ONE

DATA

27

104

more 104 stuff

35

208

more 208 stuff

56

335

more 335 stuff

This particular structure does not disallow the possibility of adding more than one occurrence of the 'many' entity for each occurrence of the 'one' entity. If software is released which does not enforce this non-obvious restriction the results can be, shall we say, interesting?

I have this (perverse?) expectation that if a database is designed in order to satisfy certain business rules, then it should be possible to reverse engineer the database design in order to recreate those rules. This is not possible with the above implementation. My preferred approach to this situation, one that I have used with complete success on more than one occasion, is to make the primary key of 'many' exactly the same as the primary key of 'one'. The primary key of 'many' can therefore serve a dual purpose as the foreign key back to 'one', as demonstrated in the following example:

'one' entity

primary key

PERSON_ID

DATA

104

104 stuff

208

208 stuff

335

335 stuff

'many' entity

primary key

foreign key

PERSON_ID

DATA

104

more 104 stuff

208

more 208 stuff

335

more 335 stuff

As the primary key and foreign key are now one and the same there is no need for a separate index on the foreign key. This saves both disk space and processing time.

This method makes it obvious to all concerned that only one occurrence of 'many' should exist for each occurrence of 'one', and indeed makes it impossible to do otherwise. However, it does upset some people because it violates their rule that each database table has its own number sequence for generating the primary key. As this rule does not exist in any database system that I have ever encountered I consider it to be artificial, and because it produces inferior results I choose to ignore it.

In this example the 'many' entity does not have a semantic key - the primary key is purely arbitrary as each occurrence is just one of many that may be owned by an occurrence of the 'one' entity. This is typically implemented using the following structure:

'many' entity

primary key

foreign key

PKEY_MANY

PKEY_ONE

DATA

27

104

stuff for 104,27

34

104

stuff for 104,34

35

208

stuff for 208,35

56

335

stuff for 335,56

Note the following disadvantages:

The foreign key will need to be indexed in order to efficiently retrieve those occurrences that belong to a particular value of PKEY_ONE.

There is no value in any field that could help to determine the identity of the previous or next occurrence for a particular value of PKEY_ONE.

There is no easy way to determine which is the first or last occurrence for a particular value of PKEY_ONE.

Now consider an alternative structure. Here the value for SEQ_NO is not unique for the table, but is unique for each value of PKEY_ONE. It is obtained by code similar to the following:

The size of SEQ_NO need not be as large as the technical key as it only covers a small number of occurrences, not the entire capacity of the table. Why use a 10 digit number when 4 will do?

SEQ_NO does not require a named sequence in the database, therefore you remove the possibility of releasing software from the development area to the live area and forgetting to create the named sequence on the live database.

As the foreign key forms the leading portion of the primary key it can share the same index, therefore there is no need to maintain an additional index for the foreign key.

The identity of the previous occurrence is always SEQ_NO of the current occurrence minus 1.

The identity of the next occurrence is always SEQ_NO of the current occurrence plus 1.

The identity of the first occurrence is always where SEQ_NO = 1.

The identity of the last occurrence is always where SEQ_NO = LAST_SEQ_NO on the 'one' entity.

An alternative way of finding out the current highest value for SEQ_NO would be to use a SELECT statement, as in the following example:

As the columns PKEY_ONE and SEQ_NO together form the primary key, this particular SELECT statement can be satisfied by a quick look at the index without the need to trawl through large numbers of data records.

In this example each occurrence of 'many' is only valid for a finite amount of time, therefore it must include a field to hold a date or time stamp. This is a structure that I have seen somebody else implement:

'many' entity

primary key

candidate key

foreign key

PKEY_MANY

PKEY_ONE

START_DATE

DATA

27

104

01-Jan-2000

stuff for 104

34

104

01-Nov-2000

stuff for 104

35

208

01-Jan-2000

stuff for 208

56

335

01-Jan-2000

stuff for 335

The candidate key is there (apparently) to ensure that an instance of PKEY_ONE does not have more than 1 occurrence for a particular date. However, the program still has to read in the occurrence with the highest date to ensure that the start date for a new occurrence is later, not earlier.

This structure allows entries to be created with dates in the future, therefore the procedure for identifying which occurrence is valid for a particular date is a little convoluted (and achieved with a database view):

Extract all occurrences except for those where START_DATE > $DATE.

Of the remainder extract the one with the highest value of START_DATE.

This structure allows START_DATE to be amended, but the following rules have to be obeyed:

New START_DATE must be greater than START_DATE of previous occurrence.

New START_DATE must be less than START_DATE of next occurrence.

Here is an alternative structure that I have implemented on numerous occasions:

Note that this will not work if undefined end dates are left as null. I always use a dummy date to simulate 'sometime in the future', as explained in Dealing with null End Dates.

This structure does require a little more effort to maintain, but I think the results are worth it. For example, when adding a new occurrence the following rules must be obeyed:

START_DATE must be greater than START_DATE of the previous occurrence.

END_DATE of the previous occurrence must be set to START_DATE minus 1 day.

When amending an existing occurrence the following rules apply:

The rules for START_DATE are the same as for new occurrences.

END_DATE must be less than END_DATE of the next occurrence.

START_DATE of the next occurrence must be set to END_DATE plus 1 day.

If you are being efficient and modern and using a 3 tier architecture then these rules can be defined in a single component in the business layer rather than being duplicated in numerous components in the presentation layer, but that's another story.

In this situation there are several groups of lookup codes, such as country codes, organisation types, address types, etc. Instead of having a separate database table for each group they are all lumped together on a single table with a group identifier to separate them. A big advantage of this structure is that new groups can be added very easily without having to amend the database structure. However, a disadvantage is that it makes the actual lookups a bit more complicated. Exactly how much more complicated depends on the inclusion of the all-pervasive technical primary key. Here is a structure that I saw somebody else design fairly recently:

This is a relationship between two entities (for example, X and Y), where more than one occurrence of Y can exist for each occurrence of X, and more than one occurrence of X can exist for each occurrence of Y. This can be represented in the following diagram:

Entity X

Entity Y

As it is not possible to define a many-to-many relationship directly the solution is to create an additional link or cross reference entity which can act as the 'many' entity in a pair of one-to-many relationships, as shown in the following diagram:

Entity X

LINK

Entity Y

This link entity requires as a minimum a foreign key for entity X and a foreign key for entity Y. If you insist on a technical primary key the resulting structure looks like this:

primary key

candidate key

foreign key

foreign key

PKEY_LINK

PKEY_X

PKEY_Y

1

X1

Y6

2

X9

Y42

3

X3

Y27

If there is not a unique key which combines the two foreign keys it will not be easy to check that a particular X+Y pair already exists, and without this check it will be possible to create duplicate entries. In the above structure this can only be implemented as a candidate key. If you think about it carefully you should see that all requirements could be satisfied by dropping the PKEY_LINK column and converting the candidate key into the primary key. This removes 1 redundant column and 1 redundant index, producing a simplified structure as follows:

This structure has the problems of a standard one-to-many relationship as highlighted previously, but is compounded by each additional level. Examine the following diagram:

Top

Middle

Bottom

Here there is a one-to-many relationship between 'top' and 'middle', and another one-to-many relationship between 'middle' and 'bottom'.

'top' entity

primary key

PKEY_TOP

DATA

104

104 data

208

208 data

335

335 data

'middle' entity

primary key

foreign key

PKEY_MIDDLE

PKEY_TOP

DATA

27

104

104,27 data

35

208

208,35 data

56

335

335,56 data

'bottom' entity

primary key

foreign key

PKEY_BOTTOM

PKEY_MIDDLE

DATA

1001

27

104,27,1001 data

1002

35

208,35,1002 data

1003

56

335,56,1003 data

With structure it will not be possible to satisfy the request 'select BOTTOM where PKEY_TOP = 104' without traversing through the MIDDLE entity. If there is more than one entity in the middle then this will add further steps in the journey. Now look at an alternative structure:

'middle' entity

primary key

foreign key

PKEY_TOP

MIDDLE_SEQ

DATA

104

1

104,1 data

104

2

104,2 data

208

1

208,1 data

56

1

335,1 data

'bottom' entity

primary key

foreign key

foreign key

PKEY_TOP

MIDDLE_SEQ

BOTTOM_SEQ

DATA

104

1

1

104,1,1 data

104

1

2

104,1,2 data

104

2

1

104,2,1 data

104

2

2

104,2,2 data

208

1

1

208,1,1 data

335

1

1

335,1,1 data

In this structure the request 'select BOTTOM where PKEY_TOP = 104' can be satisfied without the need to navigate through any intermediate tables, therefore would be quicker to process.

I once joined a project where some bright spark had the brilliant idea of giving every technical key the name 'ID'. When asked why he explained that when building the low-level component which actually retrieved records from the database it was not necessary to change the inherited code as the name of the primary key was always 'ID', and could therefore be hard coded.

This bright spark was obviously not an experienced programmer as he was unfamiliar with the various methods by which the names of the primary key field(s) can be obtained.

A problem begins to appear in relationships - if the primary key is called 'ID' then a foreign key cannot also be called 'ID', therefore something will have to change. The solution was to change the foreign key name to 'somethingelse_ID', where 'somethingelse' was an abbreviation of the entity name. Those of you with more than two brain cells to rub together will notice that this results in a field which is named one thing on one entity but something else on another. This is a practice that gives me the heebie-jeebies.

This becomes a problem when you attempt to use standard generic code to pass context from one component to another. A typical scenario that I use all the time is to have a component that lists multiple occurrences from table 'A', and within this component the user can select one particular occurrence and jump to another component which shows related occurrences from table 'B'. Using generic code it is a simple matter to extract the primary key of the occurrence selected in the first component and pass it as a string in the format "column='value'" to the second component which simply includes it as the WHERE clause in an sql SELECT statement. If the column names are different between the two tables then you have to interrupt the generic code with some sort of translation mechanism. If a table is involved in more than one relationship this will usually involve having to know where the "column='value'" string came from in order to carry out the correct translation. For example, if the string "id=209" can be passed from tables TOM, DICK and HARRY then you will need to know which one in order to translate "id" into "tom_id", "dick_id" or "harry_id". Just think of how much simpler it would be if the passed string could be used "as-is" without the need for any translation. It makes no difference to the database what the column names are called, but if program code has to be inserted to translate column names then you are introducing an area of delay, expense and potential error. If an idea involves cost but has no tangible benefit, then what is the point?

The convention that I have used for many years is to give each primary key field the name '<entname>_ID'. This means that when used as a foreign key the name does not have to change. Thus any field name that ends in '_ID' is a key field, and when appearing in the body of a field list it will not only be apparent that it is a foreign key, but will also identify the name of the entity to which it is related. Simple and effective. Regarding the above problem, if the field name does not have to be converted then it is possible to maximise the use of generic code without the need for constant translations.

The worst idea that I ever heard, though fortunately never saw implemented, was the idea that each technical key, instead of containing just a number, should also have a string prefix which identified the entity to which it related. I'm sure this sounds like a very clever idea, but what on earth is the benefit? I have always used techniques that have been simple and effective, so I do not see the point in switching to a new method that is less simple and less effective just because it is fashionable.

It has recently been suggested to me that my preference for prefixing "ID" with "<entityname>_" is not good practice as it results in redundancy. The argument is that in the statement "customer.customer_id" the word "customer" is used twice. This is "clearly redundant and is therefore not good practice".

In my opinion this argument is weak and short-sighted as it focuses on only one aspect of the usage of primary keys. In my many years of experience I have had to cater for other considerations. Some of these are:

I was brought up on the principle that if a field has a different context then it should have a different name. So if ID in one place means "id of customer" and ID in another place means "id of product" then it is incorrect to use the same name.

Some languages, when joining one table to another, will automatically include fields with the same name in that join. This will not produce the correct results in a system where unrelated fields have the same name.

Some RDBMS's (take MySQL for example) will allow you to specify a JOIN with "USING (fieldname)" instead of "ON (table1.fieldname=table2.fieldname)". This will not work if the same field exists with different names in different tables.

It also makes it difficult when passing a primary key in the format "fieldname=value" from one program to another. If the passed value is "ID=27" but the current table has several foreign keys which could qualify, then how do you know which one to use? This would require extra code to perform the translation between primary key name and foreign key name. My method avoids the need for any extra code.

There is no rule that says a primary key must always consist of a single field, therefore compound keys are permissible. Likewise there is no rule that says a primary key must always be a technical key, therefore semantic keys are also permissible. The trick is knowing when to use one form in preference to the other.

There are two ways of using technical primary keys - intelligently and indiscriminately.

Although the use of a technical primary key may avoid potential problems in some circumstances (refer to When using a technical key is a good idea) it is incorrect to assume that there will be benefits in all circumstances. Some people may argue that a single technical key is more efficient than a compound semantic key, but even if this were true (and I have yet to see any evidence in support of this claim) I would counter this by saying that a reduction in the number of additional candidate keys and indexes without a corresponding reduction in the accessibility of the data would actually produce an overall net saving.

If you create records with a technical primary but without checking for uniqueness with a semantic secondary key then it is possible to create records with duplicate data, and what effect would that have in your application? If you are able to check for uniqueness via a semantic secondary key then is the technical key redundant? Why have two unique keys if you can get by with one?

Some people say that you should use a numeric key instead of a character key as it makes lookups more efficient. What they fail to realise is that if you have to perform a lookup on the secondary key in order to obtain the primary key before you can perform lookups on other tables then any savings you make by having a numeric key have been flushed down the toilet.

It is my personal opinion that the indiscriminate use of technical keys shows a lack of thought and understanding, and leads me to wonder what other horrors are lurking in the database design. Putting a technical primary key on EVERY table by default may be fashionable, but it is not very intelligent. Where a technical key is employed unnecessarily valuable resources are actually being wasted:

The disk space used to hold the column data.

The disk space used to hold the index.

The processing overhead of maintaining the index.

The processing overhead of obtaining the next sequence number each time a new record is created.

Last but not least, the developer's time in creating and maintaining the program code.

If you have a database with hundreds of tables and millions of occurrences this could add up to a significant sum. Think about it. Look at the different ways in which an objective may be achieved, then examine the pros and cons of each one. Do not jump at a so-called 'solution' just because it solved a problem that appeared with a particular set of circumstances in the past. If those same circumstances do not exist in your current situation then the 'solution' may actually turn out to be more of a 'problem' instead.