Surrogate Key vs. Natural Key

Readers frequently ask me whether they should use a natural value or a surrogate identifier for a table's primary key. A surrogate key is an artificially produced value, most often a system-managed, incrementing counter whose values can range from 1 to n, where n represents a table's maximum number of rows. In SQL Server, you create a surrogate key by assigning an identity property to a column that has a number data type. A natural key is a naturally occurring descriptor of the data and one of a table's attributes that has no duplicate values. When you use a natural key as a table's primary key, each of the table's rows is uniquely identified. I addressed this concept in SQL by Design, "How to Choose a Primary Key," April 1999, but sometimes a topic is important enough to revisit.

Data modelers (for this discussion, I include anyone who has designed tables for a database) are divided on this question: Some modelers swear by the surrogate key; others would die before they used anything but a natural key. A search of the literature about data modeling and database design supports neither side except in the data warehouse arena, in which a surrogate key is the only choice for both dimension and fact tables. That lack of consensus leaves those of us in the transactional database world to wrestle with the question of which key makes a better primary key—a natural key or a surrogate key. The ultimate answer is a resounding "It depends." Whether you choose a surrogate key or a natural key depends on your data, your database platform, and the group that wields more power in your organization—the DBA group or the applications development team. Let's look at the way each of these factors can influence the choice of a primary key.

The Data

To help you decide whether a potential primary key is appropriate, I recommend that you ask the questions that point to a primary key's four criteria:

Is the primary key unique?

Does it apply to all rows?

Is it minimal?

Is it stable over time?

If the answer to any of these questions is no, your candidate fails to meet the four criteria and you need to look for an alternative. Each user table should have a designated primary key, which you create in a CREATE TABLE statement or in an ALTER TABLE statement. Although SQL Server assigns a row identifier (RID) to each record in a file, users and user programs can't use an RID as a uniqueidentifier, because you can't guarantee that an RID will remain constant over time. Therefore, an RID doesn't meet the criterion that a primary key's value must never change. An RID is composed of a file number, a page number, and a page's row number. As the position of each record shifts in the file, the record's associated RID changes. The primary key you choose must have a constant and unchanging set of values. When you assign a primary key value to a record, the value must not change for the life of that record and you can't reuse the value—even after the record is deleted from the table. In addition to these logical reasons for not using an RID as a primary key, you can't access it through any supported programming interface.

We know that a primary key needs to be a uniqueidentifier for each row in a table, but basically, the primary key has to apply to all rows. For example, designating the ISBN as the uniqueidentifier for a table of published titles wouldn't make sense if you're collecting both new and old books and the old books were published before the ISBN system was established. Making sure that the primary key applies to all rows becomes especially important when you use a concatenated primary key—a key that's composed of two or more columns.

Let's illustrate the need to make sure that the primary key applies to all rows by looking at a retail store example that uses the customer ID and date of purchase as a Purchase table's primary key. Figure 1 shows this primary key in an entity relationship diagram (ERD). As long as you remember to enter sales transactions during the day the sales were made, your database contains accurate information. But the first time you fail to enter transactions on the day sales were made (maybe you wanted to leave the store early on the weekend, thinking that you would catch up and enter the weekend sales on Monday), you have to guess about the date a sale occurred. Now you have almost-accurate data in your database. Although "close enough" is acceptable data integrity for some organizations, it clearly falls short for others. If, as a result of choosing your primary key unwisely, you discover that you're missing data for part of the primary key, you might have to choose one of the three following courses of action:

create dummy data

enter almost-accurate data

exclude the event from the database

Not one of these three options is a suitable alternative to storing accurate, timely data in your database.

The Database Platform

The kind of database platform you work from can help determine whether you need to use a natural value or a surrogate identifier for your primary key. The criterion that a primary key must be minimal means the fewer the columns, the better. A single-column primary key simplifies data storage, retrieval, and coding. SQL Server automatically creates a unique index (and, by default, a clustered index) on any column designated as a primary key. In Figure 1's Purchase table example, if the primary key were CustID plus PurchaseDate, SQL Server would create a 12-byte index key. If you created a surrogate key with an integer data type for the Purchase table's primary key, the index key would be one-third the size. Eight bytes here and 8 bytes there don't sound like much until you start calculating the additional I/O required for reading and writing hundreds of thousands of records. If the records are short, you can fit more records on a page that's being read into memory or written to disk. Here's a general rule: The more records that can fit on a page, the better the performance.

SQL Server 2000 and 7.0 provide additional support for a brief primary key in the way they build indexes. In SQL Server 6.5 and earlier, the nonclustered indexes use row pointers to point to the actual data record. So, the leaf level in a SQL Server 6.5 nonclustered index contains index keys and associated data-page or row IDs that point to the referenced data row in the user table. Each entry in a nonclustered index page is approximately the length of the nonclustered index key plus 6 bytes for the page or row pointer.

The nonclustered indexing model changed in SQL Server 7.0. Now, in SQL Server 2000 and 7.0, for each clustered table that also has nonclustered indexes, the nonclustered index key and the relevant clustered index key reside at the leaf level of the nonclustered index. As Figure 2 shows, the nonclustered index points to the clustered index. As a result, if the clustered index key is long, the nonclustered indexes in SQL Server 2000 and 7.0 can be very large, thereby increasing your database storage requirements and possibly decreasing database performance. This negative effect occurs if you decide to cluster on the primary key and you choose a long character or variable-length character field (or worse yet, a concatenation of several fields) for the primary key. In the SQL Server world, if you're planning to cluster by the primary key, you're better off with short, single-column primary keys. For more information about choosing a primary key's data type, see "Make It Short and Sweet."

The Applications Development Group

In organizations in which database development is driven by a group whose primary concern is application development, a natural key is considered the only key to use for a table's primary key. Natural keys are meaningful; hence, they're easy to remember and easy to write code around. This group is making one of the most common mistakes that data modelers make when selecting a primary key—confusing a search key with a primary key. The primary key's purpose is to be used internally for distinguishing one row from another. You can use the primary key as a search key, but using a natural key for searching is usually easier. For example, a Customer table's primary key might be called CustID, which is a 16-digit surrogate key. Although CustID uniquely identifies each row in the Customer table, you don't expect customers to be able to identify themselves by their customer ID. Instead, customers give you their name, phone number, and perhaps a residence or mailing address. This combination of attributes is a search key.

So what's the answer? Is the surrogate key the better choice for primary key? By definition, the surrogate key meets all the criteria for a good primary key. Can you ever use a natural key—or a derivative of the natural key—for a table's primary key? If so, under what circumstances can you use it? Figure 3 shows a scenario in which a natural key fits comfortably as a primary key. This scenario showcases a Purchase model that contains an interesting variation: The PRODUCT_CATEGORY table has a primary key called CategoryCode, which is a 3-byte character data type. PRODUCT_CATEGORY is a reference table (also called a lookup table or a list-of-values table). You use reference tables to modify and describe other database tables. You can enforce referential integrity between the reference table and the table it's modifying, and in so doing, you help maintain domain integrity.

Often, the reference table contains only one column that carries a descriptor value. In contrast, the PRODUCT_CATEGORY table consists of two columns. The CategoryLongName column contains the long name of the category, which is the descriptor value. The CategoryCode column is a derivative of the first column; CategoryCode contains a manufactured code that represents the corresponding long name. For instance, a category long name might be vegetables, but the category code would be VEG. DBAs or data stewards create these category codes. A data steward knows and understands the data stored in the database and constantly monitors the data integrity. Because each code is a derivative of its descriptor, the code inherits the definition and meaning of its associated descriptor value.

In the PRODUCT_CATEGORY table, CategoryLongName is the natural key and CategoryCode is a derivative of the category long name. Although you might hesitate to use the category long name as a primary key (and understandably so, because the category long name is a variable character data type with a maximum length of 50 characters), the category code would be an acceptable candidate for the job. It satisfies all the requirements for a primary key. Because the category code is the only candidate, it becomes the primary key.

Consider the criterion that a primary key value must never change. If one day you decided to differentiate between fresh and frozen vegetables, you could create two new codes that would distinguish fresh veggies, VFR, from frozen veggies, VFZ, while leaving the old code, VEG, intact. You wouldn't have to change any historical records because before you named the two new categories of vegetables, all veggies were classified the same. Fortunately, SQL Server enforces the criterion that a primary key must be a unique identifier for each row in a table by automatically creating a unique index on the primary key column. Without this support from SQL Server, you'd have to confirm in your programs that this code hadn't been used before. What about the criterion that a primary key has to apply to all rows in a table? In this case, because the columns in this table are a descriptor and a derivative of the descriptor, each natural key will always have a value. As for the criterion that a primary key must be minimal, the derivative, CategoryCode, is a character(3) data type. You can't get much more minimal than that.

Apply the Criteria

When people ask me which makes a better primary key—a natural value or a surrogate identifier—I answer that the choice depends on the situation. In all the modeling work that I've done, I've rarely been able to identify a suitable natural key as a candidate for the job. I've been able to identify natural search keys, both unique and nonunique. But when I apply the prerequisite tests—is it unique? does it apply to all rows? is it minimal? is it stable over time?—to these natural keys, most natural keys fail. My personal preference is to use a surrogate key unless I can identify an appropriate natural key that meets the four criteria for the primary key.