Breaking the Law: Primary Keys in SQL Server

Introduction

Throughout my travels in the IT industry, I have seen many database designs, and it never fails to amaze me how many systems simply ignore the fundamentals of good design. I am not talking about poor designs from an academic point of view; these sub-standard designs don't work because they are a performance, programming, scalability and maintenance nightmare. One of the problems I regularly see is an improper key design in a table. In this article, I will discuss each of the crimes perpetrated against proper key design and the law that was broken. Names will be changed to protect the innocent.

The First Law: The primary key is a unique identifier for each row in a table.

The Crime: Guilty by Absence

The worst key crime is not adding a key to a table at all. What is amazing is that this will sometimes actually work, but I can predict with almost 100% certainty that it will fail at some point. Although it may not be efficient, it is possible to fetch the specific record you are looking for without a key defined. The same cannot be said when trying to update a record. In the old days of OLE DB and ADO, when the database driver could not get any key information for the Recordset, it would try to create its own key. How did it do this? A WHERE clause was generated with a condition against every field in the Recordset, on the assumption that there were no duplicate records in the table. Basically, under this condition, you had no idea what data was affected. Perhaps there were valid duplicate records, so a single-row update could change more than one record. Maybe a value had changed since the last read, so the WHERE clause would filter out all records and nothing would be updated. This was the most inefficient, error-prone, bug-ridden method you could devise. Fortunately, this “feature” was fixed in ADO.NET: you cannot update data without a key defined.
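
As a minimal sketch (the Customer table and its columns are made up for illustration), defining a primary key gives every update a single, unambiguous target:

    CREATE TABLE Customer
    (
        CustomerID int IDENTITY(1,1) NOT NULL,
        FirstName  nvarchar(50) NOT NULL,
        LastName   nvarchar(50) NOT NULL,
        CONSTRAINT PK_Customer PRIMARY KEY (CustomerID)
    );

    -- Exactly one row is affected, no matter how many customers share the same name.
    UPDATE Customer
    SET    LastName = 'Smith'
    WHERE  CustomerID = 42;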

The Second Law: The primary key should be immutable.

The Crime: Guilty by Change

Here is one quick fix to the unique primary key problem that is not recommended. If you want to read or write a specific row, that row needs a unique identifier that guarantees you are pointing at the record you want. SQL Server has a handy datatype that seems to suit this purpose quite well. The TIMESTAMP datatype provides a unique, sequential number. This datatype does not really represent a date or time; it is simply an 8-byte number that gets larger as time goes on. Why not stick a TIMESTAMP field at the end of the table, define it as the primary key and be done with it? There are several serious problems with this approach.

The TIMESTAMP data type was designed to control concurrency on data reads and writes. Every time any field in the row is updated, this field increases to the current server timestamp value. If a table has a TIMESTAMP field, the ADO.NET driver can check this value to see if the underlying data has changed since it was last read. The database driver can then raise an error when an update is attempted, or you can use the value to determine whether data is stale without having to fetch all the record data. Unfortunately, this behavior is the last thing you want a primary key to exhibit.
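
A rough illustration (OrderDemo is a made-up table) of why this matters: the TIMESTAMP (rowversion) column silently changes on every update, which is exactly what a concurrency check wants and exactly what a key must never do:

    CREATE TABLE OrderDemo
    (
        OrderID int IDENTITY(1,1) PRIMARY KEY,
        Amount  money NOT NULL,
        RowVer  timestamp               -- rowversion; maintained automatically by SQL Server
    );

    INSERT INTO OrderDemo (Amount) VALUES (10.00);
    SELECT RowVer FROM OrderDemo WHERE OrderID = 1;   -- some value

    UPDATE OrderDemo SET Amount = 12.00 WHERE OrderID = 1;
    SELECT RowVer FROM OrderDemo WHERE OrderID = 1;   -- a new, larger value after the update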

If you are using SQL Server as a relational database (and I hope you are), then you will have relations defined with primary and foreign keys. You would probably also have cascading updates set against the key value of the parent table. In larger systems, it is not uncommon to have dozens of relationships defined. If a TIMESTAMP were used as a primary key, every update to any field would change the TIMESTAMP, and that change would cascade to every related table. The server may have to update many different tables and indexes to retain data integrity. This is a huge amount of extra work that is completely meaningless and unnecessary, and it is not difficult to see how this approach can bring a server to its knees under even moderate traffic.
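
To see the mechanics, here is a sketch with hypothetical table names: with ON UPDATE CASCADE in place, every change to the parent key value forces SQL Server to rewrite the matching rows in the child table (and a timestamp-style key would change on every single update):

    CREATE TABLE ParentDemo
    (
        ParentKey int NOT NULL PRIMARY KEY
    );

    CREATE TABLE ChildDemo
    (
        ChildID   int IDENTITY(1,1) PRIMARY KEY,
        ParentKey int NOT NULL
            REFERENCES ParentDemo (ParentKey) ON UPDATE CASCADE
    );

    -- Changing the parent key rewrites every matching ChildDemo row and its indexes.
    UPDATE ParentDemo SET ParentKey = 2 WHERE ParentKey = 1;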

Primary keys should not change once they are created; they should be immutable. This eliminates the cascading update problem and prevents a couple of others as well. By specifying a clustered index on the primary key, the records are physically stored in the order of the index, so retrieving and updating the data is much faster. If the key changes, the table becomes fragmented, slowing down performance. Using immutable keys is a huge consideration for high-performance databases.
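
For reference, a primary key constraint creates a clustered index by default, but you can state it explicitly (InvoiceDemo is a made-up table):

    CREATE TABLE InvoiceDemo
    (
        InvoiceID int IDENTITY(1,1) NOT NULL,
        Total     money NOT NULL,
        CONSTRAINT PK_InvoiceDemo PRIMARY KEY CLUSTERED (InvoiceID)   -- rows stored in key order
    );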

The Third Law: Keys are for the benefit of the computer.

The Crime: Guilty by Association

I understand that in the bad old days, bytes were expensive and people tried to get as much information as possible stored in as little space as possible (hence the year 2000 problem). Sometimes old habits are hard to break. We all know that all tables require keys, so why not use some of the data stored in the table as the key? People try to use a natural key based on data elements that are unique within the data. I have never seen this work properly.

Determining a unique key is difficult. At one point in time, it may appear that you have a good candidate for a key; later, as the system grows, you may realize that data changes have caused your key to no longer be unique. Since the key is based on real data, it will most likely change over time, causing the previously mentioned problems of a key that is not immutable. Finding a single field value that is unique can also be a problem, so multicolumn keys are often used. These are a real pain to work with, since you have to provide all the values to retrieve a record, and every join to a related table has to match on every column of the key.
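
A quick hypothetical illustration (OrderHeader and OrderDetail are invented names): with a three-column natural key, every join has to repeat all three columns, whereas a surrogate key joins on one.

    SELECT d.*
    FROM   OrderHeader h
    JOIN   OrderDetail d
           ON  d.BranchNo  = h.BranchNo
           AND d.TransNo   = h.TransNo
           AND d.TransYear = h.TransYear;   -- three comparisons just to find the related rows

    -- With a single surrogate key the same join collapses to:
    --     ON d.OrderID = h.OrderID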

The argument I hear for using natural keys is that they are easier to read and have some meaning to them. This is precisely the reason natural keys shouldn't be used. I have seen all sorts of wacky schemes to try and generate keys. A typical scheme to generate a key value might be something like:

023409808

where 0234 is the branch number, 098 is a sequential transaction number and 08 is the year. The problems with this are obvious. What if the branch or year changes? What if the company grows and there are more than 9,999 branches? What if two transactions are inserted at the same time and they both generate the same transaction number? Is this stored as a numeric value or a character datatype? This type of scheme is just asking for trouble. If you are ever tempted to use such a method, smack yourself in the head and quickly forget it. If you need some kind of unique, system-generated reference number, put it in a different column, but don't use it as your key. You will save yourself huge amounts of grief now and in the future.

Keys are not meant to be readable by humans; they are for the benefit of the computer and the developer. If you want to provide human-readable data, create views showing the information you want, but don't store the data that way.
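
As a rough sketch of that advice (TransactionDemo and its columns are invented for illustration), the human-friendly reference number lives in its own column with a unique index, while the surrogate key does the real work:

    CREATE TABLE TransactionDemo
    (
        TransactionID   int IDENTITY(1,1) PRIMARY KEY,   -- the key, for the computer
        ReferenceNumber varchar(9) NOT NULL,             -- '023409808'-style value, for people only
        Amount          money NOT NULL
    );

    CREATE UNIQUE INDEX IX_TransactionDemo_Reference
        ON TransactionDemo (ReferenceNumber);            -- still searchable, just not the key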

An interesting side note: this scheme can support a maximum of 9,989,001 transactions per year for 99 years, for a total of 988,911,099 transactions. An arbitrary int identity key can support a total of 2,147,483,647 transactions. You can double this with the clever trick of starting with an identity seed value of -2,147,483,648 and incrementing by one each time. The computer doesn't care if the key is negative; it's all just numbers. Assuming you would have stored the natural key scheme in an int field, using an arbitrary key allows you to have over four times the transactions in the same storage space.
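
The seed trick above looks like this in practice (BigVolumeDemo is a made-up table):

    CREATE TABLE BigVolumeDemo
    (
        RowID   int IDENTITY(-2147483648, 1) PRIMARY KEY,  -- starts at the bottom of the int range
        Payload varchar(100) NOT NULL
    );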

The Fourth Law: Create efficient numeric keys.

The Crime: Guilty by Excess

Understand your data and make your keys the correct size. If the goofy example above were used as a key stored in a character datatype (because you want the leading zero to show up), it would consume 18 bytes in an nvarchar field. If you were half careful, it could be stored in a 9-byte varchar field. An int datatype could store it in 4 bytes quite easily. The int datatype is the best-performing type in SQL Server, so it is an excellent choice for a key column. The integer variants bigint and tinyint are also good choices; choose the size that best meets your data needs. Since we are no longer using a coding scheme or actual data to derive a key value, let the computer do all the work and declare the key as an identity column that automatically increments to create a unique, sequential value. SQL Server worries about ensuring unique values so you don't have to. Avoid the GUID (uniqueidentifier) datatype, since it is difficult to work with and is only necessary under certain replication scenarios.
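
A small sketch of right-sizing, using invented tables: a reference table that will never hold more than a couple of hundred rows can use a 1-byte tinyint key, while a busy transaction table gets the 4-byte int:

    CREATE TABLE CountryDemo
    (
        CountryID tinyint IDENTITY(1,1) PRIMARY KEY,   -- 1 byte, up to 255 rows
        Name      nvarchar(60) NOT NULL
    );

    CREATE TABLE SaleDemo
    (
        SaleID    int IDENTITY(1,1) PRIMARY KEY,       -- 4 bytes, up to ~2.1 billion rows
        CountryID tinyint NOT NULL REFERENCES CountryDemo (CountryID),
        Total     money NOT NULL
    );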

The Laws of Key Design Summary

Always add a primary key to all tables: A table without a key is not a table.

Always use immutable keys: Set it once, don't change it.

Let SQL Server do the work: Use identity columns.

Size the integer key datatype to the data: Know your data, size it to fit.

I agree with almost all the points you've raised as I too am amazed at the 'quality' of some databases. However, I'd like to add a few things that might put a different perspective on keys.

I love natural keys (business keys), but only when they get to the user. That is, I 'expose' natural keys while retaining my internal system keys (as you correctly point out, they are for the system only!). This allows efficient joins on ints but also provides the benefits of natural keys to the user. A user will want to search on that funny ID you suggested because it has 'meaning', not some arbitrary number. By having two keys (btw, the natural keys are created as unique indexes) you can do a fast lookup of the main table and join to others via the internal system key.
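
A sketch of that two-key approach (Product and its columns are made-up names): the int surrogate drives all joins, the business key is what users see and search on, and a unique index keeps it honest.

    CREATE TABLE Product
    (
        ProductID   int IDENTITY(1,1) NOT NULL,     -- internal system key, used for all joins
        ProductCode varchar(20) NOT NULL,           -- natural/business key exposed to users
        Description nvarchar(100) NOT NULL,
        CONSTRAINT PK_Product PRIMARY KEY (ProductID)
    );

    CREATE UNIQUE INDEX IX_Product_Code ON Product (ProductCode);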

But wait, you want more???? To extend this discussion, there are also times when a natural key isn't available (i.e. one that makes sense to a business person), but you still need to 'expose' a key. In this case I'd suggest a GUID, but again it's a surrogate key, just like the natural keys, which eliminates the issue of performance on inserts. If you do use GUIDs, it's best to try a COMB GUID, but you can't always rely on your clients using one of those. This technique will avoid the issues of the GUID while retaining all its benefits.
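
Roughly, the idea looks like this (DocumentDemo is an invented table; the comment above suggests COMB GUIDs generated in client code, and SQL Server's NEWSEQUENTIALID() default is used here only as a server-side stand-in for a sequential GUID): the int identity stays the primary key, and the GUID is just the exposed surrogate.

    CREATE TABLE DocumentDemo
    (
        DocumentID int IDENTITY(1,1) NOT NULL,                               -- real primary key
        PublicKey  uniqueidentifier NOT NULL
            CONSTRAINT DF_DocumentDemo_PublicKey DEFAULT NEWSEQUENTIALID(),  -- exposed surrogate
        Title      nvarchar(200) NOT NULL,
        CONSTRAINT PK_DocumentDemo PRIMARY KEY (DocumentID)
    );

    CREATE UNIQUE INDEX IX_DocumentDemo_PublicKey ON DocumentDemo (PublicKey);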

So next time you design a table, don't just think of one key, because sometimes you might have three!! (in a logical sense, you obviously only have a single system primary key, which should always be an int, for so many reasons)

A machine can compare integers with one instruction; it takes a lot more to compare GUIDs, and a lot more again for strings.

For this reason it is always effective to add a new surrogate PK to the DB design, even if there is a natural key. Set a unique (no duplicates) index on the natural key.

Moving strings out of key fields into a helper key table can improve performance a lot (I have seen it go from minutes to seconds!).

In the same way, PKs with multiple fields can be replaced by an identity field, with the former key fields only indexed. A dependent table then inherits this single new PK field instead of a set of fields: less space and more speed!
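
A hypothetical sketch of both suggestions (all names invented): the long string moves into a helper key table, and the old multi-field key becomes a plain indexed column set behind a single identity key.

    CREATE TABLE FilePathKey                              -- helper table holding the string values
    (
        FilePathID int IDENTITY(1,1) PRIMARY KEY,
        FilePath   nvarchar(400) NOT NULL UNIQUE
    );

    CREATE TABLE FileEvent
    (
        FileEventID int IDENTITY(1,1) PRIMARY KEY,        -- replaces the old multi-field key
        FilePathID  int NOT NULL REFERENCES FilePathKey (FilePathID),
        EventDate   datetime NOT NULL
    );

    CREATE INDEX IX_FileEvent_Path_Date
        ON FileEvent (FilePathID, EventDate);             -- the former key fields, indexed only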

I totally agree. I follow all these rules and thankfully haven't had a scenario where I've had to use a GUID (yet).

But unfortunately I've inherited systems that have a natural key, where the key field is a business field. This makes things difficult when you suddenly have the requirement for the user to be able to change this key and there are many tables which refer to this key.

I'll have to deal with this soon, and I'm hoping I will be able to use cascading foreign key constraints on each dependent table to automatically change their keys; hopefully it won't be too painful an exercise.

I think DBAs and database people prefer natural keys because they fit more with relational design 'theory'. Developers prefer surrogate keys because they are much, much easier to develop against. I'm a developer, so I never use natural keys.

I disagree with your assertions on GUIDs as keys. The overhead is minimal and they are not much more difficult to use than simple int keys. But they have the major advantage of being very easy to generate on either the client or the server, before or after the record is inserted. And sequential GUIDs mean your clustered indexes won't be stuffed (as much).

Good points. Your comment on GUIDs got me thinking. If you are going to generate the key on the client side, GUIDs are a great answer. Of course, if you do this, it cannot be a sequential GUID, so you lose those benefits (smaller indexes).

I prefer to keep all the key logic on the server and return the key value back to the client on insert if required. The server is the keeper of the keys; it should be the generator of the keys as well.

The overhead of GUIDs over int or even bigint can be substantial. Here is a great article about it.

Using GUIDs as the primary key causes overhead when the index is clustered; an int will naturally be added to the end of the index, but a GUID could land anywhere, requiring some major reorganisation when the row is stored.

Natural keys - yes, I know they can cause problems. However, because SQL Server optimises queries, if the desired information can be determined from the columns in the index, it doesn't have to retrieve the row itself, and this can give a boost in performance.
Graham
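
A small sketch of the covering-index point above, reusing the hypothetical Customer table from earlier: when the index carries every column the query needs, SQL Server can answer from the index without touching the base table.

    CREATE INDEX IX_Customer_LastName
        ON Customer (LastName)
        INCLUDE (FirstName);              -- the query below is 'covered' by this index

    SELECT FirstName, LastName
    FROM   Customer
    WHERE  LastName = 'Smith';            -- satisfied entirely from IX_Customer_LastName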

Thanks for sharing your thoughts and guidelines. However, I'm wondering whether you could elaborate more on the reasons for avoiding GUID keys. I understand they are larger, so they will increase the index key size, but I would have thought they are almost essential if you want to provide disconnected access to your DB and merge the data afterwards. Would it not be beneficial to design the DB using them from the start, instead of finding out later down the line that you need to build a disconnected application and then add them in? I'd like to know your thoughts on this. Currently I use int keys, but I am considering switching to GUID keys.

Also, I've read in some SQL magazines that one approach to handling primary keys is to use your own counter table if you need to avoid the gaps in the numbering sequence that can sometimes be caused by identity columns after rollbacks. What is your opinion of this method? Would you say it's a big no-no, or acceptable if requirements call for it?

If you need GUIDs, use them, but only if you need to consolidate inserted records from different databases. If you are taking a snapshot of data and merging it later into the same database, int datatypes will still work. See my comments in the previous post about GUIDs.

Gaps in numbering....
This is a user problem. Accounting types freak out when there are holes in the numbering scheme. The best answer is to tell them to get over it: the key is just a number that is bigger than the one before, and it is not guaranteed to be sequential.

Of course, we can't always do that. The root problem here is that the key value is being used for something it shouldn't be used for: a man-friendly reference number. Create another field to store that number and use a counter table to increment it if you want. This will give you much more flexibility in generating and displaying the "reference number" than the int key field would. Don't mix up the requirements for computer-friendly and man-friendly fields.
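
A rough sketch of that counter-table idea (RefCounter and its columns are invented names): the identity key stays the primary key, and a separate counter hands out the gap-free, human-facing reference number.

    CREATE TABLE RefCounter
    (
        CounterName varchar(30) NOT NULL PRIMARY KEY,
        LastValue   int NOT NULL
    );

    INSERT INTO RefCounter (CounterName, LastValue) VALUES ('Invoice', 0);

    DECLARE @RefNumber int;

    UPDATE RefCounter
    SET    @RefNumber = LastValue = LastValue + 1      -- increment and read back in one atomic statement
    WHERE  CounterName = 'Invoice';

    -- @RefNumber now holds the next gap-free reference number, ready to be stored
    -- in its own (non-key) column.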

Nice article. The content was evident to me because I already work with this quite a lot, but it is definitely a good summary, and excellent for starters to get the explanations of why they should follow these rules.