SQL SERVER – A Common Design Problem – Should the Primary Key Always be a Clustered Index

In SQL Server, whenever we create any key, a Primary Key automatically creates clustered index on the same. I like this feature and I use this feature every now and then.

The question is does the change of any column as Primary Key should also create a Clustered Index? Moreover, is there any case, where one would not do the same?

One of the recent conversations I had with one SQL Expert is with regard to the SSN number. The discussion was that SSN numbers are always unique and never repeated and hence are the best candidates for primary key. Additionally SSN numbers contains dashes (-), which make the datatype of the SSN numbers as String (VARCHAR or NVARCHAR). A clustered index on an integer usually performs better over a clustered index on an integer and makes the DBA to choose Primary Key of datatype Integer. At one point in our conversation, we discussed that if SSN number should be a Unique Constraint and if there should be another Identity Column as the Primary Key.

Some of the questions from our interesting discussion are as follows:

Would you have the datatype of your Primary Key as string?
Would you treat SSN as string datatype or remove the dashes and change it into an integer?
Do you have a real life example, where your primary key is not a clustered index?
What are the best practices for SSN number to store in database and obtain optimal performance?

I will post an interesting answer discussing this subject in a separate post with due credit.

22 thoughts on “SQL SERVER – A Common Design Problem – Should the Primary Key Always be a Clustered Index”

In the USA, SSN’s are composed of three parts. The first three digits are the Area Number, the next two digits are the Group Number, and the final four digits are the Serial Number. (http://www.ssa.gov/history/ssn/geocard.html) I would store a SSN as a set of smallint, tinyint, smallint columns. Doing so would provide a natural constraint to numbers which a varchar would lack. (I would add further declarative constraints as BETWEEN 0 AND 999 and BETWEEN 0 AND 99 and BETWEEN 0 AND 9999.) Now you have a five byte key with good contraints. If most queries are based primarily on SSN, you could make them the compound PK.

When the data schema for the system I am working on today was incepted 9 years ago, all of the primary keys (which were GUIDs) were assigned as clustered (by default).

As the databases grew in size, due to fragmenting indexes, it became more apparant that this was a practice that was overused in many areas. Especially in tables where there is a large amount of insert/deletes.

We have since gone back and re-evaluated some of the key tables to correct for this.

One example would be in the invoicing module, where an invoice may have 1-30,000+ invoice details, the clustered index was moved to the foriegn key invoice relationship from the invoice detail primary key. This drastically improved both load and write performance.

Pinal,
I would not recommend using the SSN for a primary key. SSN’s are difficult to work with partly because of there structure but more importantly because they are by nature Personal Identifying Information (PII) which must be protected. Any time SSN’s are stored they should be encrypted until such time as the information is required and even then the transmission should be encrypted. If you convert the SSN to an integer then when you require the information you will have to restore the data structure, including adding leading zeros, or store the information twice. I would give each person a unique identifier and store the SSN, encrypted, as part of the person’s demographic data along with DOB and other information that typically does not change. I work in the health caer industry and protecting PII is a constant concern.

From the linked document “This is an archival or historical document and may not reflect current policies or procedures”

@Pinal

Should SSN be the PK? That matters. If the purpose of the TABLE is to list SSNs, such as the agency itself may do, SSN should indeed be the PK, because each number is listed only once and it is the point of the TABLE.

If, however, the purpose of the TABLE is customers at a Tax preparation business, the point of the TABLE is the User and the SSN is an attribute. Which makes it a bad candidate for the PK, as the attribute may change.

Also, SSNs are reused over time. So, if this is an SSN TABLE that by definition lists each SSN once (as opposed to an SSN TABLE that lists all editions) SSN will be unique, because the person is the attribute, and that is what changes. But, if a firm has historical information on tax preparation they may well have the same SSN for two different people if it is kept long enough.

There are many problems with using SSN as primary key as Jay pointed out. It is sensitive data, and the idea that it is unique is well..how things have to be. There are duplicate SSNs for various reasons. Generally speaking any information supplied by outside sources including State agency is a poor candidate for a primary key. That said primary key and clustered index are conceptually two different things. A Primary key is a design concern, how the data is identified in a table. A clustered index is a performance concern and whatever key makes the table perform best is an ideal clustered index.

From a conceptual POV I tend to view clusterd indexes as a physical order of information on disk. Even when this isn’t really exactly true the end result is a good enough approximation for reasoning about design.

If the structure of the primary key diverges from the usage pattern of the table its a safe bet you may want to *think* about making them different.

For example a queue table where you are always either adding records to the bottom or pulling from the top. Assume the tables primary key effectivly contains random values.

In this case if you make the primary key a clustered index it will fragment the index and reduce throughput.

If instead a sequential identifier was used as the clustered index there would be much less fragmentation and higher throughput. More importantly when bulk push/pulling records to and from the table the clustered index will be used directly to resolve the search range leading to less “exploring” in terms of locks and non-sequential I/O.

SQLServer resolves update operations to the clustered index so one should generally not take lightly the idea of separating the primary key from the clustered index! If your updating rows much more often than adding or removing records theres a good chance thinking about making the primary key different will not pay off.

Would you treat SSN as string datatype or remove the dashes and change it into an integer?
NAV: I would say it depends on the use (going by your famous “It depends” clause :) ). If the purpose is simply to give a unique number, then we should convert to an integer.

Do you have a real life example, where your primary key is not a clustered index?
NAV: No.

What are the best practices for SSN number to store in database and obtain optimal performance?
NAV: I would say, when storing SSN, split it up into three parts – 3:3:4.These 3 columns can then be made a composite primary key.

I’m little late to the subject but wanted to point out couple of real problems in using SSN as PK.

First. In Finland there is a rule how you can distinguish man from woman when lookin at SSN. Now there’s a possibility that someone goes and changes his/her sex so the SSN changes too. So that would mean that you’d have to change PK field in DB and … well, I don’t want to even start thinking about that.

Second. SSN goes in the form of 010101-XXXY. That means that person is born January the first 1901. X is a latter or digit which forms unique combination with the birthdate. Y is a hash of it all. Now if person is born in 2001 SSN goes 010101A1234. Or if person is born in 1981 it goes 010101+1234. Because of this you can’t convert SSN into integer or you would have to figure out some kind of mapping so that 0=+, 1=1, 2=A and so on.

To put it in short: Never use SSN as PK. Or any other information that has the slightest possibility that it would change.

But I’d like to read more about nonclustered PK. I came up with the following example but I’m not sure if this is sound… I like to keep PK fields hidden as much as possible so that only the lowest level of application knows about it and never show it to the user. This is why I usually add fields like OrderNumber which is then displayed in the UI.

Datatype would be : Varchar(11), but you can have a check constraint that says first 3 Characters should be Numbers and 4th character should be hipen and next two letters should be Numbers and then followed by a hipen and then next 4 characters should be Numbers.

Check out Books Online, under topic Check Constraints, I believe Books Online has one of these examples.

Hello guys. Great post. Please, let me point out that in the early 1930’s in US the SSN number was shared between couples. Some of these people are still alive and these marriages still share their SSN. That’s why the US Health Department created the HCFA Number which is the SSN with a suffix that distinguishes between husband and wife. Not that it was the primary key, but it was supposed to be unique, I you had worked with elder Healthcare systems you know what I’m talking about,

Q: Would you have the datatype of your Primary Key as string?
A:No. SQL server create clustured index by default on primary key if we not specify externally. and data in table physically stored based on clustured index so its not good to have primary key on ssn (string data type) because record in table is sort according to primary key (clustured Index).

Q:Would you treat SSN as string datatype or remove the dashes and change it into an integer?
A: Its also not a good idea to do this. we should specify non clustured unique index on it. Its enough no need to convert it into integer.

Q: Do you have a real life example, where your primary key is not a clustered index?
A:no. we specify our primary key as identity(1,1) and it create clustured index by default.

Q: What are the best practices for SSN number to store in database and obtain optimal performance?
A: for SSN it should be non clustured unique Index.

Very interesting blog as always. I don’t see a problem using a SSN as a PK for United States SSNs, but casting them as int or tinyint might cause a problem. Valid US SSNs can begin with a zero. While this would not present a problem in the table itself, it may present problems in ETL operations with other tables where leading zeros (or dashes between the 3rd and 4th and 5th and 6th character) are preserved. Best to specify SSN as a non clustered unique index in my opinion.

Community Initiatives

About Pinal Dave

Pinal Dave is a Pluralsight Developer Evangelist. He has authored 11 SQL Server database books, 17 Pluralsight courses and have written over 3200 articles on the database technology on his blog at a http://blog.sqlauthority.com. Along with 11+ years of hands on experience he holds a Masters of Science degree and a number of certifications, including MCTS, MCDBA and MCAD (.NET). His past work experiences include Technology Evangelist at Microsoft and Sr. Consultant at SolidQ. Follow @pinaldave
Send Author Pinal Dave
an email at pinal@sqlauthority.com

Email Subscription

Enter your email address to subscribe to this blog and receive notifications of new posts by email.