If you are an Access user, or migrated from Access to SQL Server, you may think that a primary key and an Identity/Autonumber column are the same thing. Well, they’re not, and there are some very good reasons NOT to use an Identity column for the primary key on your tables.

Parent child relationships that go several levels deep are classic example. By the time you get to the 4th level, your FK is AT LEAST composed of 3 columns. Aside from being a pain to type the join statements, the actual join has so much more work to do. An identity column in this case makes shorter work of it. The composite key is handy as a unique alternate key to prevent duplicates on other columns. Using the identity also allows speedier performance when joining.

Identity columns take much less space also. If you index your foriegn keys (good idea!) and you have lots of composite keys you are using much more storage both in tables and indexes. In a small database no big deal... but in >GB size databases different story.

I think your argument for showing the LAN login is pretty weak. Showing them the name and other related info is much more usable than an often cryptic login. In this case, you would retrieve the record to get the extra info anyway so it makes little difference if the key was an identity or a login.

Sorry guys, but I think this article is not up to the usual standard. It's too easy to shoot holes in!

Perhaps a better idea woul dbe to show advatnages and disadvantages of identities and when to use alternative approaches.

I agree with the author that there are times when using an IDENTITY column as the Primary Key is ill advised and poor design. I believe that relationship tables especially should have the Foreign Key fields used as the composite Primary Key, although I still include an IDENTITY column just for singular field referencing in applications.

One of the issues that the author didn't address are the many negatives of using fields such as he refers to (LAN ID, etc) as Primary Keys. I can list a few.

1) The overhead to maintain cascading updates. If your LAN ID changes, you now need to go and change every row in every table that references that LAN ID with a trigger or some kind of application logic. This is unneccessary work.

2) The definite possibility of historical data contamination. If you leave your company, and somebody else comes in and is assigned your old LAN ID, what will HR do with all of your old employee data, or anything in the database that references your LAN ID?

3) The invariable nature of changing data. Yes, you may have a LAN ID of JSmith, but it may be for domain1. What if domain2 has a JSmith in it also? That means you then need to have add a domain field to make up the Primary Key (to each and every table). That means that your favorites table now has 3 fields for the Primary Key. The other option is to go and update all of your LAN ID's to include domain names...back to the cascading update problem. Why have to continually change the schema because of the fact that data needs will inevitably change.

4) I've actually found very few instances where one field in a table can be used as the Primary Key, so to get a unique row you have to use 2,3,4, sometimes more columns just to come up with a Primary Key. If you do that, you end up with SQL code that is almost unintelligle and much more difficult to maintain. Who wants to write a 2 table join that has to have 3 fields inner joined together?

5) Everything else is invariably slower. If you use your LAN ID or some other text value(s) to create your primary key, and you need to index them, they'll take up more index space, and be slower.

I do believe that there are times when using a particular field for a Primary Key is a valid solution, but I've found it to be the exception to the rule.

I also think that when talking about IDENTITY, Primary Keys, Constraints, and Indexes, you have to explain the interaction a little more succinctly, especially when talking to Access programmers, because I think that Access programmers take the concept of Primary Key, Autonumber, and indexing to basically mean the same thing.

And if I am all wet here, as Ron Soukup intimates, then I'd sure like to know how to go through the effort to design tables "right"! :-)

Should this article be retracted?From the logical point of view, the usage of identity columns may not be a necessity.However, when dealing with the Physical world, with large databases or high transaction databases with many updates, identities are a must.

With this being a series of articles, we should wait till the few show up before passing on judgment.

If the author still doesn't get it, would somebody please explain to the author the importance of surrogate/alternate keys. And then have the author read a few books on Data Warehousing.

I agree with others who point out that using a identity as a primary key is more practical in many situations. One concern I have, though, is how are identity fields handled in replication?

Say you have table_a, which uses an identity field for primary key my_key, in databases db1 and db2 that are part of a merge replication. A user inserts a row into db1..table_a, and in the same instant a different user inserts into db2..table_a, assigning the same value to my_key. Then a half-second later db1 and db2 are replicated against each other. What happens to the values in my_key? Does each row have a new value? Are the new values updated throughout table_a's relationships?

1) There is no standardization among tables, which makes organization more difficult for the guys writing the SQL.

2) It makes the incorrect assumption that the database is the front end to the user. The db should be used to store the data, the end user doesn't give a rats ass what the word "normalization" or "primary key" mean, whether they click on a button and it runs the query:

select stinkfactor from flowers where flowername='Rose' color='Red'

vs

select stinkfactor from flowers where flowerID=58

That primary key should be completely hidden to the end user. Even in complex search routines, the application itself should worry about the correct SQL to send to the DB, not the end user. That being said, i don't see the need for a logical primary key.

I also believe it is a bad idea to mix styles. If you use identity keys, use them in every table, if you use logical fields, use them in every table, if you use UUIDs, use them in every table.

Personally, I'm moving towards UUIDs, I'm able to pay the extra pennies in added disk space from an extra field for the added benefit of a standard table structure to explain to my SQL coders.

My real question is how much of a performance hit will i take going to a text string as my PK over an integer PK? I'm in search of good test results to show the difference there.

The statement that identity is totally arbitrary and useless violates the key idea behind data normalization. This is an asinine argument. You can always use other columns for unique constraints and even set your primary key to span multiple columns. You can even join tables on whatever columns you like. But the pure nature of numeric data in this column is that it is unique, small and can create an ordinal reference to set up complex normalization schemes with small amounts of data storage overhead. On the other side of this think about large datamarts and trying to manage these relationships on your own.

I am not commenting on the article altogether, but only a quote attributed to me, which I have never made. I certainly believe that sometimes Identity is a perfectly good choice as a PK. I have never said or believed otherwise. (Identity is not necessarily a great choice as your clustering index - but thats another discussion, and thats a physical issue, not a logical one. And you certainly could have your PK be on your identity column, and your clustered index on something else in any case.)

The author says he is 'paraphrasing' me. All I can think of is that he is referring to a comment I wrote in Inside SQL Server, where I decry the use of cursors purely to mimic ISAM style behavior with a series of nested selects. I said that people often do this as a quick and dirty way to port an ISAM app to SQL Server, but I viewed that approach only for people who think there is not enough time to do things right, but always enough time to do it over. My comment had nothing to do with identity as a PK choice. His paraphrasing of a quote of mine is towards an entirely different subject.

A better tongue in cheek saying that I'd use in this discussion, where I think Identity CAN be a perfectly suitable PK (but as always it depends), is "Generalizations are always wrong." Regards,Ron

I think it was the choice of identity as a clustered index.I'm pretty sure it did turn up a few times - can't remember where though. I've certainly seen it stated that you should never make an identity a clustered index - but not since v7 came out for some reason.Bit of a shame that doesn't still appear as it used to be a good interview question for people who had just read a few books.It may have originated from avoiding hot spots in oltp databases.

==========================================Cursors are useful if you don't know sql.Beer is not cold and it isn't fizzy.

quote:The author says he is 'paraphrasing' me...My comment had nothing to do with identity as a PK choice. His paraphrasing of a quote of mine is towards an entirely different subject.

First things first: Mr. Soukup is 100% correct, and I apologize for the context in which his quote was used. It was indeed taken from Inside SQL Server 6.5. As with just about everything in the article, the intention for using his words didn't translate in the final text; that failure is entirely mine, not Ron's.

I hope this didn't cause you undue grief Ron, and hope that you can believe it was meant in the spirit of "imitation is the sincerest form of flattery", which WAS my intention, honestly! Inside SQL Server 6.5 was the first SQL Server book I bought, and it's been a tremendous resource for me even today, when I only use 7.0 and 2000. Thank you.

At the risk of (once again) making my point badly, my reasoning for using Ron's quote went something like this: if you are like me, and MS Access was your first serious DB, AND you are going to SQL Server, you owe it to yourself to review your table designs BEFORE you move them up.

Those convenient "autonumber" columns that Access puts in for you may NOT be a good idea in a SQL Server database, as I learned much later at some cost. The inertia behind changing a design that "works", even if it's not a good one, was how I interpreted Ron's words, and I felt that simply upsizing Access tables was "a quick and dirty way to port an ISAM app to SQL Server", and that it was "for people who think there is not enough time to do things right, but always enough time to do it over".

Once again, my apologies for making a bad presentation of some excellent advice from Ron Soukup. As far as primary keys are concerned, I am forever keeping my mouth shut!

When Celko says you haven't got a table if you haven't got a key, I understand him to mean that you have to be able to identify each record uniquely. That certainly makes sense.

If you have a table with duplicates, then you have a problem. Numbering the rows e.g. by adding an autonumber/identity column allows you to identify each row uniquely but: it doesn't solve the core problem (addressed in the article): how do you differentiate between "Smith, John" in row 27 and "Smith, John" in row 345?

Identity columns should not be used as a way of making each row unique. The rows have to be uniquely identifiable without the identity column. Adding an identity then makes sense as a way to speed up joins, as several commentators have remarked.

I deal with school data across multiple schools. We have a lot of John Smiths. If someone wants the attendance data for a particular John Smith, they have to be able to identify that John Smith. In the real world, users do not see (should not be burdened with) our techie internal identity values so they will never say "the data for the John Smith with PK=1234567".

They will say "the data for John Smith in Class 5b, CityVille High School". Now if there are two of those (not uncommon), there has to be some way to tell them apart, some way accessible to users. In the real world, often but not always there is a Roll Number or Student ID number (the number that appears on a student's ID card, not the Identity value in the table!). The Roll No is guaranteed to be unique. It is therefore the ideal PK. For the same reason Social Security Number (National Insurance number in the UK) is a popular PK.

But the Roll No is not something known to most users including teachers (only the school admin people deal with roll nos) and sometimes there is no such convenient unique indentifier (passport numbers would be great, but they are not available to most users and probably illegal for this purpose).

So, until the UN decides to allocate a public GUID to each human at birth, we have to use a combination of two or more of name, middle initial, date of birth, house number etc that makes the probability of confusion (duplicates) negligeable. The probability of two students sharing forename, surname, middle initial and DoB is actually small, but maybe nt small enough. Adding house number should reduce it by a factor of 10 or more. Perhaps adding the name of the mother would make it so enough and so on, until you the expected number of dupicates over the lifetime of your system is small enough to live with.

Whatever combination we decide is going to be our way of uniquely identifying students should de facto have a unique constraint added. You may, once a blue moon, get a customer complaining they can't add a student because the system says the student is already in there. It is our 'natural' PK, but it's a real pain as a Foreign Key (FK), so, after adding our uniqueness constraint, we add an Identity column that functions as a kind of stand-in for our lengthy multi-column PK. We use that identity for joining etc.

I use Identity columns almost everywhere for making joins and programming easier but I *always* make sure my rows are unique (and enforce it with a constraint) without the Identity.

For data import/export and for replication, the Identity columns can be a real pain, as others have already remarked. But for such situations, Identity columns should not be included: they are not really part of the information contained in the record. If they are treated as a purely internal convenience and all your rows are unique without them, the import/export and replication problems do not arise.

A definition of a database is that every row in every table is unique - so if yuo follow this defintion and have duplicate rows then you don't have a database.

"For data import/export and for replication, the Identity columns can be a real pain, as others have already remarked. But for such situations, Identity columns should not be included: they are not really part of the information contained in the record. If they are treated as a purely internal convenience and all your rows are unique without them, the import/export and replication problems do not arise."

Nice generalisation but unfortunatly all situations should be treated uniquely. Sometimes it is good to transfer the identity, sometimes it isn't, sometimes it is essential. And the identity may be considered part of the information provided by the source.

Remember that things like identities don't have to be part of the logical model but can be added at the physical implementation stage.

==========================================Cursors are useful if you don't know sql.Beer is not cold and it isn't fizzy.

quote:For the same reason Social Security Number (National Insurance number in the UK) is a popular PK.

... (passport numbers would be great, but they are not available to most users and probably illegal for this purpose).

I've got bad news for you mono... Using the SSN as a primary key is illegal, too. But many companies still do it. There are a couple other problems with SSN as PK. I have heard (no confirmation) that SSN's get re-used after people pass away. Also, not every person has an SSN. What do you do then? Put in a dummy number? Then why not just put one in anyway?

And, of course, the common argument for using Identities, "What do you do when the SSN was entered incorrectly?" You have the hassle of cascading updates. But you have this possibility with whatever field you choose as a PK.

Sure it consumes resources for additional RI checking and it has a few limitations (cyclic). The extra time needed for this, IMHO is time well spent.

Yeah, yeah, large database are different. My definition of large is not total size but transactions per day!

I would rather have a natural over a surrogate key any day.If the PK is a Identity, it is not a relation at all. You can't update it!

And people bitching about size! Bloody hell what's a Gig worth these days, $20 bucks?

And people bitching about how it is hardier to write on multiple joins! It is not, only more time consuming (by the way most of us are paid on time!). These same people never tell you that natural keys drastically reduce joins in SELECT stamtements. It most cases none are required at all!

Yep SSN wouldn't be a good choice of identifier.There is also a rumour that at some point in the 50's in the UK they made a mistake and issued duplicate national insurance numbers - now the system that issues the numbers should have this with a unique constraint.

>> what's a Gig worth these daysAbout 6 months of effort in some companies it seems.

>> Using the SSN as a primary key is illegal??? there is a law against putting a constraint on a field in a database?

==========================================Cursors are useful if you don't know sql.Beer is not cold and it isn't fizzy.

quote:>> Using the SSN as a primary key is illegal??? there is a law against putting a constraint on a field in a database?

<begin disclaimer>As I was told... </disclaimer> in the U.S. it's not legal for anyone other than the U.S. Government (or maybe just the Social Security Administration) to require citizens to give their SSN. No other organization is supposed to use this number as an identifying piece of information. (I think this was to ease everyone's fears that they were being "marked by the beast"). However, it is common practice (and convenience) for companies to ask for your SSN in order to do credit checks and who knows what else. Very rarely have I heard of anyone refusing to give it, but I know a couple of folks who don't. Mostly, it's just damn inconvenient to jump through all the hoops if you refuse to give this info.