As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
If this question can be reworded to fit the rules in the help center, please edit the question.

1

Since this is subjective, it should be a community wiki
–
Runscope API ToolsDec 31 '08 at 21:14

27 Answers
27

If you're going to be doing any syncing between databases with occasionally connected apps, then you should be using GUIDs for your primary keys. It is kind of a pain for debugging, so apart from that case I tend to stick to ints that autoincrement.

Autoincrement ints should be your default, and not using them should be justified.

A GUID isn't necessary, just change the step to 10 or 20 or however many server's you'll need to possibly sync with in the future.
–
Robert C. BarthDec 31 '08 at 22:06

26

90% of the time at least, a GUID is not needed and wastes space.
–
Jonathan LefflerJan 1 '09 at 3:10

8

I seriously feel GUIDs is an overkill. Never ever had a need to have GUIDs as my primary keys yet.
–
Cyril GuptaJan 3 '09 at 4:10

6

Or, instead of wasting space and risking collision with a GUID, make a composite key of the original primary key and a small identifier, where the small identifier is different for each sync source.
–
L̲̳o̲̳̳n̲̳̳g̲̳̳p̲̳o̲̳̳k̲̳̳e̲̳̳Feb 7 '10 at 22:25

3

A shop I worked for used GUIDs for everything, even when public identifiers were available, like ISO country or language codes. And even when a boolean or CHAR(1) would have been enough, like for sex. Needless to say, it was a nightmare to work with.
–
LumiMay 23 '11 at 18:11

I don't see an answer which points out (what I regard as) the really fundamental point - namely, that a primary key is what guarantees that you won't get two entries in the table for the same real-world entity (as modelled in the database). This observation helps establish what are good and what are bad choices for primary key.

For example, in a table of (US) state names and codes, either the name or the code could be the primary key - they constitute two different candidate keys, and one of them (normally the shorter - the code) is chosen as the primary key. In the theory of functional dependencies (and join dependencies - 1NF through 5NF - it is the candidate keys that are crucial rather than a primary key.

For a counter-example, human names generally make a bad choice for primary key. There are many people who go by the name "John Smith" or some other similar names; even taking middle names into account (remember: not everyone has one - for example, I don't), there is plenty of scope for duplication. Consequently, people do not use names as primary keys. They invent artificial keys such as the Social Security Number (SSN) or Employee Number and use them to designate the individual.

An ideal primary key is short, unique, memorable, and natural. Of these characteristics, uniqueness is mandatory; the rest have to flex given the constraints of real world data.

When it comes to determining the primary key of a given table, therefore, you have to look at what that table represents. What set or sets of column values in the table uniquely identifies each row in the table? Those are the candidate keys. Now, if each candidate key consists of 4 or 5 columns, then you might decide that those are too clumsy to make a good primary key (primarily on grounds of shortness). In those circumstances, you might introduce a surrogate key - an artificially generated number. Very often (but not always) a simple 32-bit integer is sufficient for the surrogate key. You then designate this surrogate key as the primary key.

However, you must still ensure that the other candidate keys (for the surrogate key is a candidate key too, as well as the chosen primary key) are all maintained as unique identifier - normally by placing a unique constraint on those sets of columns.

Sometimes, people find it difficult to identify what makes a row unique, but there should be something to do that, because simply repeating a piece of information doesn't make it any more true. And if you're not careful and do get two (or more) rows purporting to store the same information, and you then need to update the information, there is a danger (especially if you use cursors) that you will update just one row rather than every row, so the rows are out of synchrony and no-one knows which row contains the correct information.

This is a pretty hard-line view, in some respects.

I've no particular problem with using a GUID when they are needed, but they tend to be big (as in 16-64 bytes), and they are used too often. Very often a perfectly good 4-byte value would suffice. Using a GUID where a 4-byte value would suffice wastes disk space, and slows up even indexed access to the data since there are fewer values per index page, so the index will be deeper and more pages have to be read to get to the information.

Regarding your sample with US state names I would prefer a separate surrogate key, simply because the codes are something beyond your control. If they should change for whatever reason you get a problem.
–
Dirk Vollmar - 0xA3Jan 14 '09 at 23:43

(continued) For example, Germany replaced a 4-digit zip code system with a 5-digit system back in the 1990ies after the re-unification.
–
Dirk Vollmar - 0xA3Jan 14 '09 at 23:44

@divo: I am a strong advocate of artificial/surrogate keys, but even I don't see the 4-digit to 5-digit postal code change as being a good example. Postal codes are not generally used as keys to anything. (When's the last time you had to query a PostalCode table to find out something about that code? No, it's almost exclusively used as part of an address without being referenced in any other tables. I would say your suggestion is almost on par with using surrogate keys for addresses themselves.)
–
ErikEFeb 2 '10 at 17:55

@Emtucifor: Yes, maybe ZIP is not a very practical example, but my point was that if part of your surrogate key is out of your control and changes for whatever reason, you are in trouble. Think of someone creating a new social security number scheme, a new ISSN scheme or - maybe more realistic - a company deciding to create a new product id system after a merge, assigning new employee numbers to their employees to adjust their growth etc. These are all just fictional examples, but, as my previous example with the ZIP shows, sometimes a well-established system might change.
–
Dirk Vollmar - 0xA3Feb 2 '10 at 19:18

@divo: Agreed that Social Security Number is a good example.
–
ErikEFeb 3 '10 at 0:40

This is only a religious issue because people seek a universal right answer. The fact that both your team and this SO thread shows so much disagreement should be a clue that there are good reasons to use all the solutions you describe, in different circumstances.

Surrogate keys are useful when no other attribute or set of attributes in the table is suitable to identify rows uniquely.

Natural keys are preferred, when possible, to make the table more human-readable. Natural keys also allow the foreign key in a dependent table to contain a real value instead of a surrogate id. E.g. when you need to store state (CA, TX, NY) you might as well use a char(2) natural key instead of an int.

Use compound primary keys where appropriate. Do not add an "id" surrogate key unnecessarily when a perfectly good compound key exists (this is especially true in many-to-many tables). A mandate for a three-column key in every table is absolute nonsense.

GUIDs are a solution when you need to preserve uniqueness over multiple sites. They are also handy if you need values in the primary key to be unique, but not ordered or consecutive.

INT vs. BIGINT: it's not common that a table requires a 64-bit range for primary keys, but with the increasing availability of 64-bit hardware it shouldn't be a burden, and gives more assurance that you won't overflow. INT is of course smaller, so if space is at a premium it can give a slight advantage.

I disagree as much as a person can possibly do so. Natural keys are horrible. What if one wants to change the data? Oh, you can't. Writing joins on composite natural keys is a pain. Carrying that composite key to all your related tables is a waste.
–
Robert C. BarthDec 31 '08 at 22:23

2

@Robert: read about "ON UPDATE CASCADE". But I get what you are saying, and I agree it's best to use a surrogate key most of the time, because attributes are subject to change and to be non-unique.
–
Bill KarwinJan 1 '09 at 0:43

Primary keys should be immutable. Cascade updates are only a ugly hack for a bad design decision in this case. Natural keys are NEVER preferred. Same to composite keys, that spread themselves like a plague. Anyone with more than 3 months of database development experience would know this.
–
F.D.CastelJan 5 '09 at 6:28

5

@F.D.: I don't agree with your unequivocal statement, and I've been developing with SQL databases since 1992. But certainly it's true that surrogate keys are best able to remain immutable.
–
Bill KarwinJan 5 '09 at 18:59

3 columns for a primary key? I would say that columns should have appropriate unique constraints as the business rules demand, but I'd still have a separate surrogate key. Compound keys mean business logic enters into the key. If the logic changes, your whole schema is screwed.

I always go with the surrogate key. A surrogate key (usually an identity column, autoincrement, or GUID) is one in which the key is not present in the data itself. A natural key, on the other hand, is one that, on its own, uniquely identifies the row. As near as I can tell in life, there are hardly any real natural keys. Not even things like SSN in the United States is a natural key. Composite primary keys are a disaster waiting to happen. You can't edit any of that data (which is the major drawback of any natural key, composite or not), but worse is that with a composite key, now you have to perpetuate that key data into every related table. What a giant waste.

Now, for selection of the surrogate key, I stick with identity columns (I work mostly in MS SQL Server). GUID's are too large and Microsoft recommends against using them as a PK. If you have multiple servers, all you need to do is make the increment 10 or 20 or whatever you think the maximum number of servers you'll ever need to sync/expand to, and just inc the seed for each table on each subsequent server, and you'll never have a data collision.

Of course, because of the increment, I make the identity column a BigInt (otherwise known as a long [64 bits]).

Doing a bit of math, even if you make the increment 100, you can still have 92,233,720,368,547,758 (> 92 quadrillion) rows in your table.

One thing you should never do is use a smart key. That is a key where information about the record is coded in the key itself, and it will eventually bite you.

I worked one place, where the primary key was the account ID, which was a combination of letters and numbers. I don't remember any specifics, but, for example, those accounts that were of a certain type, would be in the 600 range, and of another type, started with 400. That was great, until that customer decided to ask for both types of work. Or changed the type of work they did.

Another place, used the location in the tree as the primary key for records. So there would be records like the following.

I think the use of the word "Primary", in the phrase "Primary" Key is in a real sense, misleading.

First, use the definition that a "key" is an attribute or set of attributes that must be unique within the table,

Then, having any key serves several often mutually inconsistent purposes.

To use as joins conditions to one or many records in child tables which have a relationship to this parent table. (Explicitly or implicitly defining a Foreign Key in those child tables)

(related) Ensuring that child records must have a parent record in the parent tab;e (The child table FK must exist as Key in the parent table)

To increase perforamce of queries that need to rapidly locate a specific record/row in the table.

To ensure data consistency by preventing duplicate rows which represent the same logical entity from being inserted itno the table. (This is often called a "natural" key, and should consist of table (entity) attributes which are relatively invariant.)

Clearly, any non-meaningfull, non-natural key (like a GUID or an auto-generated integer is totally incapable of satisfying #4.

But often, with many (most) tables, a totally natural key which can provide #4 will often consist of multiple attributes and be excessively wide, or so wide that using it for purposes #1, #2, or #3 will cause unacceptable performance consequencecs.

The answer is simple. Use both. Use a simple auto-Generating integral key for all Joins and FKs in other child tables, but ensure that every table that requires data consistency (very few tables don't) have an alternate natural unique key that will prevent inserts of inconsistent data rows... Plus, if you always have both, then all the objections against using a natural key (what if it changes? I have to change every place it is referenced as a FK) become moot, as you are not using it for that... You are only using it in the one table where it is a PK, to avoid inconsistent duplciate data...

As to GUIDs, be very careful using them, as using guids in an index can hose index fragmentation. The most common algorithms used to create them puts the "random" portion of the guid in the most significant bit positions... This increases the requirement for regular index defragmentation / Reindexing as new rows are added.

If your primary key is a GUID, do not make it a clustered index. Since GUIDs are non-sequential, the data will be re-arranged on disk during almost every insert. (Yuck.) If using GUIDs as primary keys, they should be nonclustered indexes.

Very good point - one needs to distinguish between the LOGICAL concept of a primary key (might be valid to use a GUID for that, especially if replication is involved), and the PHYSICAL concept of the clustering key - that should NEVER be a GUID since it leads to excessive index fragmentation
–
marc_sJan 7 '09 at 6:39

3

This is in fact not accurate. The data will be inserted in order, which given the GUID's random nature could end up being anywhere across the table. On the off chance that there isn't room, a page split will happen, but certainly not "re-arranging on disk during every insert" not even close.
–
Ralph ShillingtonOct 16 '09 at 23:23

I'm a fan of the auto-increment as primary key. I know deep in my heart that this is a cop-out, but it does make it so easy to sort data by when it was added (ORDER BY ID DESC, f'r instance).

3 columns sounds awfully harsh to humanly parse.

And that's the trade-off -- how much of the relational capability do you need, versus making THIS TABLE RIGHT HERE understandable to a human interrogating it (versus the stored-procedure or programmatic interface).

Are you talking about a "natural key", e.g. "name and date of birth"? A natural key might be ideal if it exists, but most candidates for a natural key are either not unique (several people with the same name) or not constant (someone can change their name).

Int/ BigInt which autoincrement are good enough primary keys.

I prefer Guid. A potential problem with autoincrement is that the value (e.g. "order id") is assigned by the database instance (e.g. by the "sales database") ... which won't entirely work (instead you start to need compound keys) if you ever need to merge data created by more than one database instance (e.g. from several sales offices each with their own database).

Primary keys are required to be unique, but are not required to be constant. Hence foreign keys declared with "ON UPDATE CASCADE". But making an assumption that primary keys are constant helps to simplify many applications. This is one benefit of surrogate keys.
–
Bill KarwinDec 31 '08 at 21:40

Auto increment columns. I am able to make my code work seamlessly with SQL Server or Oracle, one using identity the other using sequences through my DAL, and I couldn't be happier. I agree, GUIDs sometimes are necessary if you are doing replication or sending data away to receive it later on afer processing.

Watch out if this is going to be a really Really REALLY REALLY big database, lots of load, and fast access.

At my last job, where we had databases of 100 to 500 million records, our database guys strongly argued against GUIDs, and for an appropriately sized decimal number. They felt that (under Oracle) the size difference in the internal storage for a string Guid - vs- a decimal value would make a very noticeable difference in lookups. ( Bigger keys = deeper trees to traverse)

The random nature of GUIDs also reduces the fill-factor for index pages significantly - this dramatically increases tearing and disk I/O.

"Reduces the fill-factor"? Not sure what that could mean Fill-factor is a one shot deal, defined as the percent of free space requested at the leaf-level of the index at the time the index is built. GUID values by their random nature distribution across the breadth of the leaf-level on inserts into that free-space that fill-factor provided.
–
Ralph ShillingtonOct 17 '09 at 1:32

Since when is a GUID a string? GUIDs should be stored internally as 16 bytes by any respectable DBMS. Storing as 32 bytes in the hex representation would be unconscionable! (or 36 with dashes, or 38 with curly braces)
–
ErikEFeb 2 '10 at 18:04

This is a classic "it depends". There's no one right answer for every project. I like different things for different situations. It depends on whether I'm using an ORM and what it supports. It depends on the overall architecture (distributed or not, etc). Just pick one that you think will work and move on to arguing over tabs and spaces.

I tend to use option #1 or #3 depending on the size, the number of people connecting, and whether it is a multiple database server situation or not.

Option #2 doesn't make much sense to me. If any one of the three is not enough to identify a unique record, then it's possible (without going through extra machinations) two have two records show up with the same values in all three columns. If you want to enforce uniqueness on any combination of the three, then just add an index for them.

I've only use an auto-increment int or a GUID. 99% of the time I've use auto-increment int. It's just what I was taught to use when I first learned about databases and have never run into a reason not to use them (although I know of reasons why a GUID would be better).

I like auto increment ints because it helps with readability. For example I can say "take a look at record 129383" and it's pretty easy for someone to go in and find it. With a GUID that's nearly impossible to do.

Past a basic definitional answer, what constitutes a good primary key is left largely to religion and break room arguments. If you have something that is, and will always, map uniquely to an individual row, then it will work fine as a primary key. Past that point, there are other considerations:

Is the primary key definition not overly complex? Does it avoid introducing unnecessary complexity for the sake of following a "best-practice"?

Is there a better possible primary key that would require less overhead for the database to handle (i.e. INTEGER vs. VARCHAR, etc)?

Am I ABSOLUTELY certain that the uniqueness and defined-ness invariant of my primary key will not change?

This last one is likely what draws most people to use things like GUIDs or self-incrementing integer columns, because relying on things like addresses, phone numbers, first/last names, etc, just don't cut it. The only invariant about people I can think of is SSNs, but then I'm not even 100% certain about those remaining forever unique.

The way I approach primary keys (and I feel is the best) is to avoid having a "default" approach. This means instead of just slapping on an auto-incrementing integer and calling it a day I look at the problem and say "is there a column or group of columns that will always be unqiue and won't change?" If the answer is yes then I take that approach.

The latter may be an integer, with dashes added and in base 16. But yes, 404040 is faster to process than the long GUID. Then again, 0 is even faster to process because it doesn't require a single bit of data!
–
stragerJan 1 '09 at 20:26

Only slightly relevant, but one thing I've started doing recently when I have small classification tables (essentially those that would represent ENUMs in code) is that I'll make the primary key a char(3) or char(4). Then I make those primary keys representative of the lookup value.

For example, I have a quoting system for our internal Sales Agents. We have "Cost Categories" that every quote line item is assigned one of... So I have a type lookup table called 'tCostCategories', where primary key is 'MTL', 'SVC', 'TRV', 'TAX', 'ODC'. Other columns in the lookup table store more details, such as the normal english meanings of the codes, "Material", "Service", "Travel", "Taxes", "Other Direct Costs", and so forth.

This is really nice because it doesn't use any more space than an int, and when you are looking at the source data, you don't have to link the lookup table to know what the heck the value is. For example, a quote row might look like:

It's much easier that using an int to represent the categories and then linking 1, 2, 3 on all the lines - you have the data right there in front of you, and the performance doesn't seem affected at all (not that I've truly tested.)

As far as the real question goes... I like RowGUID uniqueidentifiers. I'm not 100% on this, but don't all rows have internal RowGuid's anyway?? If so, then using the RowGuid would actually take less space than ints (or anything else for that matter.) All I know is that if it's good enough for M$ to use in GreatPlains then it's good enough for me. (Should I duck??)

Oh one more reason I use GUIDs - I use a hierarchical data structure. That is, I have a table 'Company' and a table 'Vendor' for which the Primary Keys match up. But I also have a table 'Manufacturer' that also 'inherits' from Company. The fields that are common to Vendors and Manufacturers don't appear in those tables - they appear in Company. In this setup, using int's is much more painful than Guids. In the very least, you can't use identity primary keys.

Yes you can, you just don't make the subtype tables have the identity property, instead they get explicit inserts of the supertype table value. Please see stackoverflow.com/questions/2112882/…
–
ErikEFeb 2 '10 at 18:08

I like natural keys, whenever I can trust them. I'm willing to pay a small performance price price in order to use keys that make sense to the subject matter experts.

For tables that describe entities, there should be a simple natural key that identifies individual instances the same way the subject matter people do. If the subject matter does not have trustworthy identifiers for one of the entities, then I'll resort to a surrogate key.

For tables that describe relationships, I use a compound key, where each component references an entity that participates in the relationship, and therefore a row in an entity table. Again, the performance hit for using a compound key is generally minimal.

As others have pointed out, the term "primary key" is a little misleading. In the Relational Data Model, the term that's used is "candidate keys". There could be several candidate keys for a single table. Logically, each one is just as good as another. Choosing one of them as "primary" and making all references via that key is simply a choice the designer can make.

"trustworthy" is not a property of a key by itself. Rather, it has to do with the key in the context of the people who supply the data. If you are writing an app to be sold to somebody who will actually be managing the data, you have to guess which keys will be trustworthy to the client or not. Given the variety of clients, you're almost surely going to guess wrong for some fraction of your clientele.
–
Walter MittyFeb 3 '10 at 14:23

Having said the above, here's an example of a key that we trusted way back when. We had a database about courses. It included textbooks and other course materials about courses, scheduled course offerings, instructors who were qualified to teach courses, course prerequisites, tuition, and so on. When course development created a new course, one of the first things they did was to assign a course code. They were responsible for making sure that course codes were unique, and that courses never changed their code, once assigned. It was part of the data as given to us.
–
Walter MittyFeb 3 '10 at 14:30

Another good example of trusted natural key is VIN (Vehicle Identification Number). For the last many years, every vehicle sold as new has a VIN attached to it. They can be trusted to be unique and unchanging.
–
Walter MittyFeb 3 '10 at 14:32

In the event that you need to scale out or you need to assign the primary key by alternate means they will be your friend. You can add indexes for everything else.

update to clarify my statement.

I've worked on a lot of different kinds of sites. From small single server deals to large ones backed with multiple DB and web servers. There have certainly been apps that would have been just fine with auto incrementing ints as primary keys. However, those don't fit the model of how I do things.

When using a GUID you can generate the ID anywhere. It could be generated by a remote server, your web app, within the database itself or even within multiple databases in a multimaster situation.

On the other hand, an auto incremented INT can only be safely generated within the primary database. Again, this might be okay if you have an application that will be intimately tied to that one backing DB server and scaling out is not something you are concerned with.

Sure, usage of GUIDs mean you have to have nightly reindexing processes. However, if you are using anything other than an auto incremented INT you should do that anyway. Heck, even with an INT as the primary it's likely you have other indexes that need regenerated to deal with fragmentation. Therefore, using GUIDs doesn't exactly add another problem because those tasks need to be performed regardless.

If you take a look at the larger apps out there you will notice something important: they all use Base64 encoded GUIDs as the keys. The reason for this is simple, usage of GUIDs enables you to scale out easily whereas there can be a lot of hoops to jump through when attempting to scale out INTs.

Our latest app goes through a period of heavy inserts that lasts for about a month. After that 90+% of the queries are all selects for reporting. To increase capacity I can bring up additional DB servers during this large insert period; and later easily merge those into a single DB for reporting. Attempting to do that with INTs would be an absolute nightmare.

Quite frankly, any time you cluster a database or setup replication the DB server is going to demand that you have GUIDs on the table anyway. So, if you think that your system might need to grow then pick the one that's good.

"Guids.period": That is so wrong. GUIDs should be used where appropriate. As the other commenter pointed out, it might make life as a programmer easy, but affects the overall size and performance of the DB.
–
Mitch WheatJan 1 '09 at 1:33

At the end of the day, I can scale my apps out across multiple database servers without issue. But I guess you guys work on small sites.
–
NotMeJan 2 '09 at 18:18

3

GUID might be ok for the logical primary key, but NEVER EVER EVER use a GUID column as your CLUSTERING key - you'll be drowning in index fragmentation leading to POOR performance .....
–
marc_sJan 7 '09 at 6:40

This is a complex subject whether you realized it or not. Might fall under the section on this StackOverflow FAQ.

What kind of questions should I not ask here?

Avoid asking questions that are subjective, argumentative, or require extended discussion. This is a place for questions that can be answered!

This has been debated for years and will continue to be debated for years. The only hints of consensus I have seen is that the answers are somewhat predictable depending on if you are asking a OO guy (GUIDs are the only way to go!), a data modeler (Natural keys are the only way to go!), or a performance oriented DBA (INTs are the only way to go!).

I will not let the discussion go to long. I was just curious to see the general consensus.
–
PerpetualcoderDec 31 '08 at 21:34

1

I say ask whatever questions you wish! Else, this community will become static and overcontrolled like wikipedia seems to have become. Seems to me like some times you need to let people ask whatever do choose to ask. Trust them, and they might come to trust themselves!
–
Nicholas LeonardJan 4 '09 at 23:16