Information dropping infrequently from my brain into a vast ocean of network packets.

Friday, September 07, 2007

peeve no. 250 is unnecessary surrogate keys

The Wikipedia definition of surrogate key is here. In my database experience, it takes the form of an "autonumber" column, also referred to as "serial" or "identity". Whatever the database product or nomenclature, each new row has a guaranteed-unique, generated value in that column. The commonness of surrogate key columns verges on mundane.

As with the other peeves, my irritation has subjective and objective aspects. Objectively, a surrogate key has the downside of being fairly useless for queries for multiple rows--which also makes its index a fairly useless optimization. This is tempered by the syntactical ease of retrieving one known row, but that case is really more of a "lookup" than a "query", don't you think? (I would contrast the selection and projection of tuples, but why bother...)

Another objective danger with unnecessary surrogate keys, perhaps the worst, is the duplication of data which shouldn't be duplicated. By definition the surrogate key has no relationship with the rest of the columns in the row, so it isn't a proper primary key for the purpose of database normalization. If the cable company's database identifies me by an auto-generated number, and I sign up again as a new customer paying promotional rates after I move, the database won't complain at all. In practice, of course, the entry program should aid the user in flagging possible duplicates, but clearly that lies outside the database and the data model. Admittedly, judging whether "Joe Caruthers in Apt B3" is a duplicate of "Joseph Caruthers in Building B Apartment #3" is a decision best left up to an ape descendant rather than the database.

The objective tradeoffs of a surrogate key are worthy of consideration, but my subjective disdain for unnecessary surrogate keys goes beyond that. I feel violated on behalf on the data model. Its relational purity has been sullied. Lumping in an unrelated number (or value) to a row of real data columns feels out of place, like adding a complex "correction factor" to a theoretically-derived equation so it fits reality. But the unnecessary surrogate key has the opposite effect: it causes the table to have less resemblance to what it models. An unnecessary surrogate key leads someone to wonder if the table was the victim of a careless and/or thoughtless individual who went around haphazardly numbering entities to shut the database up about primary keys. Normalization, what's that?

I've seen enough to realize the convenience of a surrogate key, especially for semi-automated software like ORM mappers. It's also clear that sometimes a surrogate key is the only feasible way of uniquely identifying a set of entities, particularly when those entities are the "central hubs" of the data model and therefore the corresponding primary keys must act as the foreign keys for a large quantity of other tables. The technical downsides of a surrogate key can be mitigated by also including a real primary key for the table.

If the overall design doesn't make obvious the choice of each table's primary key, that's a clue the design should be reevaluated. (You did design the tables before creating them, right?) Don't forget that sometimes the primary key consists of every column. The table that serves as the "connecting" or "mediating" table in a many-to-many join is a good example, because it's essentially a table of valid combinations.

No comments:

Post a Comment

About Me

I blog as Art Vandalay for the following reasons: 1. less chance of readers prejudging the value of my opinions based on who I am, 2. greater freedom to say whatever I like without fear of it affecting my employment (but I acknowledge that no one can be purely anonymous on the usual Web), 3. I just want to separate my online persona from the "real" me. More explanation.

materialistic naturalism

1. Under the standard of sufficient impartial scrutiny, any allegedly supernatural things don't demonstrably exist. And even if one or more are still assumed to exist, none have demonstrable relevance on how the universe operates. (But these ideas may nevertheless affect culture, behavior, and thought, like many other mistaken ideas have.) 2. Moreover, all existing natural things come from, are composed of, participate in, and will eventually wear down into, material stuff: physical substances acting according to consistent forces.