Ten Common Database Design Mistakes

If database design is done right, then the development, deployment, and subsequent performance in production will give little trouble. A well-designed database 'just works'. There are a small number of mistakes in database design that cause subsequent misery to developers, managers, and DBAs alike. Here are the ten worst mistakes.

No list of mistakes is ever going to be exhaustive. People (myself included) do a lot of really stupid things, at times, in the name of “getting it done.” This list simply reflects the database design mistakes that are currently on my mind, or in some cases, constantly on my mind. I have done this topic two times before. If you’re interested in hearing the podcast version, visit Greg Low’s super-excellent SQL Down Under. I also presented a boiled down, ten-minute version at PASS for the Simple-Talk booth. Originally there were ten, then six, and today back to ten. And these aren’t exactly the same ten that I started with; these are ten that stand out to me as of today.

Before I start with the list, let me be honest for a minute. I used to have a preacher who made sure to tell us before some sermons that he was preaching to himself as much as he was to the congregation. When I speak, or when I write an article, I have to listen to that tiny little voice in my head that helps filter out my own bad habits, to make sure that I am teaching only the best practices. Hopefully, after reading this article, the little voice in your head will talk to you when you start to stray from what is right in terms of database design practices.

So, the list:

Poor design/planning

Ignoring normalization

Poor naming standards

Lack of documentation

One table to hold all domain values

Using identity/guid columns as your only key

Not using SQL facilities to protect data integrity

Not using stored procedures to access data

Trying to build generic objects

Lack of testing

Poor design/planning

“If you don’t know where you are going, any road will take you there” – George Harrison

Prophetic words for all parts of life and a description of the type of issues that plague many projects these days.

Let me ask you: would you hire a contractor to build a house and then demand that they start pouring a foundation the very next day? Even worse, would you demand that it be done without blueprints or house plans? Hopefully, you answered “no” to both of these. A design is needed to make sure that the house you want gets built, and that the land you are building it on will not sink into some underground cavern. If you answered yes, I am not sure if anything I can say will help you.

Like a house, a good database is built with forethought, and with proper care and attention given to the needs of the data that will inhabit it; it cannot be tossed together in some sort of reverse implosion.

Since the database is the cornerstone of pretty much every business project, if you don’t take the time to map out the needs of the project and how the database is going to meet them, then the chances are that the whole project will veer off course and lose direction. Furthermore, if you don’t take the time at the start to get the database design right, then you’ll find that any substantial changes in the database structures that you need to make further down the line could have a huge impact on the whole project, and greatly increase the likelihood of the project timeline slipping.

Far too often, a proper planning phase is ignored in favor of just “getting it done”. The project heads off in a certain direction and when problems inevitably arise – due to the lack of proper designing and planning – there is “no time” to go back and fix them properly, using proper techniques. That’s when the “hacking” starts, with the veiled promise to go back and fix things later, something that happens very rarely indeed.

Admittedly, it is impossible to predict every need that your design will have to fulfill and every issue that is likely to arise, but it is important to mitigate potential problems as much as possible, by careful planning.

Ignoring Normalization

Normalization defines a set of methods to break down tables to their constituent parts until each table represents one and only one “thing”, and its columns serve to fully describe only the one “thing” that the table represents.

The concept of normalization has been around for 30 years and is the basis on which SQL and relational databases are implemented. In other words, SQL was created to work with normalized data structures. Normalization is not just some plot by database programmers to annoy application programmers (that is merely a satisfying side effect!).

SQL is very additive in nature in that, if you have bits and pieces of data, it is easy to build up a set of values or results. In the FROM clause, you take a set of data (a table) and add (JOIN) it to another table. You can add as many sets of data together as you like, to produce the final set you need.

This additive nature is extremely important, not only for ease of development, but also for performance. Indexes are most effective when they can work with the entire key value. Whenever you have to use SUBSTRING, CHARINDEX, LIKE, and so on, to parse out a value that is combined with other values in a single column (for example, to split the last name of a person out of a full name column) the SQL paradigm starts to break down and data becomes less and less searchable.

So normalizing your data is essential to good performance, and ease of development, but the question always comes up: “How normalized is normalized enough?” If you have read any books about normalization, then you will have heard many times that 3rd Normal Form is essential, but 4th and 5th Normal Forms are really useful and, once you get a handle on them, quite easy to follow and well worth the time required to implement them.

In reality, however, it is quite common that not even First Normal Form is implemented correctly.

Whenever I see a table with repeating column names appended with numbers, I cringe in horror. And I cringe in horror quite often. Consider the following example Customer table:
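The original table diagram is not reproduced in this copy, but a table of this kind might look something like the following sketch (the table and column names are illustrative):

```sql
-- Illustrative sketch of a denormalized Customer table with
-- repeating payment columns; names and types are assumptions
CREATE TABLE Customer
(
    CustomerId  int         NOT NULL PRIMARY KEY,
    Name        varchar(60) NOT NULL,
    Payment1    money       NULL,
    Payment2    money       NULL,
    Payment3    money       NULL,
    -- ...and so on, up to...
    Payment12   money       NULL
);
```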

Are there always 12 payments? Is the order of payments significant? Does a NULL value for a payment mean UNKNOWN (not filled in yet), or a missed payment? And when was the payment made?!?

A payment does not describe a Customer and should not be stored in the Customer table. Details of payments should be stored in a Payment table, in which you could also record extra information about the payment, like when the payment was made, and what the payment was for:
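A sketch of such a Payment table might look like this (again, names and types are illustrative, not the article's original figure):

```sql
-- Each row represents one payment, with room for the extra
-- details that had nowhere to live in the repeating columns
CREATE TABLE Payment
(
    PaymentId   int          NOT NULL PRIMARY KEY,
    CustomerId  int          NOT NULL
        REFERENCES Customer (CustomerId),
    Amount      money        NOT NULL,
    PaymentDate datetime     NOT NULL,
    Memo        varchar(100) NULL  -- what the payment was for
);
```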

In this second design, each column stores a single unit of information about a single “thing” (a payment), and each row represents a specific instance of a payment.

This second design is going to require a bit more code early in the process, but it is far more likely that you will be able to figure out what is going on in the system without having to hunt down the original programmer and kick their butt… sorry… figure out what they were thinking.

Poor naming standards

“That which we call a rose, by any other name would smell as sweet“

This quote from Romeo and Juliet by William Shakespeare sounds nice, and it is true from one angle. If everyone agreed that, from now on, a rose was going to be called dung, then we could get over it and it would smell just as sweet. The problem is that if, when building a database for a florist, the designer calls it dung and the client calls it a rose, then you are going to have some meetings that sound far more like an Abbott and Costello routine than a serious conversation about storing information about horticulture products.

Names, while a personal choice, are the first and most important line of documentation for your application. I will not get into all of the details of how best to name things here – it is a large and messy topic. What I want to stress in this article is the need for consistency. The names you choose are not just to enable you to identify the purpose of an object, but to allow all future programmers, users, and so on to quickly and easily understand how a component part of your database was intended to be used, and what data it stores. No future user of your design should need to wade through a 500-page document to determine the meaning of some wacky name.

Consider, for example, a column named, X304_DSCR. What the heck does that mean? You might decide, after some head scratching, that it means “X304 description”. Possibly it does, but maybe DSCR means discriminator, or discretizator?

Unless you have established DSCR as a corporate standard abbreviation for description, then X304_DESCRIPTION is a much better name, and one that leaves nothing to the imagination.

That just leaves you to figure out what the X304 part of the name means. On first inspection, to me, X304 sounds more like it should be data in a column rather than a column name. If I subsequently found that, in the organization, there was also an X305 and X306 then I would flag that as an issue with the database design. For maximum flexibility, data is stored in columns, not in column names.

Along these same lines, resist the temptation to include “metadata” in an object’s name. A name such as tblCustomer or colVarcharAddress might seem useful from a development perspective, but to the end user it is just confusing. As a developer, you should rely on being able to determine that a table name is a table name by context in the code or tool, and present to the users clear, simple, descriptive names, such as Customer and Address.

A practice I strongly advise against is the use of spaces and quoted identifiers in object names. You should avoid column names such as “Part Number” or, in Microsoft style, [Part Number], therefore requiring your users to include these spaces and identifiers in their code. It is annoying and simply unnecessary.

Acceptable alternatives would be part_number, partNumber or PartNumber. Again, consistency is key. If you choose PartNumber then that’s fine – as long as the column containing invoice numbers is called InvoiceNumber, and not one of the other possible variations.

Lack of documentation

I hinted in the intro that, in some cases, I am writing for myself as much as you. This is the topic where that is most true. By carefully naming your objects, columns, and so on, you can make it clear to anyone what it is that your database is modeling. However, this is only step one in the documentation battle. The unfortunate reality is, though, that “step one” is all too often the only step.

Not only will a well-designed data model adhere to a solid naming standard, it will also contain definitions on its tables, columns, relationships, and even default and check constraints, so that it is clear to everyone how they are intended to be used. In many cases, you may want to include sample values, where the need arose for the object, and anything else that you may want to know in a year or two when “future you” has to go back and make changes to the code.

NOTE: Where this documentation is stored is largely a matter of corporate standards and/or convenience to the developer and end users. It could be stored in the database itself, using extended properties. Alternatively, it might be maintained in the data modeling tools. It could even be in a separate data store, such as Excel or another relational database. My company maintains a metadata repository database, which we developed in order to present this data to end users in a searchable, linkable format. Format and usability are important, but the primary battle is to have the information available and up to date.

Your goal should be to provide enough information that when you turn the database over to a support programmer, they can figure out your minor bugs and fix them (yes, we all make bugs in our code!). I know there is an old joke that poorly documented code is a synonym for “job security.” While there is a hint of truth to this, it is also a way to be hated by your coworkers and never get a raise. And no good programmer I know of wants to go back and rework their own code years later. It is best if the bugs in the code can be managed by a junior support programmer while you create the next new thing. Job security along with raises is achieved by being the go-to person for new challenges.

One table to hold all domain values

“One Ring to rule them all and in the darkness bind them“

This is all well and good for fantasy lore, but it’s not so good when applied to database design, in the form of a “ruling” domain table. Relational databases are based on the fundamental idea that every object represents one and only one thing. There should never be any doubt as to what a piece of data refers to. By tracing through the relationships, from column name, to table name, to primary key, it should be easy to examine the relationships and know exactly what a piece of data means.

The big myth perpetrated by architects who don’t really understand relational database architecture (me included early in my career) is that the more tables there are, the more complex the design will be. So, conversely, shouldn’t condensing multiple tables into a single “catch-all” table simplify the design? It does sound like a good idea, but at one time giving Pauly Shore the lead in a movie sounded like a good idea too.

For example, consider the following model snippet where I needed domain values for:

Customer CreditStatus

Customer Type

Invoice Status

Invoice Line Item BackOrder Status

Invoice Line Item Ship Via Carrier

On the face of it that would be five domain tables…but why not just use one generic domain table, like this?
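The diagram is missing from this copy, but such a “catch-all” table might be sketched like this (all names here are hypothetical):

```sql
-- One generic table holding every domain's values, keyed back
-- to the table/column it "belongs" to via text columns
CREATE TABLE GenericDomain
(
    GenericDomainId int          NOT NULL PRIMARY KEY,
    RelatedToTable  sysname      NOT NULL,  -- e.g. 'Customer'
    RelatedToColumn sysname      NOT NULL,  -- e.g. 'CustomerTypeId'
    Code            varchar(20)  NOT NULL,
    Description     varchar(100) NOT NULL
);
```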

This may seem a very clean and natural way to design a table for all domain values, but the problem is that it is just not very natural to work with in SQL. Say we just want the domain values for the Customer table:
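The original query is not reproduced here, but assuming a single catch-all table with RelatedToTable and RelatedToColumn filter columns (illustrative names), fetching one domain would look something like:

```sql
-- Fetching just one domain requires filtering on metadata
-- stored as data; the table and column names are hypothetical
SELECT GenericDomainId, Code, Description
FROM   GenericDomain
WHERE  RelatedToTable  = 'Customer'
  AND  RelatedToColumn = 'CustomerTypeId';
```

Every join to this table has to carry those extra string predicates along, which is where the “far from natural” complaint comes from.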

As you can see, this is far from being a natural join. It comes down to the problem of mixing apples with oranges. At first glance, domain tables are just an abstract concept of a container that holds text. And from an implementation-centric standpoint, this is quite true, but it is not the correct way to build a database. In a database, the process of normalization, as a means of breaking down and isolating data, takes every table to the point where one row represents one thing. And each domain of values is a distinctly different thing from all of the other domains (unless it is not, in which case the one table will suffice). So what you do, in essence, is normalize the data on each usage, spreading the work out over time, rather than doing the task once and getting it over with.

So instead of the single table for all domains, you might model it as:
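The model diagram is missing from this copy; two of the five separate domain tables might be sketched as follows, with the rest following the same pattern (names are illustrative):

```sql
-- One small table per domain; each row represents exactly one thing
CREATE TABLE CustomerType
(
    CustomerTypeId int          NOT NULL PRIMARY KEY,
    Code           varchar(20)  NOT NULL UNIQUE,
    Description    varchar(100) NOT NULL
);

CREATE TABLE ShipViaCarrier
(
    ShipViaCarrierId int          NOT NULL PRIMARY KEY,
    Code             varchar(20)  NOT NULL UNIQUE,
    Description      varchar(100) NOT NULL
);
```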

Looks harder to do, right? Well, it is initially. Frankly it took me longer to flesh out the example tables. But, there are quite a few tremendous gains to be had:

Data can be validated using foreign key constraints very naturally, something not feasible for the other solution unless you implement ranges of keys for every table – a terrible mess to maintain.

If it turns out that you need to keep more information about a ShipViaCarrier than just the code, ‘UPS’, and description, ‘United Parcel Service’, then it is as simple as adding a column or two. You could even expand the table to be a full blown representation of the businesses that are carriers for the item.

All of the smaller domain tables will fit on a single page of disk. This ensures a single read (and likely a single page in cache). In the other case, you might have your domain table spread across many pages, unless you cluster on the referring table name, which then could make it more costly to use a non-clustered index if you have many values.

You can still have one editor for all rows, as most domain tables will likely have the same base structure/usage. And while you would lose the ability to query all domain values in one query easily, why would you want to? (A union query of the tables could easily be created if needed, but this would seem an unlikely need.)

I should probably rebut the thought that might be in your mind. “What if I need to add a new column to all domain tables?” For example, you forgot that the customer wants to be able to do custom sorting on domain values and didn’t put anything in the tables to allow this. This is a fair question, especially if you have 1000 of these tables in a very large database. First, this rarely happens, and when it does it is going to be a major change to your database either way.

Second, even if this became a task that was required, SQL has a complete set of commands that you can use to add columns to tables, and using the system tables it is a pretty straightforward task to build a script to add the same column to hundreds of tables all at once. That will not be as easy a change, but it will not be so much more difficult as to outweigh the large benefits.
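As a sketch of the catalog-driven approach, the following generates the ALTER TABLE statements for every table matching a naming pattern (the `%Domain` convention and the SortOrder column are assumptions for illustration):

```sql
-- Generate one ALTER TABLE statement per domain table, to be
-- reviewed and then executed; the naming pattern is hypothetical
SELECT 'ALTER TABLE ' + QUOTENAME(SCHEMA_NAME(schema_id))
       + '.' + QUOTENAME(name)
       + ' ADD SortOrder int NULL;'
FROM   sys.tables
WHERE  name LIKE '%Domain';
```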

The point of this tip is simply that it is better to do the work upfront, making structures solid and maintainable, rather than trying to attempt to do the least amount of work to start out a project. By keeping tables down to representing one “thing” it means that most changes will only affect one table, after which it follows that there will be less rework for you down the road.

Using identity/guid columns as your only key

First Normal Form dictates that all rows in a table must be uniquely identifiable. Hence, every table should have a primary key. SQL Server allows you to define a numeric column as an IDENTITY column, and then automatically generates a unique value for each row. Alternatively, you can use NEWID() (or NEWSEQUENTIALID()) to generate a random, 16-byte unique value for each row. These types of values, when used as keys, are known as surrogate keys. The word surrogate means “something that substitutes for” and in this case, a surrogate key should be the stand-in for a natural key.

The problem is that too many designers use a surrogate key column as the only key column on a given table. The surrogate key values have no actual meaning in the real world; they are just there to uniquely identify each row.

Now, consider the following Part table, whereby PartID is an IDENTITY column and is the primary key for the table:

PartID   PartNumber   Description
------   ----------   -----------
1        XXXXXXXX     The X part
2        XXXXXXXX     The X part
3        YYYYYYYY     The Y part

How many rows are there in this table? Well, there seem to be three, but are rows with PartIDs 1 and 2 actually the same row, duplicated? Or are they two different rows that should be unique but were keyed in incorrectly?

The rule of thumb I use is simple. If a human being could not pick which row they want from a table without knowledge of the surrogate key, then you need to reconsider your design. This is why there should be a key of some sort on the table to guarantee uniqueness, in this case likely on PartNumber.

In summary: as a rule, each of your tables should have a natural key that means something to the user, and can uniquely identify each row in your table. In the very rare event that you cannot find a natural key (perhaps, for example, a table that provides a log of events), then use an artificial/surrogate key.
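Applied to the Part example, enforcing the natural key alongside the surrogate is a one-line change (the constraint name is illustrative):

```sql
-- Keep PartID as the surrogate primary key, but also guarantee
-- that no two rows can share a PartNumber
ALTER TABLE Part
    ADD CONSTRAINT AKPart_PartNumber UNIQUE (PartNumber);
```

With this in place, the duplicate rows 1 and 2 from the table above could never have been entered in the first place.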

Not using SQL facilities to protect data integrity

All fundamental, non-changing business rules should be implemented by the relational engine. The base rules of nullability, string length, assignment of foreign keys, and so on, should all be defined in the database.

There are many different ways to import data into SQL Server. If your base rules are defined in the database itself, you can guarantee that they will never be bypassed, and you can write your queries without ever having to worry whether the data you’re viewing adheres to the base business rules.

Rules that are optional, on the other hand, are wonderful candidates to go into a business layer of the application. For example, consider a rule such as this: “For the first part of the month, no part can be sold at more than a 20% discount, without a manager’s approval”.

Taken as a whole, this rule smacks of being rather messy, not very well controlled, and subject to frequent change. For example, what happens when next week the maximum discount is 30%? Or when the definition of “first part of the month” changes from 15 days to 20 days? Most likely you won’t want to go through the difficulty of implementing these complex temporal business rules in SQL Server code – the business layer is a great place to implement rules like this.

However, consider the rule a little more closely. There are elements of it that will probably never change. For example:

The maximum discount it is ever possible to offer

The fact that the approver must be a manager

These aspects of the business rule very much ought to be enforced by the database and the design. Even if the substance of the rule is implemented in the business layer, you are still going to have a table in the database that records the size of the discount, the date it was offered, the ID of the person who approved it, and so on. On the Discount column, you should have a CHECK constraint that restricts the values allowed in this column to between 0.00 and 0.90 (or whatever the maximum is). Not only will this implement your “maximum discount” rule, but it will also guard against a user entering a 200% or a negative discount by mistake. On the ManagerID column, you should place a foreign key constraint, which references the Managers table and ensures that the ID entered is that of a real manager (or, alternatively, a trigger that selects only EmployeeIds corresponding to managers).
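Concretely, the two constraints might be declared like this (the table and constraint names are hypothetical):

```sql
-- Enforce the hard limits of the rule in the database itself
ALTER TABLE DiscountApproval
    ADD CONSTRAINT CHKDiscountApproval_Discount
        CHECK (Discount BETWEEN 0.00 AND 0.90);

-- Only a real manager can appear as the approver
ALTER TABLE DiscountApproval
    ADD CONSTRAINT FKDiscountApproval_Manager
        FOREIGN KEY (ManagerId) REFERENCES Manager (ManagerId);
```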

Now, at the very least we can be sure that the data meets the very basic rules that the data must follow, so we never have to code something like this in order to check that the data is good:
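The original code sample is not reproduced in this copy, but the kind of defensive re-checking that the constraints make unnecessary might look like this sketch (names are illustrative):

```sql
-- Without constraints, every consumer would have to re-validate;
-- with them, this query can never return a row
SELECT DiscountApprovalId, Discount
FROM   DiscountApproval
WHERE  Discount < 0.00
   OR  Discount > 0.90;
```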


We can feel safe that data meets the basic criteria, every time.

Not using stored procedures to access data

Stored procedures are your friend. Use them whenever possible as a method to insulate the database layer from the users of the data. Do they take a bit more effort? Sure, initially, but what good thing doesn’t take a bit more time? Stored procedures make database development much cleaner, and encourage collaborative development between your database and functional programmers. A few of the other interesting reasons that stored procedures are important include the following.

Maintainability

Stored procedures provide a known interface to the data, and to me, this is probably the largest draw. When code that accesses the database is compiled into a different layer, performance tweaks cannot be made without a functional programmer’s involvement. Stored procedures give the database professional the power to change characteristics of the database code without additional resource involvement, making small changes, or large upgrades (for example changes to SQL syntax) easier to do.

Encapsulation

Stored procedures allow you to “encapsulate” any structural changes that you need to make to the database so that the knock on effect on user interfaces is minimized. For example, say you originally modeled one phone number, but now want an unlimited number of phone numbers. You could leave the single phone number in the procedure call, but store it in a different table as a stopgap measure, or even permanently if you have a “primary” number of some sort that you always want to display. Then a stored proc could be built to handle the other phone numbers. In this manner the impact to the user interfaces could be quite small, while the code of stored procedures might change greatly.
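The phone number scenario might be sketched as follows, where the procedure’s interface is unchanged even though storage moved to a separate table (all object names here are hypothetical):

```sql
-- The caller still passes a single phone number; the procedure
-- hides the fact that numbers now live in their own table
CREATE PROCEDURE Customer_Update
    @CustomerId  int,
    @Name        varchar(60),
    @PhoneNumber varchar(20)
AS
BEGIN
    UPDATE Customer
    SET    Name = @Name
    WHERE  CustomerId = @CustomerId;

    -- the "primary" number is maintained in the new table
    UPDATE CustomerPhoneNumber
    SET    PhoneNumber = @PhoneNumber
    WHERE  CustomerId = @CustomerId
      AND  IsPrimary = 1;
END;
```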

Security

Stored procedures can provide specific and granular access to the system. For example, you may have 10 stored procedures that all update table X in some way. If a user needs to be able to update a particular column in a table and you want to make sure they never update any others, then you can simply grant to that user the permission to execute just the one procedure out of the ten that allows them perform the required update.
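Granting that narrow access is a single statement (the procedure and principal names are hypothetical):

```sql
-- The user can run exactly this one update path and nothing else
GRANT EXECUTE ON Customer_UpdateCreditLimit TO SupportUser;
```

No direct UPDATE permission on the table is ever granted, so the other nine procedures, and ad hoc updates, remain off limits to that user.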

Performance

There are a couple of reasons that I believe stored procedures enhance performance. First, if a newbie writes ratty code (like using a cursor to go row by row through an entire ten million row table to find one value, instead of using a WHERE clause), the procedure can be rewritten without impact to the system (other than giving back valuable resources.) The second reason is plan reuse. Unless you are using dynamic SQL calls in your procedure, SQL Server can store a plan and not need to compile it every time it is executed. It’s true that in every version of SQL Server since 7.0 this has become less and less significant, as SQL Server gets better at storing plans for ad hoc SQL calls (see note below). However, stored procedures still make it easier for plan reuse and performance tweaks. In the case where ad hoc SQL would actually be faster, this can be coded into the stored procedure seamlessly.

In SQL Server 2005, there is a database setting (PARAMETERIZATION FORCED) that, when enabled, will cause all queries to have their plans saved. This does not cover more complicated situations that procedures would cover, but can be a big help. There is also a feature known as plan guides, which allow you to override the plan for a known query type. Both of these features are there to help out when stored procedures are not used, but stored procedures do the job with no tricks.

And this list could go on and on. There are drawbacks too, because nothing is ever perfect. It can take longer to code stored procedures than it does to just use ad hoc calls. However, the amount of time to design your interface and implement it is well worth it, when all is said and done.

Trying to code generic T-SQL objects

I touched on this subject earlier in the discussion of generic domain tables, but the problem is more prevalent than that. Every new T-SQL programmer, when they first start coding stored procedures, starts to think “I wish I could just pass a table name as a parameter to a procedure.” It does sound quite attractive: one generic stored procedure that can perform its operations on any table you choose. However, this should be avoided as it can be very detrimental to performance and will actually make life more difficult in the long run.

T-SQL objects do not do “generic” easily, largely because lots of design considerations in SQL Server have clearly been made to facilitate reuse of plans, not code. SQL Server works best when you minimize the unknowns so it can produce the best plan possible. The more it has to generalize the plan, the less it can optimize that plan.

Note that I am not specifically talking about dynamic SQL procedures. Dynamic SQL is a great tool to use when you have procedures that are not optimizable / manageable otherwise. A good example is a search procedure with many different choices. A precompiled solution with multiple OR conditions might have to take a worst case scenario approach to the plan and yield weak results, especially if parameter usage is sporadic.

However, the main point of this tip is that you should avoid coding very generic objects, such as ones that take a table name and twenty column names/value pairs as a parameter and lets you update the values in the table. For example, you could write a procedure that started out:
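The beginning of the original procedure listing is missing from this copy; a sketch of the kind of over-generic signature the text is warning against might start out like this (all names are illustrative):

```sql
-- The anti-pattern: one procedure that claims to update any table
CREATE PROCEDURE Generic_Update
    @TableName   sysname,
    @ColumnName1 sysname,
    @Value1      sql_variant,
    @ColumnName2 sysname,
    @Value2      sql_variant
    -- ...and so on, for twenty column/value pairs
AS
BEGIN
    -- the body has no choice but to assemble and EXEC dynamic SQL,
    -- forfeiting compile-time checking, plan reuse, and security
    DECLARE @sql nvarchar(max);
    -- ...
END;
```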

The idea would be to dynamically specify the name of a column and the value to pass to a SQL statement. This solution is no better than simply using ad hoc calls with an UPDATE statement. Instead, when building stored procedures, you should build specific, dedicated stored procedures for each task performed on a table (or multiple tables.) This gives you several benefits:

Properly compiled stored procedures can have a single compiled plan attached and reused.

Properly compiled stored procedures are more secure than ad-hoc SQL or even dynamic SQL procedures, reducing the surface area for an injection attack greatly because the only parameters to queries are search arguments or output values.

Testing and maintenance of compiled stored procedures is far easier, since you generally only have to verify the search arguments, rather than check that the tables, columns, and so on exist and handle the cases where they do not.

A nice technique is to build a code generation tool in your favorite programming language (even T-SQL) using SQL metadata to build very specific stored procedures for every table in your system. Generate all of the boring, straightforward objects, including all of the tedious code to perform error handling that is so essential, but painful to write more than once or twice.

In my Apress book, Pro SQL Server 2005 Database Design and Optimization, I provide several such “templates” (mainly for triggers, but also stored procedures) that have all of the error handling built in. I would suggest you consider building your own (possibly based on mine) to use when you need to manually build a trigger/procedure or whatever.

Lack of testing

When the dial in your car says that your engine is overheating, what is the first thing you blame? The engine. Why don’t you immediately assume that the dial is broken? Or something else minor? Two reasons:

The engine is the most important component of the car and it is common to blame the most important part of the system first.

It is all too often true.

As database professionals know, the first thing to get blamed when a business system is running slow is the database. Why? First because it is the central piece of most any business system, and second because it also is all too often true.

We can play our part in dispelling this notion, by gaining deep knowledge of the system we have created and understanding its limits through testing.

But let’s face it; testing is the first thing to go in a project plan when time slips a bit. And what suffers the most from the lack of testing? Functionality? Maybe a little, but users will notice and complain if the “Save” button doesn’t actually work and they cannot save changes to a row they spent 10 minutes editing. What really gets the shaft in this whole process is deep system testing to make sure that the design you (presumably) worked so hard on at the beginning of the project is actually implemented correctly.

But, you say, the users accepted the system as working, so isn’t that good enough? The problem with this statement is that what user acceptance “testing” usually amounts to is the users poking around, trying out the functionality that they understand and giving you the thumbs up if their little bit of the system works. Is this reasonable testing? Not in any other industry would this be vaguely acceptable. Do you want your automobile tested like this? “Well, we drove it slowly around the block once, one sunny afternoon with no problems; it is good!” When that car subsequently “failed” on the first drive along a freeway, or during the first drive through rain or snow, then the driver would have every right to be very upset.

Too many database systems get tested like that car, with just a bit of poking around to see if individual queries and modules work. The first real test is in production, when users attempt to do real work. This is especially true when it is implemented for a single client (even worse when it is a corporate project, with management pushing for completion more than quality).

Initially, major bugs come in thick and fast, especially performance related ones. If the first time you have tried a full production set of users, background processes, workflow processes, system maintenance routines, ETL, etc., is on your system launch day, you are extremely likely to discover that you have not anticipated all of the locking issues that might be caused by users creating data while others are reading it, or hardware issues caused by poorly set up hardware. It can take weeks to live down the cries of “SQL Server can’t handle it” even after you have done the proper tuning.

Once the major bugs are squashed, the fringe cases (which are pretty rare cases, like a user entering a negative amount for hours worked) start to raise their ugly heads. What you end up with at this point is software that irregularly fails in what seem like weird places (since large quantities of fringe bugs will show up in ways that aren’t very obvious and are really hard to find.)

Now, it is far harder to diagnose and correct because now you have to deal with the fact that users are working with live data and trying to get work done. Plus you probably have a manager or two sitting on your back saying things like “when will it be done?” every 30 seconds, even though it can take days and weeks to discover the kinds of bugs that result in minor (yet important) data aberrations. Had proper testing been done, it would never have taken weeks of testing to find these bugs, because a proper test plan takes into consideration all possible types of failures, codes them into an automated test, and tries them over and over. Good testing won’t find all of the bugs, but it will get you to the point where most of the issues that correspond to the original design are ironed out.

If everyone insisted on a strict testing plan as an integral and immutable part of the database development process, then maybe someday the database won’t be the first thing to be fingered when there is a system slowdown.

Summary

Database design and implementation is the cornerstone of any data-centric project (read: 99.9% of business applications) and should be treated as such when you are developing. This article, while probably a bit preachy, is as much a reminder to me as it is to anyone else who reads it. Some of the tips, like planning properly, using proper normalization, using strong naming standards and documenting your work, are things that even the best DBAs and data architects have to fight to make happen. In the heat of battle, when your manager's manager's manager is being berated for things taking too long to get started, it is not easy to push back and remind them that they pay you now, or they pay you later. These tasks pay dividends that are very difficult to quantify, because to quantify success you must fail first. And even when you succeed in one area, all too often other minor failures crop up in other parts of the project so that some of your successes don't even get noticed.

The tips covered here are ones that I have picked up over the years that have turned me from being mediocre to a good data architect/database programmer. None of them take extraordinary amounts of time (except perhaps design and planning) but they all take more time upfront than doing it the “easy way”. Let’s face it, if the easy way were that easy in the long run, I for one would abandon the harder way in a second. It is not until you see the end result that you realize that success comes from starting off right as much as finishing right.


Louis has been in the IT industry for over 20 years as a corporate database developer and data architect. Currently he is the Data Architect for CBN in Virginia Beach. Louis has been a Microsoft MVP since 2004, and is an active volunteer for PASS locally and globally. He is the author of a series of SQL Server Database Design books, most recently Pro SQL Server Relational Database Design and Implementation.

Reply to more comments
Thanks for the comments. So many of them are reasonable, which is great. I disagree with a few, but I understand your points too…

>>(I wouldn’t dare sell a product which relied on proprietary DB’s)<< >>stored procs are a pain, if not extinct.<<

I agree with you in essence about stored procedures, except for one thing. All database systems are different, each with its own minor quirks. Putting things into stored procedures may take more work, but it will pay off in the end.

I have more to say about procedures, after a few more replies…

>>Stored procedures are NOT a best practice. Implementing too much business logic in them will result in a poorly maintainable system. You can get better encapsulation by implementing a proper business model, and placing the business logic in there.<<

And the difference is? I have yet to see a well maintained middle layer of logic that made it reasonable to tune and maintain the DATA oriented code. I agree 100% that business logic shouldn’t be done in procedures, just data oriented code and constraints that deal with the ALWAYS rules. If a column must always be non-null, then the column is built as NOT NULL. If a column must always be > 100, then a constraint is applied. Whenever you build a function that builds a SQL Query, you are putting data logic in a difficult place. Just build a procedure with parameters.
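Those ALWAYS rules map directly onto declarative SQL facilities. A minimal T-SQL sketch, with table and column names invented purely for illustration:

```sql
-- Hypothetical table: the "ALWAYS" rules live in the table definition,
-- so no application code path can bypass them.
CREATE TABLE dbo.DiscountRule
(
    DiscountRuleId int IDENTITY(1,1) NOT NULL
        CONSTRAINT PKDiscountRule PRIMARY KEY,
    RuleName       varchar(100) NOT NULL,   -- must always have a value
    MinimumAmount  int NOT NULL
        CONSTRAINT CHKDiscountRule_MinimumAmount
            CHECK (MinimumAmount > 100)     -- must always be > 100
);
```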

>>Let’s face it. Most general purpose programming languages (GPL) are significantly more powerful than store procedures<<

This is the big lie. It is also very, very true, hence the confusion. T-SQL (and other relational languages) are not good programming languages for working with data individually. Anytime you have to work with one thing at a time, T-SQL is slow and clunky. On the other hand, where general-purpose programming languages fall apart is working with thousands of rows at once. All I advocate is using the right tools in the right place.

>>That’s not even getting into the mapping inconsistencies between your object model and procedure create. <<

This is generally a lack of creativity, where people try to build only procedures that match one to one with tables. With a well-normalized set of tables, a strict one-to-one mapping gets even messier.

>>Stored procedures are a leftover from 1990s architecture modeling. It's much more efficient to use ORM/data mapping tools to develop your code.<<

Efficient is the other challenging subject that comes up all of the time. Yes, it is more efficient in human terms to use tools to do your work. But the problem is when the system is in production and you have to manage it. Systems built on stored procedures are a dream to work with. This procedure is slow, well, let’s tune it.

When SQL is built in layers of application code, along with data protection code, you can spend hours debugging and tuning the application, just trying to locate the problem. And if you use a single row at a time mentality that many tools seem to encourage, then in a larger system you are going to be sifting through hundreds and hundreds of calls where one or two would have sufficed.

>>All the examples given to support NOT using stored procedures can be satisfied with non-stored-procedure solutions.<<

Sure, I don’t dispute that stored procedures are not a requirement. Just because something can be done adequately in one manner doesn’t mean that something else is better. You can build a house with your bare hands if you try hard enough, but with a hammer and a saw it is easier. That is not a perfect analogy, but the point is, stored procedures, when done right, are far easier to work with than is code embedded in the application.

>>Performance-wise, wouldn't it be better to do this at the application level? Application languages like C# would outperform T-SQL, wouldn't they? <<

It depends on what the code does. Advocating stored procedures is not the same as advocating client-server applications. I don't want to go back to the elaborate stored procedures we once wrote to do everything in the database 10+ years ago. All I advocate is that where your application might build a string like:

SELECT column1, column2
FROM table
WHERE column3 = 2

A stored procedure could be built such that the call is simply:

table$retrieveAll @column3 = 2

Now, if I need to optimize that call somehow, it is accessible. It is easier to build each of these things in their own proper languages, instead of building the SQL code on the fly. I mean, if you are a good C programmer (any flavor) you probably clear all of your warnings before shipping, rather than generating part of the code on the fly when the client pushes a button. Why not do the same for the data-oriented code? Procedures are just another layer of abstraction enabling data programming to go on using data-oriented tools.
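For completeness, here is a sketch of what the wrapper procedure behind that call might look like. The table and column names come from the placeholder query above, and the body itself is an assumption:

```sql
-- Sketch of the wrapper procedure for the example call above.
CREATE PROCEDURE dbo.table$retrieveAll
    @column3 int
AS
BEGIN
    SET NOCOUNT ON;

    -- The same query the application would otherwise build as a string,
    -- now tunable in one place without touching application code.
    SELECT column1, column2
    FROM   dbo.[table]
    WHERE  column3 = @column3;
END;
```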

You should use the general-purpose languages for anything that is not set-based in nature. For example, if column2 needs to be displayed in all uppercase, or a date formatted as YYYYMMDD, that is best done in the application code.

Look, it is not that database people want to take over and do all of the work. Not at all. It is just that when the rubber hits the road and the application is running slowly, who are you going to call? The DBA. And they will be charged with discovering what is going on. If something is wrong in the data layer, the DBA can change things and get things up and running in no time.

>>There is some debate in my project on naming conventions. One thing that I'd like to get your perspective on: is there any reason why one should not have an underscore in names? <<

Good luck. Naming conventions are freaking annoying to argue about, because there are no clear answers. I have actually started to like underscores again because they are far more readable and easily parsed by tools. I have no good answer for you 🙂

>>indexing on columns that have complex data and/or data types is not a good idea. sql will always retrieve records quicker when searching on an indexed field that is of a simple type like an IDENT key field. Yes or No?<<

To call this an oversimplification is a disservice to the word. I more or less agree with you that the simplest pattern is to always just use an identity column for the key. And if you are always fetching things by the key (like so many ORM-style solutions would, though likely with a GUID) then this is going to be your best bet. But there is a lot of debate about this. For examples, check out Greg Linwood's blog:
blogs.sqlserver.org.au/blogs/greg_linwood (or search google with this link: http://www.google.com/search?hl=en&q=clustered+site%3Ablogs.sqlserver.org.au%2Fblogs%2Fgreg_linwood) He has an interesting series of posts about why clustered indexes may not be the best thing since sliced bread.

amarjitsinghkullar

Amarjit Singh Kullar
This information about database design is very useful for a beginner like me; it helps to make a good kick start. Technical material like this creates interest in beginners to learn more and do more in this field of technology.


Anonymous

One minor disagreement
You say

“In summary: as a rule, each of your tables should have a natural key that means something to the user, and can uniquely identify each row in your table. In the very rare event that you cannot find a natural key (perhaps, for example, a table that provides a log of events), then use an artificial/surrogate key.”

I would almost agree with that, but I would suggest that using an Identity column and another “person readable” unique key can provide benefits.

For example, Supplier A changes their part number for Part X. Now when you order Part X you need to quote their new part number. Because the foreign keys reference the identity column, the new unique part number can be included in all purchase orders.

The problem comes when that is the only key, not in the use of identity keys, one of the reasons being the one you stated about changing the natural key.
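The part-number scenario above can be sketched in T-SQL (all object names are invented): foreign keys reference the surrogate, so the natural key is free to change.

```sql
CREATE TABLE dbo.Part
(
    PartId             int IDENTITY(1,1) NOT NULL
        CONSTRAINT PKPart PRIMARY KEY,
    SupplierPartNumber varchar(30) NOT NULL
        CONSTRAINT AKPart_SupplierPartNumber UNIQUE  -- the natural key; may be updated
);

CREATE TABLE dbo.PurchaseOrderLine
(
    PurchaseOrderLineId int IDENTITY(1,1) NOT NULL
        CONSTRAINT PKPurchaseOrderLine PRIMARY KEY,
    PartId              int NOT NULL
        CONSTRAINT FKPurchaseOrderLine$Part REFERENCES dbo.Part (PartId)
);

-- Supplier A renames Part X; existing order rows still point at the same PartId.
UPDATE dbo.Part
SET    SupplierPartNumber = 'A-9923'
WHERE  SupplierPartNumber = 'A-1187';
```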

Anonymous

RE: One minor disagreement
If you read that I don’t like surrogate (identity) primary keys then I might need to make some edits. The point of the section was that you needed to *at least* have a natural key. I always set the primary key on every table that I create to be an integer, usually set by the identity property.

The problem comes when that is the only key, not in the use of identity keys, one of the reasons being the one you stated about changing the natural key.

Thanks for reading!

ASdanz

Surrogate Keys
I, too, initially misread your opinion of identities as surrogate keys. I read the title of that section as:

Using identity/guid columns as your only [type of primary] key.

I completely agree on the need for stable keys. Most of the time, identities are much easier to implement and manage, although purists would say that you are shirking your duty as a DBA by not finding a stable natural key.

Thanks for the article.

Anonymous

except stored procedures:-)
Fully agree on all of it except stored procedures. SPs are a good practice, but they do have shortcomings:

1. They can't be ported easily. If your application must work against another DB vendor, stay away from SPs.
2. You must sometimes use dynamic SQL. Complex dynamic queries are much easier to build in another language (or use a .NET SP).
3. They can't receive arrays of parameters (XML parameters don't count 🙂 ).
4. They are awkward for some updates. If you need to update only one field but do not know which one in advance, binding 40 parameters when you need one plus the PK is a bit stupid and quite inefficient.

I DO recommend stored procedures, but they are not always a best practice

Anonymous

Good Article, Nice Use of Quotes
The one from Lord of the Rings is the best. I have a client that has bought software built around the "One Ring" domain idea, and not just for domains, but for all the business data, too.

Anonymous

Column names
I disagree on your assertion that column names should be end-user friendly. On applications that must be prepared to be translated to several languages (and most commercial applications should be), displaying the column name directly to the user isn’t an option, and so names that are more meaningful to the developer can and should be used.

Anonymous

More on surrogate keys
I think the problem with "Using identity/guid columns as your only key" is that your described mistake isn't possible. The candidate keys (the full list of possible keys on the table, including the natural keys, for those who aren't familiar with the term) are always going to be keys. You can't forget to make them keys; they just are. You should rename the section "Not putting at least a unique constraint on all candidate keys" or something like that. Once you have the title right, the rewrite should be easier. Oh, don't forget that the other common database design mistake is NOT using a surrogate key as the primary key.
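In other words (a hypothetical sketch with invented names): the surrogate is the primary key, but every candidate key still gets a declared UNIQUE constraint.

```sql
CREATE TABLE dbo.Customer
(
    CustomerId     int IDENTITY(1,1) NOT NULL
        CONSTRAINT PKCustomer PRIMARY KEY,           -- surrogate primary key
    CustomerNumber char(10) NOT NULL
        CONSTRAINT AKCustomer_CustomerNumber UNIQUE  -- candidate (natural) key,
                                                     -- so duplicates are impossible
);
```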

Anonymous

Use of identity columns in foreign key references
Outstanding article. Most of your guidelines are indisputable.

I often see IDENTITY columns used improperly and I’d be interested to get your thoughts. I was chief architect for Sybase back in the early 1990s when we first added identity columns to SQL Server, so I have a bit of history with this topic.

At the time, we added identity columns to SQL Server (this was before the Microsoft-Sybase divorce when Sybase did all of the development on SQL Server) because Oracle had a counterpart called ROWID columns. We couldn’t call them ROWID columns because, as you probably know, SQL Server moves rows around from page to page whereas Oracle doesn’t. Hence the row id in SQL Server is constantly changing, even when you do certain types of UPDATE statements. In short, we added IDENTITY columns because we needed to be feature-compatible with Oracle and because application vendors like Peoplesoft and SAP demanded it.

Our design guideline at Sybase was that you only use IDENTITY columns when you truly do not care what the value in that field is. That means that if you BCP the contents of the table from one server to another and all the values in that table end up different, you’re okay with it. Most of the time, that’s not the case, particularly when there are foreign key references to the column on which you used IDENTITY.

What happens if you use BCP to move data between databases? BCP doesn’t override the identity column which means that all the values might change and your foreign key references will all be broken.

The “purist” design approach is to use IDENTITY columns as a primary key but don’t allow foreign key references to it. Make all foreign key references to the “natural” key and use the identity column as a surrogate.

Anonymous

Nuances
Great article! I fully agree with most of what you're saying. Depending on the system you're designing and the environment you're designing for, some nuances might be in order of course, but that doesn't matter.

The point about normalization is a good one, but sometimes it's necessary to denormalize a little again for performance's sake.

In reply to a poster before me, the argument for not using IDENTITY columns as a primary key is not valid. Never BCP into "live" tables! Always use an intermediate table to BCP into (preferably with varchar columns), then process and check(!) the data before inserting into the final target table. You can use the natural key to do that. If there's really no way around BCP-ing into a "live" table, then use the SET IDENTITY_INSERT ON statement (but be very, very careful).
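A sketch of the staging pattern described above, with invented names. TRY_CAST requires SQL Server 2012 or later, and the target dbo.Customer table (with an identity CustomerId column) is assumed to exist:

```sql
-- Loose staging table: all varchar, no constraints, as the comment suggests.
CREATE TABLE dbo.CustomerStage
(
    CustomerId     varchar(20) NULL,
    CustomerNumber varchar(20) NULL
);

-- The BCP load itself happens outside T-SQL, e.g.:
--   bcp YourDb.dbo.CustomerStage in customers.dat -c -T

-- Check the rows, then move them to the live table, preserving identity values.
SET IDENTITY_INSERT dbo.Customer ON;

INSERT INTO dbo.Customer (CustomerId, CustomerNumber)
SELECT CAST(CustomerId AS int), CustomerNumber
FROM   dbo.CustomerStage
WHERE  TRY_CAST(CustomerId AS int) IS NOT NULL;  -- reject non-numeric keys

SET IDENTITY_INSERT dbo.Customer OFF;
```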

BugEyedMonster

One addition to naming conventions
I will add my personal pet peeve to the naming conventions; in part it is covered by "don't name things table, varchar, etc."

Imagine a database table called servletDatabase. That's a good name, right? WRONG! The conversation went something like this:
P1: That data is stored in servletDatabase
Me: Which server has the servlet database?
P1: It is in the production database.
Me: (Thinking, a database in a database?) Hey I found a servlet table, but it doesn’t have the data I want.
P1: The data is in the table servletDatabase.
Me: Of course, I should have realized that.(NOT!)

Since databases are made up of multiple components, servers, databases, tables, columns, I would expand the rule to include don’t name two or more different components the same name. Why? What happens when you have a server named playpen with a database named playpen and you have to rehost the playpen database, or when someone asks for a new server so they can get another database besides playpen.

Finally can you imagine telling someone to look in the name column in name database, running on the name server?

Anonymous

“purist”
What is the definition of a “purist” anyway? The definition of a “purist” that I know of says that the foreign key should never be based on natural keys, only surrogate keys. That is because natural keys mean something in the real world and therefore can change.

As for having to BCP data around, that is when you switch to a GUID (or equivalent) surrogate key which is designed to be universally unique.

Anonymous

Saving Problem
Thank you for this great subject.
I have a project with two tables:
1. Customers
2. Orders
My problem is that when I try to display the data on the form in "details" view, I can edit and save data in the main Customers table, but the related Orders table is not responding. It removes any new data I try to enter in any control, such as a textbox.

If the data is displayed in a DataGridView for the Orders table, it works; I can add a new record, but whenever I try to edit anything I get an "Update failed" message.

>>must sometimes use dynamic sql<<
true enough, but it still makes performance tuning easier if you know what might be dynamically generated. CLR proc is a good idea here.

>>awkward for some updates<<
The only reason I slightly disagree with you there is that it takes longer to write the middle tier code to determine what has changed rather than just send all columns. And it usually isn’t that expensive to update extra columns when setting the value to the same thing it was.

2. Column names

>>and so names that are more meaningful to the developer can and should be used<<
I can see that, sort of, but all of the gobbledygook that gets added really has no added value. TblFred? ColNameString? And do you include the length too? It is too easy to query the metadata to waste time adding metadata to the name, in my opinion.

3. More on surrogate keys

>>The candidate keys (the full list of possible keys on the table including the natural keys for those who aren’t familiar with the term) are always going to be keys.<<

You are being ironically naive. Too, too many times people just have an identity value, make it the primary key, and they are done. So many forum posts start: "I have duplicate key values…"

>>Oh, don’t forget that the other common database design mistake is NOT using a surrogate key as the primary key.<<

I disagree here. You can choose to use natural keys for primary keys if you want; there is nothing wrong with it. I personally favor identities, but there is nothing wrong with natural keys at all.

4. Use of identity columns in foreign key references

>>Our design guideline at Sybase was that you only use IDENTITY columns when you truly do not care what the value in that field is. <<

Amen to that my friend.

>>The “purist” design approach is to use IDENTITY columns as a primary key but don’t allow foreign key references to it. Make all foreign key references to the “natural” key and use the identity column as a surrogate.<<

In SQL Server, we can keep identity values when moving data around. But the question to me isn’t about identity values, but of using surrogates. I want the FK’s to be meaningless too, because it makes joins faster since relationships are built on small values rather than larger ones. And the PK being a clustered key helps joins out too.

5. “purist”

>>The definition of a “purist” that I know of says that the foreign key should never be based on natural keys, only surrogate keys.<<

I would disagree here. I would expect a lot of “purist”s to cringe at the idea of using surrogates at all (I know of a certain purist that would, for sure). I agree with you that I never like keys with values that can change.

6. One addition to naming conventions

>>Since databases are made up of multiple components, servers, databases, tables, columns, I would expand the rule to include don’t name two or more different components the same name. <<

That is a good idea, but I think that for the most part this is not a problem if you have named your objects to match what they are. Also, I usually speak of things in terms of what they are.

>>Finally can you imagine telling someone to look in the name column in name database, running on the name server?<<

Even if it did, the way you put it makes it unambiguous what you mean. Is this likely to happen? Probably not, since you will be naming your objects in a manner that makes sense, and it is unlikely that the database, table, and column would all have the same name.

My last book had a magnificent crew working behind the scenes, which included the editor of this site. He was simply excellent to work with, which is why I came back here.

As for the face remark… yeah, that is something I have to work on (I wish it was just the face…). I have become obsessed with the home office of late and spend too much time sitting when I should be out doing (like right now 🙂).

Anonymous

Excellent Article
I have been studying normalization for some time now and have been fighting the concept. A search for a clearer understanding is what brought me here.

Now, I am a newbie, and a lot of the stuff you are talking about above and beyond normalization is over my head, but I have to compliment you on a fantastic article. I have gained a clearer understanding with regard to surrogate keys; something I never really understood. I always wondered why one couldn't use a part number instead of "some random key". I was obviously taught incorrectly. It makes perfect sense.

It is always easier to comprehend something when you back it up with a DETAILED example as you have done.

I love developing databases and hope to be technically where you are one day. Thank you for a great article and sharing your knowledge.

-Frank

Anonymous

Books
After reading a bit more about you, I see that you have authored some books on SQL Server. I was wondering if you have written anything on design techniques. Microsoft is not my game but I could sure use your design knowledge.

Thanks again,

-Frank
captonline at yahoo

Anonymous

Excellent!
Nice column. Particularly like the advice contained in "One table to hold all domain values". Been in the position where people have done this and would not be persuaded to see the light!

Anonymous

Great!
Great! I just do not agree with the 8th mistake. But that's a personal preference about database stored procedures.
Congrats!

Anonymous

Brilliant article
Good argument on all rules and I generally agree with all of them. Now all we need to do is use them 😉

You’ve Inspired me!
I write web spiders for aggregating content (Web 2.0 – Displaying the same data in a different way). My tables can get very large the longer the spider runs. I StumbledUpon your article and the “thought bubble” caught my eye, so I started reading. Fantastic stuff, after reading your article, I’m now going to go back and fix my databases and try to use good practices from now on. Thanks!

Anonymous

Thank you for info
I am no whiz kid on databases; your article is great as it opens my eyes to some of the mistakes in database design. Information like this is invaluable to us humble humans who are not too technical.
Thank you

Anonymous

Normalization
I have sat through two graduate-level classes in Database and Object Oriented Development. In most instances, I was led to believe that with today's computing performance the idea of normalization is not as critical.

It appears that you do not agree. Do we really need to be overly concerned with normalization?

The SQL Server Thought Police

re: Normalisation
Yes! Goodness me Yes!

Anonymous

Generally, a quite good article, but…
“As a developer, you should rely on being able to determine that a table name is a table name by context in the code or tool…”

But in fact you can not. Consider this (frequent) scenario. To investigate the impact of a proposed change, you need to review the documentation (not just the code!) for references to table Customer. Not surprisingly, your documentation (external and internal) contains the word Customer, say, 4,000 times. (How long will it take to read those?) But if your table were named tblCustomer instead, the documentation might contain that word just 58 times, likely all of which are relevant.

For decades we have known that maintenance is 70-80% of the lifetime expense of software. Thoughtful planning should minimize the expense of using documentation.

<< On the ManagerID column, you should place a foreign key constraint … >>

Unless that Manager table contains historical data, this leaves you open to problems. E.g., what happens when a manager recorded in the “table in the database that records the size of the discount” is demoted? (But the use of the trigger sounds good.)

The article would also be better by mentioning (formal/informal) code reviews as an adjunct to testing.

Also, edit “(manly [sic] for triggers, abut [sic] also …”.

Anonymous

this article
THIS ARTICAL STINKS LIKE ROTTEN EGGS!

Anonymous

Identity pk, not a good idea?
I can't agree with the point about identities as primary keys. One of the greatest problems when developing applications is that when the client wants to modify some field that is part of the primary key, you need to spend many hours making that change. I think the best approach is to ALWAYS use an identity, or similar (Oracle/PostgreSQL sequences over a numeric field), and to always use a unique index to restrict duplicates.
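That pattern, sketched in T-SQL with invented names: an identity primary key for stability, plus a unique constraint on the user-visible natural key to stop duplicates.

```sql
CREATE TABLE dbo.Product
(
    ProductId   int IDENTITY(1,1) NOT NULL
        CONSTRAINT PKProduct PRIMARY KEY,  -- never changes; all FKs point here
    ProductCode varchar(20) NOT NULL,
    Revision    char(2)     NOT NULL,
    CONSTRAINT AKProduct UNIQUE (ProductCode, Revision)  -- blocks duplicates
);
-- If the client later redefines ProductCode, only this column and its unique
-- constraint change; every foreign key still references ProductId.
```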

Anonymous

More replies from the author…
Thanks for almost all of the comments 🙂

To: Normalization
>>I have sat through two graduate-level classes in Database and Object Oriented Development. In most instances, I was led to believe that with today's computing performance the idea of normalization is not as critical. <<

Mercy. Normalization is just as important as it has ever been, and because of today's computing performance it is much easier to implement. Normalization is nothing more than forming the data in the way that is natural to work with in SQL, eliminating redundancies that you would otherwise have to maintain. Performance-wise, you can expend energy keeping every redundancy straight, or you can expend energy calculating what you need when you need it.

I also should make the point that while I completely believe in a normalized database structure, what is far more important is a normalized design. If you understand the denormalizations you make, then it won’t kill you. The problem comes in when you have structures that are not optimized to use in a relational database query processor and you have to do unnatural acts to get things to work, or you miss conditions that are super-important to your client…

To: Anonymous
>>To investigate the impact of a proposed change, you need to review the documentation (not just the code!) for references to table Customer.<< >>if your table were named tblCustomer instead, the documentation might contain that word just 58 times, likely all of which are relevant. <<

An interesting point. I would probably suggest a reasonable documentation pattern such as using [customer], or something like this, perhaps Customer table…

>><< On the ManagerID column, you should place a foreign key constraint … >>

Unless that Manager table contains historical data, this leaves you open to problems. E.g., what happens when a manager recorded in the “table in the database that records the size of the discount” is demoted? (But the use of the trigger sounds good.)<<

Good point. And truthfully, it would have been better to use EmployeeId there and have a check constraint/trigger to make sure that the person was a valid approver for a discount.

>>The article would also be better by mentioning (formal/informal) code reviews as an adjunct to testing.<<

Good point, and I can’t disagree, though technically code reviews serve a far different purpose than the sort of testing I was meaning. Code reviews are an essential part of the process so you have good coding practices, but the kind of testing I was referring to was making sure that the data is correct, which is more or less impractical in a code review.

To: this article
>>THIS ARTICAL STINKS LIKE ROTTEN EGGS! <<

Wow… Can you be more specific? Like exactly what an Artical is?

To: Identity pk, not a good idea?

>>I can't agree with the point about identities as primary keys<< >>I think that the best approach is to ALWAYS use an identity<<

I can only assume you are replying to another commenter, because I never design a table without an identity (or guid, if forced) primary key.

Anonymous

Disagree with Stored Procs
With ORM tools rife, ActiveRecord patterns becoming popular, and database independence becoming more important now than ever (I wouldn’t dare sell a product which relied on proprietary DB’s), stored procs are a pain, if not extinct.

Anonymous

One table…
Does this issue illustrate the differences between application and database programming? For an application, performance is a direct function of code flexibility and efficiency, because data is given to the application at the moment of use. For databases, however, the data is already present, so code that anticipates the user’s request gives higher performance.

Anonymous

Good advice except for..
I agree with most of the points except…
Stored procedures are NOT a best practice. Implementing too much business logic in them will result in a poorly maintainable system. You can get better encapsulation by implementing a proper business model, and placing the business logic in there.

Let’s face it. Most general purpose programming languages (GPL) are significantly more powerful than stored procedures, and because of that you will create different architectures that can handle more requirements than you would with SPs. With this freedom you’ll express much more complicated logic more succinctly in them, and get greater flexibility, more maintainability, and just as good performance. Most GPLs are OO and they have objects for encapsulation, inheritance and components for reuse, and offer much more expressiveness through dynamic binding.

That’s not even getting into the mapping inconsistencies between your object model and procedure create.

Stored procedures are a leftover from 1990s architecture modeling. It’s much more efficient to use ORM/data-mapping tools to develop your code.

Anonymous

Underscores in names?
Hi, thank you so much for such an informative article!

There is some debate in my project on naming conventions. One thing that I’d like to get your perspective on: is there any reason why one should not have an underscore in names?

Above, you mention that “part_number”, for example, is acceptable. I’m inclined to think that this makes things much more readable, especially when users need to go directly to the tables (no front-end app yet)… but what reason might someone have for saying that underscores make future development difficult?

Much thanks!

Anonymous

No to stored procedures
I agree with the comment at Wednesday, May 30, 2007 at 9:15 PM.

All the examples given to support NOT using stored procedures can be satisfied with non-stored-procedure solutions.

Makky

Complex Business Logic and Stored Procedures
What is your take on this?
Performance-wise, wouldn’t it be better to do this at the application level? Application languages like C# would outperform T-SQL, wouldn’t they?

Makky

Using natural identity as PK is slow
Indexing on columns that have complex data and/or data types is not a good idea. SQL will always retrieve records quicker when searching on an indexed field that is of a simple type, like an IDENTITY key field. Yes or no?

Anonymous

Ten Common Database Design Mistakes
Very good article.

AnnaL

Anonymous comments disabled
We’ve had to disable anonymous comments on this article due to relentless spamming.

To post a comment please sign in, or register if you are not already a member.

ams101

Meaningless keys
It is a very strong argument that every column in a table should be meaningful. But there is one important situation which I have faced quite often, especially with material data: if you have to import data of limited quality and doubtful source (Excel etc.), you will love your additional system key and an additional record-status column.

Another horrible real-world example:
The newly designed 1995 five-digit postal zip codes (!!!) were not unique and are still not unique. You will need two additional columns, one for a proper id and one for East/West (the Wall still exists, at least in BI SW).

sgtish

I don’t know much but…
This article has been very informative for me. The only things I know about database design are the things that I’m learning, or have learned, the hard way. I don’t have a SQL Server and have to work with MS Access, but the beginning part, all the way down to “Lack of Documentation”, did reinforce all the lessons I’ve learned so far. After that things started to get fuzzy, because it was all new to me except for the testing part. I gave the newly built database to the users, and they found more bugs than I had anticipated and, fortunately or luckily, whichever you prefer, they were relatively easy to fix. I fully anticipate that when they really start to use it the serious bugs will rear their nasty heads.

I think that in the future better planning, user input, and testing will help me a lot.

rhasken

HELP!! Requirements Analysis
I’m new to this, so please bear with me. I work at a start-up company which was started ~ 9 years ago. We’re handling more work than we ever have before, and there’s a strain on what is available in our database, as compared to what we want. While planning our database improvement projects for this year, we realized that we probably need to create a master plan to detail what we want from our database. From what I’ve recently read, we’re going to go through the Requirements Analysis process of database planning. So, finally, my dilemma: how do I, as an engineer and a manager, communicate with our database team? Is there a format for creating a flowchart, or the like, that will help me to organize my ideas in a way that the implementors of our database will understand? Any advice is greatly appreciated. Thanks, Rob.

PeggyScott84

Thank you!
I am a newbie. It was a great article to get me started. It spoke about things I hadn’t anticipated (the point on surrogate vs. natural keys).
Thank you!

Seven24

Normalization
Ran across this article today. Well written, but I’ll add one item about normalization. Obviously, with any guidelines, one of the guidelines is that sometimes you have to deviate from them. Let’s take the example of having a table with only a surrogate key having a unique constraint. I wholeheartedly agree with the sentiment that, in general, if a user cannot differentiate between the rows, then a unique constraint is missing, and that every table should have at least one unique constraint over fields other than the surrogate key. However, let’s suppose we have a table called Person with child tables of Address and PhoneNumber, where we allow a person to have zero or more addresses and zero or more phone numbers. It is very possible to have two John Smiths in our Person table that represent different humans. However, a user would differentiate one John Smith from another based on their addresses and phone numbers. In this scenario, there is no combination of columns in the Person table alone that would uniquely identify one person from another (suppose also that no unique key from any outside entity, such as driver’s license or SSN, is being captured). Thus there are *rare* (very, very, very rare) times when there is no logical unique constraint that can be put on a table.

rudster

ORM – Object Relational Mapping
I was wondering what your thoughts are about ORM – Object-Relational Mapping – technologies, which have been flourishing as of late.

ORM in some ways fits in with most of the best practices under the concept that every entity represents only one thing. The benefits of having a domain model mapped to classes with most languages are now the cornerstone of a lot of frameworks (ex. Ruby On Rails, Hibernate, etc.)

One thing that needs to be thought about now is when to use a stored procedure as opposed to retrieving data from an ORM framework.

1) Maintainability – if the tables within the database change, then the stored procedure needs to be updated. On the other hand, an ORM has built-in capabilities to remap the data model to the new fields without needing to write SQL. Sometimes it is a matter of adding a new property to a class. If an ORM is used, then using stored procedures to do CRUD seems not to provide additional levels of maintainability. The case in which these two are head to head is the one-table SELECT, UPDATE, DELETE and INSERT cases. On the other hand, if the operation covers multiple tables that allow optimization of CRUD operations, then the stored procedure is the better route, but I have to admit I haven’t really run into this situation when using ORM yet.

2) Encapsulation – for me, if a view can be created to do the same thing that a stored procedure is accomplishing, then the view should be the first choice in providing an interface to access the data within a database. Views also have the advantage that they can be mapped with ORM.

3) Security – I think defining user privileges for each table is the way to go here. Stored procedures can provide some encapsulation to prevent unintended changes, but in general they don’t prevent users from making changes directly to the table. Defining the user’s privileges within the database will determine what the real security is.

Eric_Scherrer

Tough topic, but you nailed it! I have one more…
Another thing I see that drives me crazy is lazy naming of columns, particularly in the case of subtypes that did not warrant their own table. What I mean is when the database designer did not put enough thought into naming their attributes and just started naming columns as [entity]type, [entity]code or [entity]status. A classic example is UserStatus in a table called User. On launch date it would have the following values: {Married, Single}. After some time it is decided that more UserStatuses are needed, so the following are added: {Divorced, Widowed}. Then a few years later even more are stuffed in there: {Dependent, Ineligible}. By lazily naming that attribute, the door was left open for abuse of the column. It leads to derived values being added (Divorced, Widowed), non-like values being added (Married, Ineligible) and conflicting values being added (you can be Married and Divorced at the same time). I can guarantee this will happen every time you see a column ending in code, type, status, or any generalization thereof. It usually means the designer was just too lazy to give some thought to properly naming the column. In this case it should have been MaritalStatus, or even better there should be a separate table of marriage activities, since a person can be married many times.

nishantsinghai

Must-read article for beginners
I read the above article and immediately joined this forum. This is a must-read for beginners like me.
Thanks for sharing this with us!

herbey

One table to hold all domain values :
Great article; I only partially agree with this one, though. Yes, if you know what the domain is ahead of time, I totally agree that a targeted architecture is the best choice. However, when designing databases that ARE supposed to be general purpose, there is no way out but an all-encompassing table (or tables). I’ve bumped into quite a few situations in which I have had to design such tables, and I have also worked with commercial systems that use general-purpose underlying architectures in successful ways. True, maintenance, queries and constraints have to be enforced by code, thus making the operation more challenging, but the end result can work beautifully. One great advantage is reporting on many domains using the same underlying schema; the extra difficulty you underwent writing the complex queries can be balanced against not having to worry about remembering schemas you designed five years ago, as the same schema applies to the relevant domains.

steveoqld

George Harrison must have been extremely old …..
George was a great wordsmith, but he certainly was not the original source for your quote “If you don’t know where you are going, any road will take you there”.

The source is in fact Alice in Wonderland by Lewis Carroll, some 100 years previously (1865).
It was the Cheshire Cat, in response to Alice.

It just goes to show that anyone can write anything on the internet these days and most will believe it. I wonder how much error or made-up material without proper editing exists? Some say as much as 90%, but then stats are all made up anyway, hey 😉

Rowland

Surrogate Keys, Genericizing Code
Admittedly this is three years after your article appeared — sorry for the delay 😉

The IDENTITY() thing:
This has its roots in all sorts of Microsoft culture. Example after example reinforces this as “best practice”. Tools such as SSMS subtly suggest that the Primary Key and the Clustered Index are the same thing – or at a minimum are inextricably bound. To a new user it would appear that way, and of course they are not.

The truth (I think) is that monotonically increasing integers minimize page splits and help system performance. Using them as a surrogate primary key (and, by way of extension through the tools) creates the typical clustered index. This brings a lot of stuff you just can’t get off your shoes with a stick 😉

Genericizing Code:
Look no further than Microsoft’s SharePoint or CRM for your latest bad example.

Thanks for a nicely thought out article.

Phil Factor

Re: Surrogate Keys, Genericizing Code
The use of identity keys causes problems with a distributed database. Even Microsoft has started to warn people not to use them in such circumstances. The problem comes where entries are made in the same table in two different databases that then require data merge/synchronization. A natural key works a lot better.

Derek Asirvadem

Re: Normalisation/Steven24

1. The suggestion in your example that multiple PhoneNos or Addresses will uniquely identify a Person is not correct. They are multiples. Two discrete John Smiths could have lived at the same Address. It will not help anyone uniquely identify a Person.

2. Actually, that problem was nailed in the 1980s. The correct columns required to uniquely identify a Person are (LastName, FirstName, BirthDate, BirthPlace, BirthCountry). All 1::1 with the Person identifier, in the Person table, not in some other table.

3. Every table must have a unique key (not necessarily the Primary Key) which uniquely identifies each *row*. Otherwise it is simply not a Relational table. Artificially making rows unique, by adding a surrogate key, does not make the row unique.

—

Concepts such as “no Surrogate keys” are just as silly as “all Surrogate keys”. Surrogate keys are always an *additional* column and index. They need to be minimised and used wisely. In a Relational Database, the only valid circumstance for a Surrogate key is when the natural relational PK is too wide to migrate into child tables (eg. above, where PersonId is thus added as PK).

Natural keys cannot be substituted or dropped. In some cases, Surrogate keys need to be *added*.

A separate point. Most ‘databases’ these days are not Databases or Relational or normalised at all; they are merely application (developer) spreadsheets implemented as ‘tables’ and linked together by row_id. For those, Identity columns are fine. They have no Relational capability and the developer isn’t expecting any.

Georgie Porgie

Re: #2, are you implying that LastName, First Name, BirthDate, BirthPlace, and BirthCountry all combine to create a unique identifier for each person? Because if so, then this is not true. You could conceivably have two babies born in the same hospital on the same day with the same name. You still need a truly unique identifier here (SSN, etc.)

simohammed

Inheritance
Hi,

Thank you, Mr. Louis Davidson, for this useful tutorial, but you haven’t clarified one very big problem, ‘inheritance’, which poses a serious problem for me when designing databases.

I have a table “comments” which contains :
id, user, comment, parent, timestamp

This table contains comments for posts in a blog, images, and some other objects.

In order to avoid creating as many tables as there are object types (comments_images, comments_posts etc.), I’ve added an additional field:
object, which can take the following values: post, image etc.

But this approach violates the integrity of my database: I cannot create a foreign-key relation between the table of comments and the tables of objects (posts, images etc.).

Do you have another solution to address this particular problem of inheritance without violating integrity?

Thank you in advance
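For illustration only (this is not from the article, and the column names beyond the commenter’s example are made up): one commonly used way to keep those foreign keys enforceable is an exclusive-arc design – one nullable FK column per parent type, plus a CHECK constraint that exactly one of them is set. A sketch in SQLite via Python’s sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE posts  (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, path  TEXT)")

# One nullable foreign key per parent type; the CHECK guarantees that
# exactly one parent reference is set on every comment row.
conn.execute("""
    CREATE TABLE comments (
        id       INTEGER PRIMARY KEY,
        post_id  INTEGER REFERENCES posts(id),
        image_id INTEGER REFERENCES images(id),
        comment  TEXT NOT NULL,
        CHECK ((post_id IS NOT NULL) + (image_id IS NOT NULL) = 1)
    )""")

conn.execute("INSERT INTO posts (id, title) VALUES (1, 'First post')")
conn.execute("INSERT INTO comments (post_id, comment) VALUES (1, 'Nice post!')")
conn.commit()
```

The trade-off versus a single `object` type column is wider rows and an extra column per new parent type, in exchange for real referential integrity.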

HP

Hi Mohammad,
I am in the same situation as you. What worked out for you? Different columns with FK constraints, or one column and a type but removing the FK constraint?

Thanks

mperrigon

Testing your database design
Lack of testing might seem like an obvious one but it’s the one that gets me in trouble most often. What are some of the techniques you use for testing?

stored procedures
After reading this article I feel I will make a good study of these stored procedures. It sounds like they are handy, like function calls.

I favor descriptive terms like tblProductID, and I insist on using forms and other objects in such a way that no user, not even the DBA, gets to see them anyway. I feel it is the developer’s responsibility to create an application that totally insulates the objects from ALL users, especially a DBA who might dabble in code, decide they know a better way to do something, and then call the developer to fix what the DBA broke. The only extra access I give a DBA is to add/edit users and passwords etc. I include a suggestion box for DB design wishes that can be accommodated when such things are a real priority. No need for some dabbler to add things people want to the DB that ultimately go unused anyway and only add to bloat.
Rollin Shultz
Allentown, Pa

WghUk

Excellent article! Thank you.

John McPherson

Hi Louis, good article.
There is one point I would like to bring up that needs to be in this article and that is using generated vs external data as keys in the primary/foreign key relationships between tables.
I am a firm believer in system-generated primary/foreign key values that are used to tie the tables together. These keys should never rely on an external source for the relational mapping, as external sources cannot be controlled and hence are a referential-integrity problem ‘waiting to happen’.
And, most of the time, they should not even be exposed to users as they have no other meaning other than to tie one or several tables together.
There should be ‘other’ columns that are used for user lookups or searches such as CustomerName or DriversLicense.
That’s the only thing I would add.
I cannot tell you the number of times I have seen RI problems in legacy systems that violated this rule.
Thanks and have a great day!

sylvain

Hi sir John,
I appreciate your answer. I’m a beginner in databases and am actually studying SQLite. Please, I want to get a deep understanding of these two points:

1. By generated primary key, do you mean auto_increment?

2. How do you use a generated primary key to tie tables together in a database? Please, can you give some links that could help?

Sorry for my English level.

Thanks.

johnnycardy

1 – yes
2 – insert into table A and get back the primary key (the auto_increment value), then insert into table B with that value as the foreign key.
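Since the question mentioned SQLite, that two-step pattern can be sketched with Python’s sqlite3 module (the table and column names here are illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical parent and child tables.
conn.execute("CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE CustomerOrder (
        OrderId    INTEGER PRIMARY KEY,
        CustomerId INTEGER NOT NULL REFERENCES Customer(CustomerId),
        Total      REAL NOT NULL
    )""")

# Step 1: insert into the parent table; the INTEGER PRIMARY KEY is
# auto-generated, and lastrowid gives it back.
cur = conn.execute("INSERT INTO Customer (Name) VALUES (?)", ("Alice",))
customer_id = cur.lastrowid

# Step 2: use that value as the foreign key when inserting the child row.
conn.execute("INSERT INTO CustomerOrder (CustomerId, Total) VALUES (?, ?)",
             (customer_id, 19.99))
conn.commit()
```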

Rakibul Haq

Though old, it is a very useful article.

Thanks

Corbin

Does a great job outlining many of the elementary design pitfalls that you’ll find in your average corporate data model.

HistoryChannelGuy

After reading the article and then some of the comments and responses: while the article is indeed well written and correctly stated as far as I can tell, it seems the author didn’t quite grasp what he wrote. Which is kind of odd, but can make sense, I guess. Sometimes a thing is sitting right in front of you and one can fail to see it. Or maybe he just didn’t respond to the comments a second time to say, oh yeah, duh.
At any rate, good tips, well laid out and presented. As good a read in 2018 as it was in 2007.

SoulFireMage

Idle thoughts whilst I was hunting for something – I’m in the middle of an export-and-design-new-schema project and have all this opportunity to *cough* do it better. Yet as I read, I still manage to knowingly commit one of the cardinal sins across the whole design – that of the primary key fallacy. However (see below), I’m happy to argue that with someone more knowledgeable!

Naming:

I’ve never liked tibbling in naming – prefixing/suffixing should be a last resort IMHO.
I’ve lost track of how many times an IDE shows me a million tblxxx variations in old schemas.

Why not just use schemas other than dbo, and keep the name as clean as possible? E.g. SAP.SalesOrders, SAP.PurchaseOrders, CCU.Companies, CCU.Users. Here I’m choosing internally meaningful three-letter names, but that is arbitrary.

Even with hundreds of tables, I don’t have a huge list to parse alphabetically – I should already know what schema I need, often that will only have 10 – 50 tables in it.

It does need diagramming – I use Vertabelo to plan and brainstorm this stuff out first, as swapping schemas around may be awkward later down the line. However, it’s an easier format to navigate; it just requires that you know your database and why things have been organised as they have.

Primary key:

Too often I can’t see an obvious natural key and I’m biased by a reliance on Entity framework on the code side.

So, I stick with a convention – ID column = PK, TableNameID = FK.

It’s not a distributed database and won’t ever be (if it was, I’d probably have to go down the GUID route or read up on distributed database design).

I’ve seen loads of older schemas when exporting data and so on and so far it’s proven more reliable than anything else I’ve encountered.

Yes, it does have the flaw that you can simply duplicate rows, and I get this happening with inserts when doing export projects. So I agree with the other point: there should be some other uniqueness constraints – though each one is domain/table specific, and I wonder how many accidental DBAs have the time to do the correct amount of discovery to decide this?

Stored Procs (and Views and Transactions):

Woefully underused I think!

Why don’t they preach more widespread use of transactions? (And support table-valued params on the code layer (EF6.x) a bit better too?)

These look like a godsend for complex business transactions done via web apps. Just send the parent with children and children-of-children items in one parameter, make sure the transaction is completed with all the usual error-checking/rollback stuff, then report back success or failure.

As for views and stored procs, they’re too handy to miss out on. I’m surprised anything really complex relying on a database gets by without using one or both of these. Weird. I think it comes from the philosophy of the database as dumb storage – nowadays one could use a NoSQL solution instead for that.

EAV tables:

Yuk, hate them – I think you are better off using a NoSQL solution instead; I believe there are all kinds of options that stop one from breaking the data-integrity notions of a SQL database for generic tables.

Bret Taylor

A couple days ago I didn’t know what SQL was. I’ve read articles and watched youtube videos and nothing came close to teaching me as many fundamentals as this article. It took me a while to get through it but I did and understood most of it. Thank you.

You want to use synthetic primary keys on every table. This provides an abstraction for each row and also for the natural key(s), which is what you want when joining tables. Try joining a table with a composite key; you will likely be inclined to concatenate the two keys when joining. You would also have to carry those two fields into every table you want to join from, as a reference. This is eliminated with a synthetic or surrogate key. Also, joining using integers is the most performant with respect to query execution.

However, and specifically to the point I believe you are asking: you also need to utilize UNIQUE INDEXES in addition to the primary key to force uniqueness on the NATURAL KEY. This strategy provides the best of both: guaranteed uniqueness of the natural key, and a simplified and performant way to join tables using the surrogate key.
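That combination can be sketched in a few lines (the table and column names are made up for illustration; any SQL database with unique constraints behaves similarly – here SQLite via Python):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Surrogate integer key for joins, plus a unique constraint on the natural
# key so duplicate business data is still rejected.
conn.execute("""
    CREATE TABLE Product (
        ProductId  INTEGER PRIMARY KEY,      -- surrogate key, used for joins
        MakerCode  TEXT NOT NULL,
        PartNumber TEXT NOT NULL,
        UNIQUE (MakerCode, PartNumber)       -- natural key stays unique
    )""")

conn.execute("INSERT INTO Product (MakerCode, PartNumber) VALUES ('ACME', 'X-100')")
conn.commit()
```

Re-inserting the same MakerCode/PartNumber pair now fails with a constraint violation, even though every row still gets its own ProductId.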
