Database Design 101

Most IT
professionals and developers understand the basic purpose of databases
- to store data. But there's more to it than that. For example, you need
to decide how the data should be stored. Good decisions will lead to an
extensible database. Poor decisions may meet immediate requirements can
lead to many headaches in the long run.

If you're
strictly a database administrator (or a systems administrators who's responsible
for the database servers), there's a good chance that the contents of
the databases you administer are "black boxes." You understand the importance
of database management issues - performing backups, testing restores,
managing database file growth and monitoring and optimizing performance
- but you're probably not familiar with the basics of how table structures
are designed.

On the other
hand, if you're a database developer, it's critically important that you
understand how to design tables, views and other database objects. A seemingly
simple decision about how to arrange Customers and Order information within
a database can have severe impact on usability, manageability and performance.
Regardless of your background with relational databases and how they work,
a solid understanding of database design issues can form a great foundation
for successful database design and management. As you might have guessed
from the title, that's the topic of this article - the basics of database
design.

Note

In the previous three articles in this series, I covered the
basics of database objects and how SQL queries can be used to retrieve
information. Although this month's topic is somewhat independent of
the subject of those articles, it may help if you understand their
content before reading this column.

In this column,
I'll provide an overview of database design concepts. Instead of focusing
on the specifics of such concepts as referential integrity and database
normalization, I'll provide some basic info and point you to resources.
The goal will be to understand the factors that make for good design decisions.

Determining Business Requirements
As with any technology-related project, you should clearly define
the types of business problems you're trying to solve before you even
think about solutions. The design and implementation of databases is no
exception. The first step is to determine whether you need a database
at all. All too often, it seems that when people want to collect information,
this is the way to go. Most of the time, it probably is.

What reasons
would compel you to house your important information in a relational database?
Some reasons might include support for concurrency (that is, multiple
people can access and change the data at the same time), data security
(allowing people permissive access) and data manageability (modifying
and reporting on large quantities of data). If you determine these features
to be the basic requirements of your solution, chances are that a relational
database is the best choice.

Before you
begin the design of a database, ask these questions:

What
is the purpose of the database? Often, the answer to this high-level
question is obvious. For example, someone might request a database to
track customer information such as names and addresses. You might need
a method to keep track of inventory. In any case, you should understand
whether the database will be used primarily for the storage of transactional
information, or whether it will be used mainly for reporting purposes.

Who
are the subject matter experts? In the ideal world, the developer
would be the world's foremost expert on the business usage of the planned
database. But it's the real world, so before you embark on a quest to
design the perfect database, be sure you know which individuals within
your company can answer questions that you might have. For example,
if you're designing a financial database, you might need to know about
the smallest and largest acceptable values for an "Amount" column. Knowing
to whom the question should be asked is half of the solution.

What
are the detailed requirements for the system? Be sure you can identify
all of the entities (such as Customers, Orders and Invoices) that must
be tracked within the database. You should also understand how these
entities are related. It's important to be specific when you answer
this question as a few small changes in the requirements can call for
significant rework of a database design.

What
types of users and applications will access this database? It's
important to understand how the data you choose to track will be used.
For example, will sales people frequently be viewing Customer information
by region? Will hundreds of data entry clerks be entering thousands
of Order records per day? The more information you have about the usage
of your data, the more appropriate will be your database design.

What
are the plans for the future of this database? Is it expected that
users will be happy with the types of information we are planning to
collect? Or, is it likely that they're want to add more to the database
and/or integrate this information with other systems within the company.

If you have
the above information, you're probably in good shape. So far, you know
what you need. Now, let's move on the look at some considerations for
designing a database.

The Basics of Database Design
When faced with "designing" a database, novice database implementers
often try to think only about the storage of data. However, the method
in which information will be retrieved from your database is just as important.
Given a set of business constraints, I may be able to come up with multiple
database designs. Some of these, however, will make it difficult to retrieve
data in the format in which I need it. When you're designing a database,
it's important to first think about the types of entities that you want
to track. Let's continue to use the example of a Sales Order Entry system
(so far, I've mentioned Customers and Orders). In such a system, you should
be able to identify the characteristics of Customers, the characteristics
of Orders, and how these entities interact.

A typical
statement that describes entity relationships might be something like
the following: "A Customer can have zero or more Orders, and all Orders
must be linked to an existing Customer. An Order can contain one or more
Line Items. Each line item includes information about exactly one Product
that has been ordered. Each of these Products is identified by a unique
number."

Figure 1
shows an example of a basic database schema that shows these relationships.
The schema was created in SQL Server 2000 and is provided in the "pubs"
sample database. However, many different database design tools allow you
to easily create tables and define their relationships. For now, don't
worry about the details of the lines between the various tables that I've
created.

Now that
we've covered a simple example of how a database can be designed, let's
build upon this concept.

Dealing with Relationships OK, so this title might sound like something out of a self-help book.
Well, it sort of is - self-help for database developers. It's important
to define and understand the relationships between objects in your database.
If you've been following the examples thus far, you know that it's important
to be able to enforce the rules that the database objects must follow.
For example, you may want to make sure that a column named ZIPCode contains
only valid United States ZIP code formats.

Note

This article primarily focuses on database design issues,
in general. In some cases, SQL Server 2000 tools and terminology will
be used to illustrate specific concepts. If your environment does
not use SQL Server 2000, rest assured that the same information applies
to most relational database systems that are available today.

In this section,
we'll look at ways to make sure that only the right types of information
are entered into your database. We'll focus on ways to ensure that your
information is kept consistent and the relationships within and between
tables are retained.

Defaults, Rules, and Constraints
When you define a column within a table, there's a good chance that
you know the types of values that are acceptable. A PhoneNumber field
might allow only integers, whereas an Address field might allow alphanumeric
characters. The first step in determining acceptable types of data is
in the table definition. When you create a table, you must choose from
among the data types that are available in the table designer (see Figure
2). In SQL Server, for example, you might choose an int (an integer data
type), a varchar (a variable-length character field), a text (long text
fields), or a boolean (true/false) value for a column. Your choices here
will determine what types of information can be stored in these columns.

Choosing
a data type limits the types of data that can be stored in a column, but
sometimes you want to take it even further. In addition to restrictions
defined by data types, you can place other constraints on the types of
information that are acceptable in a column of data. SQL Server provides
several different types of constraints:

NOT NULL
- A value must be specific for this column.

CHECK
- The values supplied must meet the criteria specified in the constraint.
A database developer can define a CHECK constraint to ensure that an
entered integer is an even, positive value.

UNIQUE
- No values in this column may be duplicates of another. This might
be used, for example, in an employee information table to prevent duplicate
employee numbers from being entered.

PRIMARY
KEY - Defines which column or columns uniquely identify each row in
the database. No two rows can have the same values for the primary key.

We'll discuss
the PRIMARY KEY and FOREIGN KEY constraints in the next section. CHECK
constraints place limitations on the types of information that can be
stored in a specific column of data. For example, I might want to restrict
the value in a phone number column to a 10-digit format (with no dashes
or other characters). A constraint can ensure that information is entered
in numeric format. If the information does not meet the criteria, an error
is returned to the user.

Constraints
can be placed on one or more columns and can be quite complex, if that's
what your business rules call for:

Defaults
are settings placed on a table that specify which values should be used
if none is specified. This type of constraint is commonly used in
situations where the database assumes that certain values should be
used unless they are otherwise specified by a user or an application.
For example, I might want to create default responses for the columns
that represent questions in a simple customer survey. If no value is
provided for these columns, I may want the value to default to "N/A".

Rules
function similarly to constraints but have the added benefit of existing
as database objects. In contrast to constraints-which are defined
as part of a column's definition-rules can be "bound" or "unbound" to
columns. This allows the flexibility of disabling a rule without losing
its definition. However, only one rule may apply to a column's definition.
Rules are provided mainly for backward-compatibility with SQL Server
applications. Microsoft recommends that, wherever possible, CHECK constraints
be used instead of rules.

Now that
we have a good idea of the types of column-level constraints that are
available, let's look at the various types of referential integrity that
are possible.

Domain, Entity, and Referential Integrity
It is possible to create a database in which all the information stored
in tables is completely unstructured and unrelated. This would lead to
many problems, however, as is often learned by those who do not take the
time to adequately plan the structure of their databases. Generally, difference
pieces of information stored in your database objects relate to each other
in some way. Again, let's revisit the commonly used example of a sales
database. Each sale might be tied to a customer, but the actual information
about the customer (including shipping address and purchase history) might
be stored in other tables. In this case, a clear relationship between
the two tables must be kept intact. Additionally, business rules might
require that each customer have a unique customer number that should never
be reused.

Part of playing
the database game is to ensure that the rules are followed; that is, you
must be able to enforce the relationships between entities in your database.
Consider the case in which you have a simple Customer - Orders relationship.
Suppose that in your system, a Customer can have zero or more orders,
and all orders must be attached to a customer. This might seem like an
easy rule to define, and you can reasonably code for it within your application.
But what happens if another developer forgets the rule and allows users
to delete a customer who has several orders without first deleting all
of the orders themselves? In this case, you'll have violated the integrity
of your database by creating "orphan" records in the Orders table. The
simple answer is that this shouldn't be allowed to happen, and the developer
or end user should receive an error. Stopping short of this, you run the
risk of entering invalid information into your database (something that
can be a very serious problem and can be difficult to correct after it
occurs). So what's a good database designer to do? Fortunately, there's
a relatively easy solution.

Integrity
constraints are created to ensure that these relationships are maintained
in a consistent manner. There are three major types of integrity that
database designers must keep in mind:

Domain
integrity - Ensures that values stored within a column are consistent
and in accordance with business rules. Domain integrity can be based
on constraints such as UNIQUE and CHECK constraints that specify what
values are acceptable for each column.

Entity
integrity -- Refers to information stored in rows (remember that
each row in a table stores information about one entity of the type
that the table describes). This type of constraint makes sure that the
information stored in rows within a table is consistent and follows
the rules specified. For example, each row must contain the same number
of columns (although some values may be left blank).

Referential
integrity - Applies across tables and ensures that information between
these objects is consistent. Referential integrity includes relationships
between tables. The actual columns that match between the tables are
known as foreign keys and primary keys, and they can be defined using
PRIMARY KEY and FOREIGN KEY constraints (mentioned earlier). Referential
integrity ensures that related information remains consistent. It solves,
for example, the problem I mentioned earlier: ensuring that only valid
customers are used for all orders placed in the database and avoids
the problem of "orphan rows". Orphans might occur when a customer row
is deleted from the database, but the customer still has orders. In
this case, the orders are orphans since their parent row (the customer
information) no longer exists. When it comes time to fulfill the orders,
users will find that they do not have enough information.

With the
basic ideas of database integrity out of the way, let's move on to another
important topic: database normalization.

Understanding Database Normalization
Remember
earlier in this article when I referred to the fact that there can be
several good database designs that meet a given set of requirements? There's
definitely a subjective aspect to database design. One important concept
is that of normalization. Normalization refers to the level of separation
of data into multiple tables. There are many rules of normalization, and
they have been described in various ways. To remain at a high-level, I'll
just provide the basics. One rule of normalization is that every column
within a table should refer only to one entity. For example, if I have
a Customer table, every piece of information within that table (like the
customer's name and address) should correspond only to that customer.
Additionally, each row within the Customers table should have a unique
identifier of some sort (such as a Customer Number value that is unique
for each customer).

A highly
normalized database schema is characterized by having many "narrow" tables.
By narrow, I mean the tables tend to have few columns. A denormalized
database schema, on the other hand, tends to have fewer, wide tables.
In general, a normalized database is easier to understand since tables
refer to very specific entities.

So what should
you consider when you decide how normalized your database should be? Well,
performance is an important issue that is addressed by normalization.
Highly normalized tables can improve performance in transaction processing
systems as they limit the performance impacts of locking. If your database
experiences a large number of data modification queries (INSERT, UPDATE
and DELETE statements), then normalization is the way to go. On the other
end of the spectrum, a denormalized database schema provides better performance
for reporting functionality (SELECT statements). Since you'll have to
join with fewer tables in your queries, reporting, especially with large
sets of data, can be much quicker. Of course, denormalization comes at
a cost: You may have redundant data, which requires additional storage
space.

Remember
that the concept of normalization is a continuum. That is, it's not like
a database is either normalized or it's not. Rather, it can be "somewhat
denormalized", or "highly normalized". Books have been written about database
normalization techniques and recommendations. We don't have room in this
article to dive into the details. Remember, this is an overview!

Summary
So, there you have it - the basics of database design. In this article,
we looked at the basic design goals that you should keep in mind when
you are developing a database solution. We covered the all-important concepts
of referential integrity and database normalization. Both of these aspects
form the basis for much longer discussions about how databases should
be designed. The good news is that, if you understand these concepts,
you're well on your way to designing efficient databases that meet your
business requirements. The bad news is that we only scratched the surface
within this article. Nevertheless, be sure you take the time determine
your business requirements and design an appropriate database. Sometime
soon, someone will be glad you did!

Additional Information

Microsoft wants to test your ability to design real-world
database solutions and to take advantage of all of the many features
available in SQL Server 2000. With exams that are related to database
design and implementation, it can be difficult to judge when you're
really ready. I recommend having at least several months of experience
in working with SQL queries and designing databases, to give you a
strong foundation.

With
this information in mind, look at all of the new features in SQL
Server 2000 that you have NOT used: don't forget about XML features,
materialized views, partitioning and other features, just because
you don't use them. The following resources should help you obtain
more information about SQL Server 2000: