Data modeling is the act of exploring data-oriented structures. Like other modeling artifacts, data models can be used for a variety of purposes, from high-level conceptual models to physical data models. From the point of view of an object-oriented developer, data modeling is conceptually similar to class modeling. With data modeling you identify entity types whereas with class modeling you identify classes. Data attributes are assigned to entity types just as you would assign attributes and operations to classes. There are associations between entities, similar to the associations between classes – relationships, inheritance, composition, and aggregation are all applicable concepts in data modeling.

Traditional data modeling is different from class modeling because it focuses solely on data – class models allow you to explore both the behavior and data aspects of your domain, whereas with a data model you can only explore data issues. Because of this focus, data modelers have a tendency to be much better at getting the data “right” than object modelers. However, some people will model database methods (stored procedures, stored functions, and triggers) when they are physical data modeling. It depends on the situation of course, but I personally think that this is a good idea and promote the concept in my UML data modeling profile (more on this later).

Although the focus of this article is data modeling, there are often alternatives to data-oriented artifacts (never forget Agile Modeling’s Multiple Models principle). For example, when it comes to conceptual modeling, ORM diagrams aren’t your only option – in addition to LDMs it is quite common for people to create UML class diagrams and even Class Responsibility Collaborator (CRC) cards instead. In fact, my experience is that CRC cards are superior to ORM diagrams because it is very easy to get project stakeholders actively involved in the creation of the model. Instead of a traditional, analyst-led drawing session you can facilitate stakeholders through the creation of CRC cards.

Although methodology
issues are covered later, we need to discuss how data models can be used
in practice to better understand them. You are likely to see three
basic styles of data model:

Conceptual data models. These models, sometimes called domain models,
are typically used to explore domain concepts with project
stakeholders. On Agile teams high-level conceptual models are often
created as part of your initial requirements envisioning efforts
as they are used to explore the high-level static business structures
and concepts. On traditional teams conceptual data models are often
created as the precursor to LDMs or as alternatives to LDMs.

Logical data models (LDMs). LDMs are used to explore the domain concepts,
and their relationships, of your problem domain. This could be
done for the scope of a single project or for your entire
enterprise. LDMs depict the logical entity types, typically
referred to simply as entity types, the data attributes describing those
entities, and the relationships between the entities. LDMs are rarely used on Agile projects, although they often are on traditional projects (where they rarely seem to add much value in practice).

Physical data models (PDMs). PDMs are used to design the internal schema of
a database, depicting the data tables, the data columns of those tables,
and the relationships between the tables. PDMs often prove to be
useful on both Agile and traditional projects and as a result the focus
of this article is on physical modeling.

Feature                 Conceptual   Logical   Physical
Entity Names                ✓            ✓
Entity Relationships        ✓            ✓
Attributes                               ✓
Primary Keys                             ✓           ✓
Foreign Keys                             ✓           ✓
Table Names                                          ✓
Column Names                                         ✓
Column Data Types                                    ✓

Below we show the conceptual, logical, and physical versions of a single data model.

[Figures: Conceptual Model Design; Logical Model Design; Physical Model Design]

Although LDMs and PDMs sound very similar, and in fact they are, the level of detail that they model can be significantly different. This is because the goals for each diagram are different – you can use an LDM to explore domain concepts with your stakeholders and the PDM to define your database design. Figure 1 presents a simple LDM and Figure 2 a simple PDM, both modeling the concept of customers and addresses as well as the relationship between them. Both diagrams apply the Barker notation, summarized below. Notice how the PDM shows greater detail, including an associative table required to implement the association as well as the keys needed to maintain the relationships. More on these concepts later. PDMs should also reflect your organization’s database naming standards; in this case an abbreviation of the entity name is appended to each column name and an abbreviation for “Number” was consistently introduced. A PDM should also indicate the data types for the columns, such as integer and char(5). Although Figure 2 does not show them, lookup tables (also called reference tables or description tables) for how the address is used as well as for states and countries are implied by the attributes ADDR_USAGE_CODE, STATE_CODE, and COUNTRY_CODE.

Figure 1. A simple logical data model.

Figure 2. A simple physical data model.
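
To make this concrete, here is a rough SQL DDL sketch of what the physical schema of Figure 2 might look like. The column names mentioned in the text (CUST_FIRST_NAME, CUST_SURNAME, ADDR_USAGE_CODE, STATE_CODE, and COUNTRY_CODE) come from the discussion above; everything else – the remaining columns, data types, and constraint names – is an illustrative assumption rather than a transcription of the figure.

-- Sketch of the physical schema of Figure 2; names not mentioned in the text are assumptions.
CREATE TABLE TCUSTOMER (
  CUST_NUMBER      INTEGER     NOT NULL,  -- "Number" abbreviated per the naming standard
  CUST_FIRST_NAME  VARCHAR(40),
  CUST_SURNAME     VARCHAR(40),
  CONSTRAINT PK_TCUSTOMER PRIMARY KEY (CUST_NUMBER)
);

CREATE TABLE TADDRESS (
  ADDR_ID        INTEGER     NOT NULL,
  ADDR_STREET    VARCHAR(60),
  ADDR_CITY      VARCHAR(40),
  STATE_CODE     CHAR(2),               -- implies a lookup table of states
  ADDR_ZIP_CODE  CHAR(5),
  COUNTRY_CODE   CHAR(2),               -- implies a lookup table of countries
  CONSTRAINT PK_TADDRESS PRIMARY KEY (ADDR_ID)
);

-- Associative table required to implement the many-to-many relationship.
CREATE TABLE TCUSTOMER_ADDRESS (
  CUST_NUMBER      INTEGER NOT NULL,
  ADDR_ID          INTEGER NOT NULL,
  ADDR_USAGE_CODE  CHAR(1) NOT NULL,    -- implies a lookup table of usage codes
  CONSTRAINT PK_TCUSTOMER_ADDRESS PRIMARY KEY (CUST_NUMBER, ADDR_ID),
  CONSTRAINT FK_CA_CUSTOMER FOREIGN KEY (CUST_NUMBER) REFERENCES TCUSTOMER (CUST_NUMBER),
  CONSTRAINT FK_CA_ADDRESS  FOREIGN KEY (ADDR_ID)     REFERENCES TADDRESS (ADDR_ID)
);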

An important observation about Figures 1 and 2 is that I’m not slavishly following Barker’s approach to naming relationships. For example, between Customer and Address there really should be two names, “Each CUSTOMER may be located in one or more ADDRESSES” and “Each ADDRESS may be the site of one or more CUSTOMERS”. Although these names explicitly define the relationship, I personally think that they’re visual noise that clutters the diagram. I prefer simple names such as “has” and then trust my readers to interpret the name in each direction. I’ll only add more information where it’s needed, and in this case I don’t think it is. However, a significant advantage of describing the names the way that Barker suggests is that it’s a good test of whether you actually understand the relationship – if you can’t name it then you likely don’t understand it.

Data models can be used
effectively at both the enterprise level and on projects. Enterprise
architects will often create one or more high-level LDMs that depict the data
structures that support your enterprise, models typically referred to as
enterprise data models or enterprise information models. An enterprise
data model is one of several views that your organization’s enterprise
architects may choose to maintain and support – other views may explore
your network/hardware infrastructure, your organization structure, your
software infrastructure, and your business processes (to name a few).
Enterprise data models provide information that a project team can use both as
a set of constraints as well as important insights into the structure of their
system.

Project teams will
typically create LDMs as a primary analysis artifact when their implementation
environment is predominantly procedural in nature, for example they are using
structured COBOL as an implementation language. LDMs are also a good
choice when a project is data-oriented in nature, perhaps a data warehouse or
reporting system is being developed (having said that, experience seems to show
that usage-centered approaches appear to work even better). However LDMs
are often a poor choice when a project team is using object-oriented or
component-based technologies because the developers would rather work with UML
diagrams or when the project is not data-oriented in nature. As Agile
Modeling advises, apply the right artifact(s) for the
job. Or, as your grandfather likely advised you, use the right tool for
the job. It's important to note that traditional approaches to Master Data Management (MDM) will often motivate the creation and maintenance of detailed LDMs, an effort that is rarely justifiable in practice once you factor the total cost of ownership (TCO) into the return on investment (ROI) of those sorts of efforts.

When a relational database is used for data storage, project teams are best advised to create a PDM to model its internal schema. My experience is that a PDM is often one of the critical design artifacts for business application development projects.

Halpin (2001) points out that many data professionals prefer to create an Object-Role Model (ORM), an example of which is depicted in Figure 3, instead of an LDM for a conceptual model. The advantage is that the notation is very simple, something your project stakeholders can quickly grasp, although the disadvantage is that the models become large very quickly. ORMs enable you to first explore actual data examples instead of simply jumping to a potentially incorrect abstraction – for example, Figure 3 examines the relationship between customers and addresses in detail. For more information about ORM, visit www.orm.net.

Figure 3. A simple Object-Role Model.

My experience is that people will capture information in the best place that they know. As a result I typically discard ORMs after I’m finished with them. I sometimes use ORMs to explore the domain with project stakeholders but later replace them with a more traditional artifact such as an LDM, a class diagram, or even a PDM. As a generalizing specialist, someone with one or more specialties who also strives to gain general skills and knowledge, this is an easy decision for me to make; I know that the information I’ve just “discarded” will be captured in another artifact – a model, the tests, or even the code – that I understand. A specialist who only understands a limited number of artifacts and therefore “hands off” their work to other specialists doesn’t have this option. Not only are they tempted to keep the artifacts that they create but also to invest even more time to enhance those artifacts. Generalizing specialists are more likely than specialists to travel light.

Figure 4 presents a
summary of the syntax of four common data modeling notations: Information
Engineering (IE), Barker, IDEF1X, and the Unified Modeling Language (UML).
This diagram isn’t meant to be comprehensive; instead, its goal is to provide a basic overview. Furthermore, for the sake of brevity I wasn’t able to depict the highly detailed approach to relationship naming that Barker suggests. Although I provide a brief description of each notation in Table 1, I highly suggest David Hay’s paper A Comparison of Data Modeling Techniques as he goes into greater detail than I do.

Figure 4. Comparing the syntax of common data modeling notations.

Table 1. Discussing common data modeling notations.

Notation

Comments

IE

The IE notation
(Finkelstein 1989) is simple and easy to read, and is well suited for
high-level logical and enterprise data modeling. The only drawback of
this notation, arguably an advantage, is that it does not support the
identification of attributes of an entity. The assumption is that the
attributes will be modeled with another diagram or simply described in the
supporting documentation.

Barker

The Barker notation is one of the more popular ones; it is supported by Oracle’s toolset and is well suited for all types of data models. Its approach to subtyping can become clunky with hierarchies that go several levels deep.

IDEF1X

This notation is
overly complex. It was originally intended for physical modeling but
has been misapplied for logical modeling as well. Although popular
within some U.S. government agencies, particularly the Department of Defense
(DoD), this notation has been all but abandoned by everyone else. Avoid
it if you can.

UML

This is not an
official data modeling notation (yet). Although several suggestions for
a data modeling profile for the UML exist, none are complete and
more importantly are not “official” UML yet. However, the Object
Management Group (OMG) in December 2005 announced an RFP for data-oriented
models.

It is critical for an
application developer to have a grasp of the fundamentals of data modeling so
they can not only read data models but also work effectively with Agile DBAs
who are responsible for the data-oriented aspects of your project. Your goal in reading this section is not to learn how to become a data modeler; instead, it is simply to gain an appreciation of what is involved.

An entity type, also
simply called entity (not exactly accurate terminology, but very common in
practice), is similar conceptually to object-orientation’s concept of a class –
an entity type represents a collection of similar objects. An entity type
could represent a collection of people, places, things, events, or
concepts. Examples of entities in an order entry system would include Customer, Address, Order, Item,
and Tax. If you were class modeling you would expect to
discover classes with the exact same names. However, the difference
between a class and an entity type is that classes have both data and behavior whereas entity types just have data.

Ideally an entity should
be normal, the data modeling world’s version of cohesive. A normal
entity depicts one concept, just like a cohesive class models one
concept. For example, customer and order are clearly two different
concepts; therefore it makes sense to model them as separate entities.

Each entity type will
have one or more data attributes. For example, in Figure 1 you
saw that the Customer entity has attributes such as First
Name and Surname and in Figure 2 that the TCUSTOMER table
had corresponding data columns CUST_FIRST_NAME and CUST_SURNAME (a
column is the implementation of a data attribute
within a relational database).

Attributes should also be cohesive from the point of view of your domain, something that is often a judgment call – in Figure 1 we decided that we wanted to model the fact that people have both first and last names instead of just a name (e.g. “Scott” and “Ambler” vs. “Scott Ambler”) whereas we did not distinguish between the sections of an American zip code (e.g. 90210-1234-5678). Getting the level of detail right can have a significant impact on your development and maintenance efforts. Refactoring a single data column into several columns can be difficult (database refactoring is described in detail in Database Refactoring), although over-specifying an attribute (e.g. having three attributes for zip code when you only needed one) can result in overbuilding your system, and hence you incur greater development and maintenance costs than you actually needed.
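
As a rough illustration of the effort involved, the following sketch shows how splitting a hypothetical CUST_NAME column into first name and surname might play out in SQL. The table and column names are assumptions, and a real database refactoring would also include a transition period during which both the old and new columns are supported.

-- Hypothetical refactoring: split TCUSTOMER.CUST_NAME into first name and surname.
ALTER TABLE TCUSTOMER ADD CUST_FIRST_NAME VARCHAR(40);
ALTER TABLE TCUSTOMER ADD CUST_SURNAME VARCHAR(40);

-- Migrate existing data (naive split on the first space; real data is messier).
UPDATE TCUSTOMER
   SET CUST_FIRST_NAME = SUBSTRING(CUST_NAME FROM 1 FOR POSITION(' ' IN CUST_NAME) - 1),
       CUST_SURNAME    = SUBSTRING(CUST_NAME FROM POSITION(' ' IN CUST_NAME) + 1);

-- Only after every application has been updated to use the new columns:
ALTER TABLE TCUSTOMER DROP COLUMN CUST_NAME;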

Your organization should
have standards and guidelines applicable to data modeling, something you should
be able to obtain from your enterprise administrators (if they don’t exist you
should lobby to have some put in place). These guidelines should include naming conventions for both logical and physical modeling; the logical naming conventions should be focused on human readability whereas the physical naming conventions will reflect technical considerations.
You can clearly see that different naming conventions were applied in
Figures 1 and 2.

As you saw in Introduction
to Agile Modeling, AM includes the Apply Modeling Standards practice. The
basic idea is that developers should agree to and follow a common set of
modeling standards on a software project. Just as there is value in following common coding conventions – clean code that follows your chosen coding guidelines is easier to understand and evolve than code that doesn’t – there is similar value in following common modeling conventions.

In the real world
entities have relationships with other entities. For example, customers
PLACE orders, customers LIVE AT addresses, and line items ARE PART OF
orders. Place, live at, and are part of are all terms that define
relationships between entities. The relationships between entities are
conceptually identical to the relationships (associations) between objects.

Figure 5 depicts a
partial LDM for an online ordering system. The first thing to notice is
the various styles applied to relationship names and roles – different
relationships require different approaches. For example the relationship
between Customer and Order has two names, places and is
placed by, whereas the relationship between Customer and Address has
one. In this example having a second name on the relationship, the idea being that you want to specify how to read the relationship in each direction, is redundant – you’re better off finding a clear wording for a single relationship name, decreasing the clutter on your diagram. Similarly, you will often find that specifying the roles that an entity plays in a relationship negates the need to give the relationship a name (although some CASE tools may inadvertently force you to do this). For example, the role of billing address and the label billed to are clearly redundant; you really only need one. Similarly, the role part of that Line Item has in its relationship with Order is sufficiently obvious without a relationship name.

You also need to
identify the cardinality and optionality of a relationship (the UML combines
the concepts of optionality and cardinality into the single concept of
multiplicity). Cardinality represents the concept of “how many” whereas optionality represents the concept of “whether you must have something.” For example, it is not enough to know that customers place
orders. How many orders can a customer place? None,
one, or several? Furthermore, relationships are two-way streets:
not only do customers place orders, but orders are placed by customers.
This leads to questions like: how many customers can be enrolled in any
given order and is it possible to have an order with no customer
involved? Figure 5 shows that customers place zero or more orders and
that any given order is placed by one customer and one customer only. It
also shows that a customer lives at one or more addresses and that any given
address has zero or more customers living at it.
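
One way to make cardinality and optionality concrete is to look at how they are typically enforced in a physical schema. The sketch below is illustrative only – the table and column names are assumptions, and the order table is named CUSTOMER_ORDER simply to avoid the SQL reserved word ORDER.

CREATE TABLE CUSTOMER (
  CUSTOMER_NUMBER INTEGER NOT NULL PRIMARY KEY
);

-- "An order is placed by one and only one customer": the foreign key is mandatory.
CREATE TABLE CUSTOMER_ORDER (
  ORDER_NUMBER    INTEGER NOT NULL PRIMARY KEY,
  CUSTOMER_NUMBER INTEGER NOT NULL,   -- NOT NULL enforces "must have a customer"
  CONSTRAINT FK_ORDER_CUSTOMER
    FOREIGN KEY (CUSTOMER_NUMBER) REFERENCES CUSTOMER (CUSTOMER_NUMBER)
);
-- "A customer places zero or more orders" needs no constraint at all: a customer row
-- simply may or may not be referenced by rows in CUSTOMER_ORDER.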

Although the UML
distinguishes between different types of relationships – associations,
inheritance, aggregation, composition, and dependency – data modelers often aren’t as concerned with this issue as object modelers are. Subtyping, one application of inheritance, is often found in data models, an example of which is the “is a” relationship between Item and its two “sub-entities” Service and Product.
Aggregation and composition are much less common and typically must be implied
from the data model, as you see with the part of role that Line
Item takes with Order. UML dependencies are typically
a software construct and therefore wouldn’t appear on a data model, unless of
course it was a very highly detailed physical model that showed how views,
triggers, or stored procedures depended on other aspects of the database
schema.
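
Subtyping has no single relational implementation. One common strategy, sketched below purely as an illustration (the attribute names are assumptions), is to give each subtype its own table that shares the supertype’s key; a single table for the whole hierarchy is an equally valid alternative.

-- One possible implementation of the Item / Service / Product subtyping.
CREATE TABLE ITEM (
  ITEM_ID   INTEGER     NOT NULL PRIMARY KEY,
  ITEM_NAME VARCHAR(60) NOT NULL
);

CREATE TABLE PRODUCT (
  ITEM_ID         INTEGER NOT NULL PRIMARY KEY,   -- same key as the supertype
  SHIPPING_WEIGHT DECIMAL(8,2),
  CONSTRAINT FK_PRODUCT_ITEM FOREIGN KEY (ITEM_ID) REFERENCES ITEM (ITEM_ID)
);

CREATE TABLE SERVICE (
  ITEM_ID          INTEGER NOT NULL PRIMARY KEY,
  DURATION_MINUTES INTEGER,
  CONSTRAINT FK_SERVICE_ITEM FOREIGN KEY (ITEM_ID) REFERENCES ITEM (ITEM_ID)
);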

Some data modelers will apply common data model patterns – David Hay’s book Data Model Patterns is the best reference on the subject – just as object-oriented developers will apply analysis patterns (Fowler 1997; Ambler 1997) and design patterns (Gamma et al. 1995). Data model patterns
are conceptually closest to analysis patterns because they describe solutions
to common domain issues. Hay’s book is a very good reference for anyone
involved in analysis-level modeling, even when you’re taking an object
approach instead of a data approach because his patterns model business
structures from a wide variety of business domains.

There are two fundamental strategies for assigning keys to tables. First, you could assign a natural key, which is one or more existing data attributes that are unique to the business concept. In the Customer table of Figure 6 there are two candidate keys, in this case CustomerNumber and SocialSecurityNumber. Second, you could introduce a new column, called a surrogate key, which is a key that has no business meaning. An example is the AddressID column of the Address table in Figure 6. Addresses don’t have an “easy” natural key because you would need to use all of the columns of the Address table to form a key for itself (you might be able to get away with just the combination of Street and ZipCode depending on your problem domain); therefore introducing a surrogate key is a much better option in this case.

Let's consider Figure 6 in more detail. Figure 6 presents an alternative design to that presented in Figure 2: a different naming convention was adopted and the model itself is more extensive. In Figure 6 the Customer table
has the CustomerNumber column as its
primary key and SocialSecurityNumber as
an alternate key. This indicates that the preferred way to access customer
information is through the value of a person’s customer number although your
software can get at the same information if it has the person’s social security
number. The CustomerHasAddress table
has a composite primary key, the combination of CustomerNumber and AddressID. A foreign key is one or more
attributes in an entity type that represents a key, either primary or
secondary, in another entity type. Foreign keys are used to maintain
relationships between rows. For example, the relationship between rows in the CustomerHasAddress table and the Customer table is maintained by the CustomerNumber column within the CustomerHasAddress table. The interesting
thing about the CustomerNumber column
is the fact that it is part of the primary key for CustomerHasAddress as
well as the foreign key to the Customer table. Similarly,
the AddressID column is part of the
primary key of CustomerHasAddress as
well as a foreign key to the Address table to maintain the
relationship with rows of Address.
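
Expressed as SQL, the key structure just described for Figure 6 might look roughly like this; the data types and constraint names are assumptions, but the keys themselves follow the discussion above.

CREATE TABLE Customer (
  CustomerNumber       INTEGER  NOT NULL,   -- natural key chosen as the primary key
  SocialSecurityNumber CHAR(11) NOT NULL,
  CONSTRAINT PK_Customer PRIMARY KEY (CustomerNumber),
  CONSTRAINT AK_Customer UNIQUE (SocialSecurityNumber)   -- alternate key
);

CREATE TABLE Address (
  AddressID INTEGER NOT NULL,                -- surrogate key with no business meaning
  Street    VARCHAR(60),
  City      VARCHAR(40),
  ZipCode   CHAR(10),
  CONSTRAINT PK_Address PRIMARY KEY (AddressID)
);

CREATE TABLE CustomerHasAddress (
  CustomerNumber INTEGER NOT NULL,
  AddressID      INTEGER NOT NULL,
  -- composite primary key; each column is also a foreign key
  CONSTRAINT PK_CustomerHasAddress PRIMARY KEY (CustomerNumber, AddressID),
  CONSTRAINT FK_CHA_Customer FOREIGN KEY (CustomerNumber) REFERENCES Customer (CustomerNumber),
  CONSTRAINT FK_CHA_Address  FOREIGN KEY (AddressID)      REFERENCES Address (AddressID)
);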

Although the
"natural vs. surrogate" debate is one of the great religious issues
within the data community, the fact is that neither strategy is perfect and
you'll discover that in practice (as we see in Figure 6) sometimes
it makes sense to use natural keys and sometimes it makes sense to use
surrogate keys. In Choosing a Primary Key: Natural or Surrogate? I
describe the relevant issues in detail.

Data normalization is a process in which data attributes within a data model are organized to increase the cohesion of entity types. In other words, the goal of data normalization is to reduce and even eliminate data redundancy, an important consideration for application developers because it is incredibly difficult to store objects in a relational database that maintains the same information in several places. Table 2 summarizes the three most common normalization rules, describing how to put entity types into a series of increasing levels of normalization. Higher levels of data normalization (Date 2000) are beyond the scope of this article. With respect to terminology, a data schema is considered to be at the level of normalization of its least normalized entity type. For example, if all of your entity types are at second normal form (2NF) or higher then we say that your data schema is at 2NF.

Table 2. Data Normalization Rules.

Level

Rule

First normal form
(1NF)

An entity type is in
1NF when it contains no repeating groups of data.

Second normal form
(2NF)

An entity type is in
2NF when it is in 1NF and when all of its non-key attributes are fully
dependent on its primary key.

Third normal form
(3NF)

An entity type is in
3NF when it is in 2NF and when all of its attributes are directly dependent
on the primary key.

Why data normalization? The advantage of having a highly normalized data schema is that information is stored in one place and one place only, reducing the possibility of inconsistent data. Furthermore, highly normalized data schemas in general are closer conceptually to object-oriented schemas because the object-oriented goals of promoting high cohesion and loose coupling between classes result in similar solutions (at least from a data point of view). This generally makes it easier to map your objects to your data schema. Unfortunately, normalization usually comes at a performance cost. With the data schema of Figure 7 all the data for a single order is stored in one row (assuming orders of up to nine order items), making it very easy to access. With that schema you could quickly determine the total amount of an order by reading the single row from the Order0NF table. To do so with the data schema of Figure 8 you would need to read data from a row in the Order table, data from all the rows in the OrderItem table for that order, and data from the corresponding rows in the Item table for each order item. For this query, the data schema of Figure 7 very likely provides better performance.

Figure 7. An Initial Data Schema for Order (UML Notation).

Figure 8. A normalized schema in 3NF (UML Notation).
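
To make the trade-off concrete, compare the queries needed to total a single order under each schema. The column names below are assumptions based on the description above, and the Figure 8 Order table is written as CustomerOrder to avoid the SQL reserved word.

-- Figure 7 (0NF): everything for an order lives in one row, so a single read suffices.
SELECT Total
  FROM Order0NF
 WHERE OrderNumber = 12345;

-- Figure 8 (3NF): the same total requires joining three tables and aggregating.
SELECT SUM(oi.NumberOrdered * i.ItemPrice) AS Total
  FROM CustomerOrder o
  JOIN OrderItem     oi ON oi.OrderNumber = o.OrderNumber
  JOIN Item          i  ON i.ItemNumber   = oi.ItemNumber
 WHERE o.OrderNumber = 12345;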

In class modeling, there
is a similar concept called Class Normalization although that is
beyond the scope of this article.

3.8 Denormalize to Improve Performance

Normalized data schemas,
when put into production, often suffer from performance problems. This
makes sense – the rules of data normalization focus on reducing data
redundancy, not on improving performance of data access. An important
part of data modeling is to denormalize portions of
your data schema to improve database access times. For example, the data
model of Figure 9 looks nothing like the normalized schema of Figure
8. To understand why the differences between the schemas exist you must consider the performance needs of the
application. The primary goal of this system is to process new orders from
online customers as quickly as possible. To do this customers need to be
able to search for items and add them to their order quickly, remove items from
their order if need be, then have their final order totaled and recorded
quickly. The secondary goal of the system is to process, ship, and bill the orders afterwards.

Figure 9. A Denormalized Order Data Schema (UML notation).

To denormalize
the data schema the following decisions were made:

To support quick searching of item information the Item table
was left alone.

To support the addition and removal of order items to
an order the concept of an OrderItem table
was kept, albeit split in two to support outstanding orders and fulfilled
orders. New order items can easily be inserted into the OutstandingOrderItem table, or removed
from it, as needed.

To support order processing the Order and OrderItem tables were reworked into pairs to handle outstanding and fulfilled orders respectively. Basic order information is first stored in the OutstandingOrder and OutstandingOrderItem tables and then, when the order has been shipped and paid for, the data is removed from those tables and copied into the FulfilledOrder and FulfilledOrderItem tables respectively (a transaction sketch follows this list). Data access time to the two tables for outstanding orders is reduced because only the active orders are stored there. On average an order may be outstanding for a couple of days, whereas for financial reporting reasons fulfilled orders may be stored in the fulfilled order tables for several years until archived. There is a performance penalty under this scheme because of the need to delete outstanding orders and then resave them as fulfilled orders, clearly something that would need to be processed as a transaction.

The contact information for the person(s) the order is
being shipped and billed to was also denormalized
back into the Order table,
reducing the time it takes to write an order to the database because there
is now one write instead of two or three. The retrieval and deletion
times for that data would also be similarly improved.
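
The move from the outstanding tables to the fulfilled tables might be sketched as a single transaction along the following lines; the column lists are abbreviated, the names are assumptions, and the exact transaction syntax varies by database.

-- Fulfill an order: copy its rows to the "fulfilled" pair of tables, then delete
-- them from the "outstanding" pair, all within one transaction.
BEGIN TRANSACTION;

INSERT INTO FulfilledOrder (OrderNumber, CustomerNumber, ShipToContact, BillToContact, Total)
  SELECT OrderNumber, CustomerNumber, ShipToContact, BillToContact, Total
    FROM OutstandingOrder
   WHERE OrderNumber = 12345;

INSERT INTO FulfilledOrderItem (OrderNumber, ItemNumber, NumberOrdered)
  SELECT OrderNumber, ItemNumber, NumberOrdered
    FROM OutstandingOrderItem
   WHERE OrderNumber = 12345;

DELETE FROM OutstandingOrderItem WHERE OrderNumber = 12345;
DELETE FROM OutstandingOrder     WHERE OrderNumber = 12345;

COMMIT;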

Note that if your
initial, normalized data design meets the performance needs of your application
then it is fine as is. Denormalization should
be resorted to only when performance testing shows that you have a problem with
your objects and subsequent profiling reveals that you need to improve database
access time. As my grandfather said, if it ain’t
broke don’t fix it.

Evolutionary data
modeling is data modeling performed in an iterative and incremental
manner. The article Evolutionary Development explores
evolutionary software development in greater detail. Agile data modeling
is evolutionary data modeling done in a collaborative manner. The
article Agile Data Modeling: From Domain Modeling to Physical Modeling works
through a case study which shows how to take an agile approach to data
modeling.

Although you wouldn’t
think it, data modeling can be one of the most challenging tasks that an Agile
DBA can be involved with on an agile software development project. Your
approach to data modeling will often be at the center of any controversy
between the agile software developers and the traditional data professionals
within your organization. Agile software developers will lean towards an
evolutionary approach where data modeling is just one of many activities
whereas traditional data professionals will often lean towards a big
design up front (BDUF) approach where data models are the primary
artifacts, if not THE artifacts. This problem results from a combination
of the cultural impedance mismatch, a misguided need to enforce the “one truth”, and “normal” political maneuvering within your
organization. As a result Agile DBAs often find that navigating the
political waters is an important part of their data modeling efforts.

How do you improve your
data modeling skills? Practice, practice, practice. Whenever
you get a chance you should work closely with Agile DBAs, volunteer to model
data with them, and ask them questions as the work
progresses. Agile DBAs will be following the AM practice Model With Others so they should welcome the assistance as well as the questions – one of the best ways to really learn your craft is to have someone ask “why are you doing it that way?” You should be able to learn physical data modeling skills from Agile DBAs, and often logical data modeling skills as well.

Similarly you should
take the opportunity to work with the enterprise architects within your
organization. As you saw in Agile Enterprise Architecture they
should be taking an active role on your project, mentoring your project team in
the enterprise architecture (if any), mentoring you in modeling and
architectural skills, and aiding in your team’s modeling and development
efforts. Once again, volunteer to work with them and ask questions when
you are doing so. Enterprise architects will be able to teach you
conceptual and logical data modeling skills as well as instill an appreciation
for enterprise issues.

You also need to do some
reading. Although this article is a good start it is only a brief
introduction. The best approach is to simply ask the Agile DBAs that you
work with what they think you should read.

My
final word of advice is that it is critical for application developers to
understand and appreciate the fundamentals of data modeling. This is a
valuable skill to have and has been since the 1970s. It also provides a
common framework within which you can work with Agile DBAs, and may even prove
to be the initial skill that enables you to make a career transition into
becoming a full-fledged Agile DBA.