The Simple Case Is Not Much Help

We will begin by examining a simple case of a shopping
cart. We have columns QTY and PRICE, and we want to
add the column EXTENDED_PRICE that will contain
PRICE * QTY. Our dictionary might look something like
this:
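The original dictionary snippet is not reproduced here; a sketch of the flavor, in the YAML style used later in this series (the key names are hypothetical):

```yaml
# Hypothetical dictionary entry -- key names are illustrative only
table orderlines:
    column qty:
    column price:
    column extended_price:
        formula: "price * qty"
```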

This seems simple enough: we have specified the formula right in
the table definition, and now we are free to make use of that
formula in any way we want -- either by generating code or by
interpreting it at run-time.

Unfortunately it is a bad idea to start coding right now
with this example. The problem is that it is too simple,
and will lead us down paths that cause great problems when
we hit more complex cases. We must begin with a more complex
case before we consider how to use this formula in our
framework.

Not All Items Are Taxable

Consider the case where you have a shopping cart online and
you must collect sales tax, but not all items are taxable.
This means you need conditional logic of some sort, you must
look at a flag and then decide whether to add tax. Here is
a first stab at what this might look like:
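The original example is not shown; presumably it embedded the conditional logic as a raw SQL expression, something like this hypothetical sketch (the tax rate is a placeholder):

```yaml
# Hypothetical -- the conditional coded as a raw SQL fragment
table orderlines:
    column flag_taxable:
    column tax:
        formula: >
            CASE WHEN flag_taxable = 'Y'
                 THEN price * qty * 0.07
                 ELSE 0 END
```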

While this looks like a simple enough extension to the first
example, it gets us into a thorny decision: the choice
between parsing and assembling.

Parse Versus Assemble

Before I get into the parse-versus-assemble question, let me
pull back and explain why the example bothers me, and why
it is worth an entire essay to discuss. In short, we intend
to use the dictionary to implement a radical form of DRY --
Don't Repeat Yourself (see the Wikipedia article on DRY). Once we have specified the
formula in the dictionary, we want to use it for all
code generation and docs generation at very least, but we
may also want to refer to the formulas in Java code (or PHP,
Python, Ruby etc.) or even in Javascript code on the browser.

But the problem with the example is that it is coded in
SQL. In the form I presented, it can be used for generating
triggers, but not for anything else, unless you intend to
use a parser to split it all up into pieces that can be
reassembled for different presentations. The example as
written is useful only for a single purpose -- but everything
in our dictionary ought to be useful at any layer in the
framework for any purpose.

But it gets worse. What if the programmer uses a dialect
of SQL aimed at one platform that does not work on another?
To guarantee cross-server compatibility, we not only have to
parse the phrase, we then have to re-assemble it.

There is a third argument against the use of SQL expressions.
We may be able to parse the expression and satisfy ourselves
that it is valid, but that still does not mean it will
work -- it may refer to non-existent columns or require
typecasts that the programmer did not provide. This leads to
one terrible event that you ought to be able to prevent when
you use a dictionary: having an upgrade run successfully only
to hit a run-time error when somebody uses the system.

A much simpler method is to assemble expressions by
having the programmer provide formulas that are already cut up
into pieces.

The Assembly Route

So I want to have formulas, including conditionals, and I want
to be able to use the formulas in PHP, Java, Javascript, inside
of triggers, and I want to be able to generate docs out of them
that do not contain code fragments, and I want to be able to
guarantee when an upgrade has run that there will be no errors
introduced through programmer mistakes in the dictionary.
The way to do this is to specify the formulas a little
differently:
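The original snippet is not reproduced here; a hypothetical sketch of the pre-parsed form (the key names are my own invention, the idea follows the notes below):

```yaml
table orderlines:
    column tax:
        calculate:
            case 1:
                compare: "@flag_taxable = 'Y'"
                return: "@price * @qty * @taxrate"
            case 2:
                return: "0"
```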

A case without a compare is unconditional: it always
returns its value and processing ends.

I stuck little '@' signs in front of the column names;
I will explain those in a moment.

In short, we want the programmer to provide us with the
conditional statements already parsed out into little pieces,
so when we load them they look like data instead of code.
We now have the responsibility for assembling code fragments,
but in exchange we have pre-parsed data that can be handed
to any programming language and used.
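To make the point concrete, here is a minimal sketch (not the author's implementation; the function and variable names are hypothetical) of how a list of pre-parsed cases might be assembled into a SQL CASE expression:

```python
# A sketch of assembling pre-parsed formula "cases" into a SQL
# CASE expression. Column references are marked with '@' so they
# can be rewritten for any target language; for SQL we strip them.

def assemble_sql(cases):
    """cases is a list of (compare, return) pairs; a case with no
    compare is the unconditional default."""
    whens = []
    default = "NULL"
    for compare, result in cases:
        expr = result.replace("@", "")      # SQL uses bare column names
        if compare:
            whens.append("WHEN %s THEN %s" % (compare.replace("@", ""), expr))
        else:
            default = expr
    return "CASE " + " ".join(whens) + " ELSE " + default + " END"

cases = [("@flag_taxable = 'Y'", "@price * @qty * @taxrate"),
         (None, "0")]
print(assemble_sql(cases))
# -> CASE WHEN flag_taxable = 'Y' THEN price * qty * taxrate ELSE 0 END
```

Because the pieces arrive as data, the same list could just as easily be rendered as a PHP expression or Javascript for the browser.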

Conclusion: Assembly Means Data

The decision to go the assembly route is simply another example
of the Minimize Code, Maximize Data principle. The dictionary
itself should be composed entirely of data values, no code snippets
should be allowed to sneak in. The reason is simple. No matter what
route we follow we will have to validate and assemble the formula -
be it for PHP, Javascript, or an alternate database server. But if
we let the programmer give us code snippets we have the extra
burden of parsing as well. Who needs it?

Sunday, January 18, 2009

The stunning power of a data dictionary comes into play once
the dictionary contains formulas for calculated values.
The dictionary can then be used to generate code, and also
to generate documentation. This double-win is not available
without the calculations because the resulting docs and
database would be incomplete, requiring tedious and
error-prone manual completion.

There are links to related essays on normalization and denormalization at the bottom of this post.

Calculations and Normalization

Before I begin, I will point out that all calculated values
stored in a database are denormalizing: they all
introduce redundancies. This does not mean they are bad;
it just means you need a way to make sure they stay
correct (see Keeping Denormalized Values Correct, and also Normalization and Automation). If you cannot
keep them correct, they will get very bad very fast. This essay
will show you one approach to ensuring calculated values
are always correct.

However, before I start, I have to point out how important
it is to begin by normalizing your database (to at least 3NF)
and adding calculations only upon the strong foundation
of a normalized database. If you do not normalize first,
you will discover that it is impossible to work up formulas
that make any sense -- values will never seem to be quite
where you need them, there will always seem to be one more
kind of calculation to support, and it will be very difficult
to write a code generator that gives strong results.
But if you build on a normalized database, it turns out you
only need a few features in your dictionary and your code
generator.

Use Denormalization Patterns

Once you have normalized your database, you will find that
your calculations all fall into three basic categories
(detailed in April 2008 in Denormalization Patterns). These three patterns are:

FETCH operations, like copying an item's price from
the ITEMS table to the ORDERLINES table.

EXTEND operations, which are calculations within a row,
such as assigning EXTENDED_PRICE the value of QUANTITY * PRICE.

AGGREGATE operations, like a SUM of the lines of an order
to the order header.

This week we will look at the first type of operations,
the FETCH operations.

Putting the FETCH Into Your Data Dictionary

So we have an ORDERLINES table, and it contains a PRICE
column, and the value of that column should be copied from
the ITEMS table. This is an extremely common operation in
most database applications, so we decide it would be really
cool if we could specify that in the data dictionary and have
the code generator take care of it. This would chop a lot of
labor off the development process.

Here is how a column like this would appear in my own
dictionary format:
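The original listing is not shown; a hypothetical sketch of the idea, in the YAML style used elsewhere in this series (the key names are my guess):

```yaml
# Hypothetical keys -- the point is that PRICE is declared
# as a FETCH from the ITEMS table:
table orderlines:
    column sku:
    column price:
        fetch: items
```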

This looks nice: I have put the formula for the PRICE column into
the data dictionary. Now of course I need that formula to get
out into the application somehow, so that it will always be
executed and can never be violated. We will now see how to
do that.

The Trigger Approach

When it comes to code generators, if there are ten programmers
in a room, there are going to be at least ten opinions
on how to write and use a code generator (the non-programmer boss
will also have an opinion, so that makes eleven). I have no interest
in bashing anybody's approach or trying to list all of the
possibilities, so I will stick with the approach I use myself,
which is to generate database trigger code. If you want to know
why that approach works for me, check out Triggers, Encapsulation and Composition.

When I work on code generators, I begin by manually coding an
example of what I'm getting at, so I know it works. The trigger
snippet we are looking for must do two things. It must make sure
the price is always copied, and it must make sure that no user
can subvert the value. This snippet (which is in the PostgreSQL
flavor of server-side SQL) does this on an insert:

-- top of trigger....
-- PART 1: Prevent users from subverting the
--         formula by throwing an error if they
--         try to supply a value:
IF new.price IS NOT NULL THEN
    ErrorCount = ErrorCount + 1;
    ErrorList = ErrorList || 'price,5001,may not be explicitly assigned;';
END IF;
-- PART 2: If the value of SKU exists, use it to look
--         up the price and copy it into the new row
IF new.sku IS NOT NULL THEN
    SELECT INTO new.price par.price
      FROM items par
     WHERE new.sku = par.sku;
END IF;
-- more trigger stuff

NOTE! You may notice my trigger code somehow seems to "know" to
use the SKU column when searching the ITEMS table, yet my formula
did not specify that. I am assuming your data dictionary contains
definitions of primary keys and foreign keys, otherwise it is of
no real use. I am further assuming that when I see the formula
to "FETCH" from the ITEMS table, I can look up the foreign key that
matches ORDERLINES to ITEMS and find out what column(s) to use.

The example above works on INSERT operations only. You need
a slightly different version for updates, which throws an error
if the user attempts to change the price, and which does a new
FETCH if the user has changed the SKU value.
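A sketch of what that UPDATE-time counterpart might look like, in the same PostgreSQL style (this is my guess at the variant, reusing the ErrorCount/ErrorList variables from the insert version; NULL handling is omitted for brevity):

```sql
-- Hypothetical UPDATE version of the snippet:
-- throw an error if the user tries to change the price...
IF new.price <> old.price THEN
    ErrorCount = ErrorCount + 1;
    ErrorList = ErrorList || 'price,5001,may not be explicitly assigned;';
END IF;
-- ...and do a fresh FETCH if the SKU has changed
IF new.sku <> old.sku THEN
    SELECT INTO new.price par.price
      FROM items par
     WHERE new.sku = par.sku;
END IF;
```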

Sidebar: A Complete Trigger

If you want a teaser on how many amazing things the trigger can
do once you've loaded up your dictionary and builder with features,
here is a bit of code from a demo application. Most everything in
it will get treated in this series on the data dictionary.

Variations on FETCH

I have found two variations on FETCH that have proven very
useful in real world applications.

The first I call DISTRIBUTE. It is dangerous because it can be
a real performance killer, and it turns out you very rarely need it.
However, that being said, sometimes you want to copy a value from
a parent table down to every row in a child table when the value
changes in the parent. The first time I did this was to copy the
final score from a GAMES table into a WAGERS table on a fake
sports betting site.

The other variation I have found useful is FETCHDEF, my shorthand
for "fetch by default." In this variation the user is free to
supply a value of their own, but if they do not supply a value then
it will be fetched for them.

The Code Generator Itself

As for writing the code generator itself, that is of course far more
than I can cover in one blog entry or even ten. Moreover, since
anybody who decides to do so will do so in their own language
and in their own style, there is little to be gained by showing
code examples here.

Conclusion: Expand Your Dictionary!

If you make up a data dictionary that only contains structure
information like columns and keys, and you write a builder program
to build your database, you can get a big win on upgrades and
installs. However, you can take that win much farther by adding
calculated values to your database and expanding your builder
to write trigger code. This week we have seen what it looks like
to implement a FETCH calculation in your dictionary and what the
resulting trigger code might look like.

Sunday, January 11, 2009

When you set out to use a Data Dictionary to control your
database upgrades, you must consider not just columns and
tables, but also indexes and keys. The process for adding
indexes and keys is almost the same as that for columns and
tables, but there are a few wrinkles you have to be aware
of.

Review of Basic Ideas

In my First Essay On Data Dictionaries, one of the major points
was that the dictionary is easiest to use if it is in some
type of plaintext format that is in source control along with
the rest of your application files, and is then processed with
a "builder" program.

Last week we saw the Basic compare operation used by the builder to
build and update tables.
You read in your data dictionary, query the information_schema
to determine the current structure of the database, and then
generate commands to add new tables and add new columns
to existing tables.

The Importance of Keys and Indexes

If a builder program is to be useful, it must be complete;
if it leaves you with manual tasks after a build, then the
entire concept of automation is lost. The builder must be
able to build the entire structure of the database at
very least, and this means it must be able to work out
keys and indexes.

The Basic Steps

The basic steps of building indexes and keys are these:

Load your dictionary to some format you can work with
easily. I prefer to load it to tables.

Query the database's INFORMATION_SCHEMA to determine
which indexes and keys already exist.

Execute some type of diff to determine which indexes
need to be built.

Build the indexes that are not there.

If you like, drop the indexes that are not in the spec.

Sidebar: Dropping Indexes

A builder program can add things, and it can also drop things.
When it comes to destructive operations, I prefer not
to have my builder drop tables or columns, because the
consequences of a mistake can be unthinkable.

However, when it comes to indexes, it is much more likely to
be ok to drop a stray index. Dropping indexes does not destroy
user data. Also, extraneous indexes will slow down inserts
and updates, so getting rid of them is usually the Right Thing.

Step 1: Your Specification

Your data dictionary format must have some way of letting
you specify an index. It is also a good idea to allow you
to specify an ascending or descending property for each
column, and to specify if the index is to be unique (effectively
making it a unique constraint).

As far as loading this into memory, I covered that in some detail
last week and will not dwell on it here. I will simply assume
you have code to parse and load the spec to a format that works
for you.

Step 2: The Information Schema or Server Tables

When I set out to write my builder, I found that the information_schema
was a bit more complicated than I needed. The server I was using,
Postgres, had a simpler way to get what I wanted. I also found I would
get all kinds of extraneous definitions of indexes on system tables
or tables that were not in my spec. The query below
was the easiest way to get index definitions that were limited to the
tables in my spec on the Postgres platform:
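The original query is not shown; on Postgres, something along these lines would do it, using the pg_indexes system view (and assuming the spec's table list has been loaded into a TABLES_SPEC work table, a name of my own):

```sql
-- A sketch only: pg_indexes gives one row per index, with the
-- complete CREATE INDEX statement in the indexdef column
SELECT tablename, indexname, indexdef
  FROM pg_indexes
 WHERE schemaname = 'public'
   AND tablename IN (SELECT table_id FROM tables_spec);
```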

As far as primary keys and foreign keys go, the story is basically
the same: your server may provide them in a convenient form, as
Postgres does for index definitions, or you may have to dig a little
deeper to get precisely what you want.

Step 3: The Diff

So now we have a picture of the indexes we need to exist, and the
indexes that already exist. It is time to look at how to diff
them effectively. This step does not work the same way as it does
with columns and tables.

Before we go into how to do the diff, let's review how we did it with
tables and columns. We can basically diff tables and columns by
name. If our spec lists table CUSTOMERS and it does not appear
to exist in the database, we can build the table CUSTOMERS, simple
as that. But with indexes the name really does not mean anything;
what really matters is which columns are being indexed.

This is why we diff indexes on the column definitions, not on
their names. If you want a complete trail, you would begin with
this table that describes your own indexes:
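The original table is not shown; based on the four rows described below, it would have looked roughly like this (the table assignments and index names here are hypothetical):

```
TABLE      SPEC COLUMNS       SPEC NAME    DB NAME       ACTION
customers  zipcode,state      cust_zip     cust_zip      none
items      bipad,year,issue   items_biy    items_bipyi   none
orders     date_placed        orders_date  (none)        BUILD
orders     salesrep           (none)       orders_rep    DROP?
```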

The first row shows that the index on zipcode+state on the
customers table is in both the spec and the database; we take
no action on that index.

The second row shows that the index on bipad+year+issue is
also in both the database and the spec. This particular index
has a different name in the database, but we don't care.
(Maybe the programmer changed the name in the spec). We take
no action on this index.

The third line shows an index on the ORDERS table that is not
in the database; we must build that index.

The fourth line shows an index in the database that is not
in the spec; you can drop that one if you want to.
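The diff itself can be sketched in a few lines (this is an illustration, not the author's builder; all names are hypothetical):

```python
# A sketch of diffing indexes by definition rather than by name.
# An index definition is modeled as (table, ordered-column-tuple).

def diff_indexes(spec, db):
    """spec and db map index names to (table, columns) definitions.
    Returns (definitions to build, definitions safe to drop)."""
    spec_defs = set(spec.values())
    db_defs = set(db.values())
    return spec_defs - db_defs, db_defs - spec_defs

spec = {"cust_zip":    ("customers", ("zipcode", "state")),
        "items_biy":   ("items", ("bipad", "year", "issue")),
        "orders_date": ("orders", ("date_placed",))}
db   = {"cust_zip":    ("customers", ("zipcode", "state")),
        "items_old":   ("items", ("bipad", "year", "issue")),  # name differs: no action
        "orders_rep":  ("orders", ("salesrep",))}              # not in spec

to_build, may_drop = diff_indexes(spec, db)
print(to_build)   # -> {('orders', ('date_placed',))}
print(may_drop)   # -> {('orders', ('salesrep',))}
```

Notice that the index on bipad+year+issue generates no action even though its name differs between spec and database, which is exactly the behavior described above.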

The Rest of It

From here it is a simple matter to generate some commands to create
the indexes we need.

Keys work the same way, with a few obvious differences in how they
might be named.

We can add features from here to track if the columns are being
indexed in ascending or descending order.

Conclusion: Indexes Go By Definition

When writing a database upgrade "builder" program, the key thing
to understand about indexes and keys is that you are looking to
identify and build indexes according to their definition,
and that names do not matter at all.

Sunday, January 4, 2009

The number one search term that brings people to this blog is
"data dictionary." So this week I will begin a
series on how to use the data dictionary to improve your own
productivity and reduce errors.

Building And Upgrading

This week we are going to see how to use a data dictionary to
eliminate upgrade scripts (mostly) and make for more efficient upgrades.
The approach described here also works for installing a system
from scratch, so an install and an upgrade become the same
process.

The major problems with upgrade scripts are these:

They are the least-tested code in any system, and are
the most likely to break. If a script breaks and anybody
but the original programmer is running the upgrade, this
leads to aborted upgrades and upset customers.

They are horribly inefficient when a customer upgrades
after a long time: the same table may be rebuilt many times
as script after script adds a column or two.

By contrast, a dictionary-based upgrade can take any customer
from any version of your software and in the fewest steps
possible bring them completely current, with no possibility
of broken scripts.

But first, a quick review of an important idea...

Review: The Text File

Back in June of 2008 this blog featured an overview of the data dictionary. There were many
ideas in that essay, but I wish to focus on one in particular,
the question of where to put the data dictionary and in
what format.

In terms of where to put it, the data dictionary should
be just another application file, in source control, and
delivered with your code. When the dictionary is in a
plaintext file (or files) and treated like other application
code, you do not have to invent any new
methods for handling it, just use the same methods you use for
the rest of your code.

In simple practical terms, it is best if a data dictionary can
be easily read and written by both people and computers.
This leads to a plaintext file or files in some format such
as JSON or YAML. I personally prefer YAML because it is a superset
of JSON, so using YAML gives me the option to sneak in JSON syntax
later if I choose, but starting with JSON does not let me go the
other way.

Requiring easy handling by people tends to rule out XML, which is
bletcherous to work with manually (in my humble opinion). Requiring
readability by the computer rules out UML unless your UML drawing
tool can produce a usable data file (comments always welcome, tell us
your favorite tool for doing this!). When considering UML, it is
the class diagrams that are most likely to be translatable into a
data dictionary.

Finally, encoding dictionary information in program
class files technically
meets the practical requirements listed above, but it has the
disadvantage of trapping data in code, which unnecessarily
couples your dictionary to whatever language you are using at the
moment. It is much better if the dictionary sits outside of the
code as pure data. Not to mention that spreading the dictionary out
in a collection of one-class-per-table files makes it much harder
to do upgrades in the way I am about to describe.

Review of Steps

When using a dictionary-based approach, you write some type of
"builder" program that reads your dictionary file, examines the
current structure of the database, and then generates SQL commands
to alter and create tables to make them all current.

There are plenty of ways to do this. My own approach is to
load the dictionary itself into tables, pull the current state into
similar tables, and then do queries to find new and altered
tables and columns. If you want to see a full-blown program of
this type, check out androBuild.php, the Andromeda implementation of this idea.
The routines that apply to today's topic include "RealityGet()",
"Differences()", "Analyze()" and "PlanMake()".

Step 1: Load the Dictionary to RAM

To use the approach in this essay, you begin by parsing your plaintext
file and loading it to tables. Here is a simple example of what
a dictionary file might look like in YAML format:
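The original file listing is missing; a hypothetical sketch of such a file (the key names are my own guess at the format):

```yaml
table customers:
    column customer:
        type: char
        colprec: 20
        primary_key: "Y"
    column zipcode:
        type: char
        colprec: 10

table orders:
    column order:
        type: int
        primary_key: "Y"
    column customer:
        type: char
        colprec: 20
```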

If you are using PHP, you can parse this file using the
spyc program, which
converts the file into associative arrays. All or nearly all
modern languages have a YAML parser, check out the
YAML site to find yours.

Step 2: Load the Dictionary To Tables

The database you are building should have some tables that
you can use as a scratch area during the upgrade. You may
say, "The builder gives me tables, but I need tables to run
the builder, how can I do this?" The simplest way is to hardcode
the creation of these tables. A more mature solution would use
a separate dictionary file that just defines the dictionary tables.

The structure of the tables should match the data file, of course.
Here is what the YAML above would look like after being loaded
to tables:
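The original illustration is missing; the general idea is that each entry in the YAML file becomes a row in a scratch table, along the lines of this hypothetical sketch (using the TABLES_SPEC and COLUMNS_SPEC names that appear later in this essay):

```
TABLES_SPEC                COLUMNS_SPEC
table_id                   table_id   | column_id | type | colprec
---------                  -----------+-----------+------+--------
customers                  customers  | customer  | char | 20
orders                     customers  | zipcode   | char | 10
                           orders     | order     | int  |
                           orders     | customer  | char | 20
```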

Step 3: Fetch The Current State

All modern databases support the "information_schema" database
schema, a schema inside of each database that contains
tables that describe the structure of the database. While you
can make queries directly against the information_schema tables,
I prefer to fetch the information out of them into my own
tables so that all column names are consistent with my own.
A simple query to do this might look like this:
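The original query is not shown; a minimal sketch of the idea, copying the list of existing base tables into a scratch table (the scratch table name is hypothetical):

```sql
INSERT INTO tables_now (table_id)
SELECT table_name
  FROM information_schema.tables
 WHERE table_schema = 'public'
   AND table_type = 'BASE TABLE';
```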

Pulling column information out can be much more complicated, owing
to differences in how vendors implement information_schema, and
owing to the complex way data is stored in it. Here is my own code
to pull out the definitions of columns from the Postgres
information_schema, which also simplifies the definition
dramatically, to make my downstream coding easier:

Step 4: The Magic Diff

Now we can see how the magic happens. Imagine you have 20 tables
in your application, and in the past week you have modified 5 of them
and added two more. You want to upgrade your demo site, so what is
the next step for the builder?

The builder must now do a "diff" between your dictionary and the
actual state of the database, looking for:

Completely new tables.

New columns in existing tables.

Let's say you have two tables: "TABLES_SPEC", which lists the
tables in your application, and "TABLES_NOW", which lists the
tables in your database. The following query will give
you a list of new tables:

SELECT spec.table_id
  FROM TABLES_SPEC spec
 WHERE NOT EXISTS (
       SELECT table_id FROM TABLES_NOW now
        WHERE now.table_id = spec.table_id
       )

It is now a simple thing to pull the column definitions for each
table and generate some DDL to create the tables.

But we also have tables that have new columns. We can pull those
out like so:

SELECT * FROM COLUMNS_SPEC spec
-- the first subquery finds the new columns
 WHERE NOT EXISTS (
       SELECT table_id FROM COLUMNS_NOW now
        WHERE spec.table_id = now.table_id
          AND spec.column_id = now.column_id
       )
-- the second subquery makes sure we are
-- getting only existing tables
   AND EXISTS (
       SELECT table_id FROM TABLES_NOW now
        WHERE now.table_id = spec.table_id
       )

Now again it is a simple matter to generate DDL commands that
add all of the new columns into each table. Some databases
will allow multiple columns to be added in one statement, while
others will require one ALTER TABLE per new column (really horrible
when you have to do that).

Please note this is sample code only, just to give you ideas,
and it will not cover every case.

Sidebar: Avoid Destructive Actions

Do not rush into writing code that drops columns or tables that
are not in your spec. The results of a misstep can be disastrous
(as in lose your job or your customer). My own builder code is now
4 1/2 years old and I have never yet bothered to write a
"destructive" upgrade that will clean out unused tables and columns.
Maybe someday...

What I Left out: Validation

There was no space in this essay to discuss a very important
topic: validating the spec changes. It may be that a programmer
has done something nasty like change a column type from character
to integer. Most databases will fail attempting to alter the column
because they don't know how to convert the data. Your builder program
can trap these events by validating the upgrade before any
changes are made. This will be treated fully in a future essay.

More that I Left Out: Indexes, Keys...

There are many many other things you can and really must create
during the build, beginning with primary key and foreign keys,
not to mention indexes as well. These will be covered in a future
essay.

More That I Left Out: When You Still Need Scripts

There are plenty of reasons why you may still need a few
upgrade scripts, these will be discussed in a future essay.
They all come down to moving data around when table structures
change significantly.

Conclusion: One Upgrade To Rule Them All

The approach described this week for upgrading databases has
many advantages. It is first and foremost the most efficient way
to upgrade customers from any version directly to the latest
version. It is also the simplest way to handle both installations
and upgrades: they are both the same process. Putting the dictionary
file into plaintext gives you complete source control just like any
other application file, and overall you have a tight, efficient
and error-free upgrade process.
