Sunday, January 18, 2009

The Data Dictionary and Calculations, Part 1

The stunning power of a data dictionary comes into play once
the dictionary contains formulas for calculated values.
The dictionary can then be used to generate code, and also
to generate documentation. This double-win is not available
without the calculations because the resulting docs and
database would be incomplete, requiring tedious and
error-prone manual completion.

There are links to related essays on normalization and denormalization at the bottom of this post.

Calculations and Normalization

Before I begin, I will point out that all calculated values
stored in a database are denormalizing, they all
introduce redundancies. This does not mean they are bad,
it just means you need a way to make sure they stay
correct (see Keeping Denormalized Values Correct, also see Normalization and Automation). If you cannot
keep them correct, they will get very bad very fast. This essay
will show you one approach to ensuring calculated values
are always correct.

However, before I start, I have to point out how important
it is to begin by normalizing your database (to at least 3NF)
and adding calculations only upon the strong foundation
of a normalized database. If you do not normalize first,
you will discover that it is impossible to work up formulas
that make any sense -- values will always seem to be not quite
where you need them, and it will always seem you need one more
kind of calculation to support, and it will be very difficult
to write the code generator that gives strong results.
But if you build on a normalized database, it turns out you
only need a few features in your dictionary and your code
generator.

Use Denormalization Patterns

Once you have normalized your database, you will find that
your calculations all fall into three basic categories
(detailed in April 2008 in Denormalization Patterns). These three patterns are:

FETCH operations, like copying an item's price from
the ITEMS table to the ORDERLINES table.

EXTEND operations, which are calculations within a row,
such as assigning EXTENDED_PRICE the value of QUANTITY * PRICE.

AGGREGATE operations, like a SUM of the lines of an order
to the order header.

This week we will look at the first type of operations,
the FETCH operations.

Putting the FETCH Into Your Data Dictionary

So we have an ORDERLINES table, and it contains a PRICE
column, and the value of that column should be copied from
the ITEMS table. This is an extremely common operation in
most database applications, so we decide it would be really
cool if we could specify that in the data dictionary and have
the code generator take care of it. This would chop a lot of
labor off the development process.

Here is how a column like this would appear in my own
dictionary format:

This looks nice, I have put the formula for the PRICE column into
the data dictionary. Now of course I need that formula to get
out into the application somehow so that it will always be
executed and will never be violated. We will now see how to
do that.

The Trigger Approach

When it comes to code generators, if there are ten programmers
in a room, there are going to be at least 10 opinions
on how to write and use a code generator (the non-programmer boss
will also have an opinion, so that makes 11). I have no interest
in bashing anybody's approach or trying to list all of the
possibilities, so I will stick with the approach I use myself,
which is to generate database trigger code. If you want to know
why that approach works for me, check out Triggers, Encapsulation and Composition.

When I work on code generators, I begin by manually coding an
example of what I'm getting at, so I know it works. The trigger
snippet we are looking for must do two things. It must make sure
the price is always copied, and it must make sure that no user
can subvert the value. This snippet (which is in the PostgreSQL
flavor of server-side SQL) does this on an insert:

-- top of trigger....
-- PART 1: Prevent users from subverting the
-- the formula by throwing error if they
-- try to supply a value:
IF new.price IS NOT NULL THEN
ErrorCount = ErrorCount + 1;
ErrorList = ErrorList || 'price,5001,may not be explicitly assigned;';
END IF;
-- PART 2: If the value of SKU exists, use it to look
-- up the price and copy it into the new row
IF new.sku IS NOT NULL THEN
SELECT INTO new.price par.price
FROM items par
WHERE new.sku = par.sku ;
END IF;
-- more trigger stuff

NOTE! You may notice my trigger code somehow seems to "know" to
use the SKU column when searching the ITEMS table, yet my formula
did not specify that. I am assuming your data dictionary contains
definitions of primary keys and foreign keys, otherwise it is of
no real use. I am further assuming that when I see the formula
to "FETCH" from the ITEMS table, I can look up the foreign key that
matches ORDERLINES to ITEMS and find out what column(s) to use.

The example above works on INSERT operations only. You need
a slightly different version for updates, which throws an error
if the user attempts to change the price, and which does a new
FETCH if the user has changed the SKU value.

Sidebar: A Complete Trigger

If you want a teaser on how many amazing things the trigger can
do once you've loaded up your dictionary and builder with features,
here is a bit of code from a demo application. Most everything in
it will get treated in this series on the data dictionary.

Variatons on FETCH

I have found two variations on FETCH that have proven very
useful in real world applications.

The first I call DISTRIBUTE. It is dangerous because it can be
a real performance killer, and turns out you very rarely need it.
However, that being said, sometimes you want to copy a value from
a parent table down to every row in a child table when the value
changes in the parent. The first time I did this was to copy the
final score from a GAMES table into a WAGERS table on a fake
sports betting site.

The other variation I have found useful is FETCHDEF, my shorthand
for "fetch by default." In this variation the user is free to
supply a value of their own, but if they do not supply a value then
it will be fetched for them.

The Code Generator Itself

As for writing the code generator itself, that is of course far more
than I can cover in one blog entry or even 10. Morever, since
anybody who decides to do so will do so in their own language
and in their own style, there is little to be gained by showing
code examples here.

Conclusion: Expand Your Dictionary!

If you make up a data dictionary that only contains structure
information like columns and keys, and you write a builder program
to build your database, you can get a big win on upgrades and
installs. However, you can take that win much farther by adding
calculated values to your database and expanding your builder
to write trigger code. This week we have seen what it looks like
to implement a FETCH calculation in your dictionary and what the
resulting trigger code might look like.

3 comments:

Anonymous
said...

It would be helpful if the author provides a working example of the logic above for a popular database such as Northwind (MS SQL Server) in the corresponding flavour of the SQL language. Otherwise it sounds a bit academic and ... not very clear.

Pages

In the Top 200 Tech Blogs!

In 2008 Datamation listed The Database Programmer as one of the top 200 tech blogs, on the same page with James Gosling (!), Coding Horror, Dzone, and the author's own personal hero, Dave Thomas (if you haven't bought "The Pragmatic Programmer", go online and buy it now):