Tuesday, February 21, 2012

Auto-generated date dimension tables

It seems that whenever I have a cross-continent flight, Mondrian
gets a new feature. This particular flight was from Florida back home
to California, and this particular feature is a time-dimension
generator.

I was on the way home from an all-hands at Pentaho's Orlando,
Florida headquarters, where new CEO Quentin Gallivan had outlined his
strategy for the company. I also got to spend time with the many smart
folks from all over the world who work for Pentaho, among them Roland Bouman, formerly an evangelist for MySQL, now with Pentaho, but still
passionately advocating for open source databases, open source
business intelligence, and above all, keeping it simple.

Roland and I got talking about how to map Mondrian onto operational
schemas. Though not designed as star schemas, some operational schemas
nevertheless have a structure that can support a cube, with a central
fact table surrounded by star or snowflake dimension tables. Often the
one thing missing is a time dimension table. Since these time
dimension tables look very much the same, how easy would it be for
Mondrian to generate them on the fly? Not that difficult, I thought,
as the captain turned off the "fasten seatbelts" sign and I opened my
laptop. Here's what I came up with.

Julian day number (0 = January 1, 4713 BC). Additional attribute 'epoch', if specified, changes the date at which the value is zero.

YYMMDD

yymmdd

Integer

120219

Decimal date with two-digit year

YYYYMMDD

yyyymmdd

Integer

20120219

Decimal date with four-digit year

DATE

the_date

Date

2012-12-31

Date literal

DAY_OF_WEEK_NAME

the_day

String

Friday

Name of day of week

MONTH_NAME

the_month

String

December

Name of month

YEAR

the_year

Integer

2012

Year

DAY_OF_MONTH

day_of_month

String

31

Day ordinal within month

WEEK_OF_YEAR

week_of_year

Integer

53

Week ordinal within year

MONTH

month_of_year

Integer

12

Month ordinal within year

QUARTER

quarter

String

Q4

Name of quarter

Suppose you wish to choose specific column names, or have more
control over how values are generated. You can do that by including a
<ColumnDefs> element within the table, and <ColumnDef>
elements within that — just like a regular <Table>
element.

The nested <Key> element makes that column valid as the
target of a link (from a foreign key in the fact table, for instance),
and also declares the column as a primary key in the CREATE TABLE
statement. This has the pleasant side-effect, on all databases I know
of, of creating an index. If you need other indexes on the generated
table, create them manually.

The <TimeDomain> element could be extended further. For instance, we
could add a locale attribute. This would allow different translations
of month and weekday names, and also support locale-specific
differences in how week-in-day and day-of-week numbers are
calculated.

Note that this functionality is checked into the mondrian-lagunitas
branch, so will only be available as part of Mondrian version 4. That
release is still pre-alpha. We recently started to regularly build the
branch using Jenkins, and you should see
the number of failing tests dropping steadily over the next weeks and
months. Already over 80% of tests pass, so it's worth downloading the
latest build to kick the tires on your application.

As a minor variant on this idea, you can also make a table that consists of one column (the_date). You can then build a view where all of the other columns are date function calls. I will admit that it amused me to have a database row that is only 4 bytes, but this is a pretty trivial advantage. By using functions, the names can take advantage of localization rules built into the database engine.

An drawback is that the syntax of the view becomes specific to a RSBMS. It also require that the fact table have a date column.

then the database will probably do a full scan of the time_by_day table. If instead the time_by_day table had a the_year column, it could do an index scan.

You might say, what's the difference -- the table only has 3000 rows for 10 years, and that's a small amount of memory on a modern system. But database query optimizers can't easily combine functions, whereas they can combine indexes.

Kimball introduced the "star schema" methodology, including time dimension tables, about 15 years ago, largely to deal with the stupidity of database query optimizers. Database query optimizers have improved a lot, but they still need all the help they can get.

If you don't believe me, try some experiments against a reasonable sized database.

In Postgresql, I can index functions of the date column, so I can index any of the columns in the view, as long as the functions are deterministic. In my test query, the query optimizer was able to do an index scan when grouped by a column in the view. As predicted, it did make an improvement in the query time (600 ms without index to 400 ms with index). I must admit that surprised me a bit, since i thought that expensive aggregate was over the fact table. But I can get the same performance with either a table or a view since I have essentially the same indexes.

I have always though of the star schema as a logical design ,more than a technique to improve performance. So thanks for that insight.

(1) Mondrian isn't smart enough to know that to find the dates between two points, it just needs to a range scan on date_id. (I can't find all customers in San Francisco by using a range of customer_id... how is Mondrian to know that the date dimension is different?)

(2) Database query optimizers and operators sometimes don't work as well on ranges as '=' predicates.

I'm not claiming that other approaches won't work. I just claim that generated date dimensions are the simplest. And then one can move onto solving the harder problems in the schema.