Thursday, March 12, 2009

How To Populate Your TIME Dimension

Yesterday I wrote about the Datawarehousing TIMES table. Today I populated it. I modified it for my own purposes naturally, specifically removing the fiscal components (thankfully calendar year equals fiscal year). I think that saved me days of figuring out how to calculate certain fields.

I remember the first time trying to do this took about 20 different SQL statements. I generated the key (date, time_id) and then calculate that either loop or perform another SQL statement using specific dates out of the table (last day of the month for instance). This time I managed to do it in a single SQL statement. Again, I left out some of the more complicated calculations since I did not have to worry about fiscal requirements.

NEXT_DAY returns the date of the first weekday named by char that is later than the date date. The return type is always DATE, regardless of the datatype of date. The argument char must be a day of the week in the date language of your session, either the full name or the abbreviation. The minimum number of letters required is the number of letters in the abbreviated version. Any characters immediately following the valid abbreviation are ignored. The return value has the same hours, minutes, and seconds component as the argument date.

If I can add my two cents based on our 80TB data warehouse and one single period table ...

1. I would recommend to add all date periods (not just days) but also MONTHS, QUARTERS, YEARS, ... and corresponding PERIOD_TYPE_CD which would be DAY / MONTH / QUARTER / YEAR ... this way your period table really covers all periods and can be used in all snapshot fact tables even e.g. monthly snapshot tables (for example PERIOD_KEY = 111 would correspond to 201004 - whole month).

2. in general, I always felt that PERIOD table is nice to have for anything I can predict (e.g. is next day Monday? type questions), it is critical for non-predictable information e.g. Federal Holidays, Retail Calendar, "is my servicer sending data today", ... that's where you get real interest from user community and where people stop asking why you replaced all dates with key.

3. small comment about name (I know I am too picky), I would not call it TIME dimension but PERIOD or something like that. It really tracks days, not time (hours, minutes, seconds, ...)

@Chet: great stuff... nice starting point for anyone that needs to do this.

@Jiri:

You said: "I would recommend to add all date periods (not just days) but also MONTHS, QUARTERS, YEARS, ... and corresponding PERIOD_TYPE_CD which would be DAY / MONTH / QUARTER / YEAR ... this way your period table really covers all periods and can be used in all snapshot fact tables even e.g. monthly snapshot tables (for example PERIOD_KEY = 111 would correspond to 201004 - whole month)."

I couldn't disagree more. The number one rule in dimensional modeling is to never mix grains in the same table. If you really need a "shrunken dimension"... then create a new dimension table called, called MONTH_DIM for instance, that is not at the grain of an individual day, but instead at the grain of a month. This is also easy to do with a view over the daily dimension table. That's also why I think the date dimension table should always be called DATE_DIM. A dimension table name should always specify the grain of the table.

If you go with your PERIOD_TYPE approach... every single join of a fact table to the PERIOD table would require a filter on PERIOD table. Why do that? It just begs for a user to make a mistake and overallocate. Keep it simple... use another table or view.

However... if you use Oracle, why go through the process of building aggregate tables at all? The database will do this for you with fairly little effort. Search for "oracle query rewrite" and see where that takes you.