Data Modeling

March 05, 2009

This is the next post in a series of discussions about data warehousing best practices based on this blog entry from February 11. Today, we look at capturing history in a data warehouse. "The data warehouse is able to present a view of the business at a particular point in time and track Key Performance Indicators (KPI's) over time." There's not too much to argue here.

One of the primary functions of a data warehouse is to capture history and support point-in-time reporting and analysis. There are a couple of ways to do this, depending on the architecture and requirements. The most common method in dimensional modeling is the Type 2 Slowly Changing Dimension (SCD). A Type 2 SCD ties a transactional or fact record to its dimension attributes as they were at the time the transaction occurred. Variations on the Type 2 SCD allow us to query the transaction either as it was at the time it occurred or as it appears now. If we don't care how it looked in the past and only care how it looks now, we'll use a Type 1 SCD.
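To make the Type 2 mechanics concrete, here is a minimal sketch in Python of how a Type 2 change is typically applied: the current dimension row is expired and a new version is appended with its own surrogate key. The table layout (`surrogate_key`, `natural_key`, `valid_from`, `valid_to`, `is_current`) is a common convention, not a requirement; column names and the function itself are illustrative.

```python
from datetime import date

def apply_scd2_change(dim_rows, natural_key, new_attrs, change_date):
    """Apply a Type 2 change: expire the current row for the natural key
    and append a new version row. dim_rows is a list of dicts with keys
    surrogate_key, natural_key, attribute columns, valid_from, valid_to,
    and is_current."""
    next_sk = max((r["surrogate_key"] for r in dim_rows), default=0) + 1
    for row in dim_rows:
        if row["natural_key"] == natural_key and row["is_current"]:
            row["valid_to"] = change_date   # close out the old version
            row["is_current"] = False
    dim_rows.append({
        "surrogate_key": next_sk,
        "natural_key": natural_key,
        **new_attrs,
        "valid_from": change_date,
        "valid_to": None,                   # open-ended current row
        "is_current": True,
    })
    return next_sk
```

Facts loaded after the change pick up the new surrogate key, while older facts keep pointing at the expired version, which is what preserves the point-in-time view.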

Let's look at an example using a hotel transaction. Suppose we are looking at a transaction that occurred on 12/20/08. The transaction includes the room rate plus taxes that the guest paid, the type of payment he used, the guest's name and rewards number, and the hotel brand name and location.

Let's say on 1/1/09, the hotel company changed the brand name of that property from Village Suites to Fairview Inn. Hotel companies such as Hilton and Marriott own a number of brands, and there are occasions when they may convert one brand into another.

On 2/1/09, we run a report showing sales by brand for each month in 2008. Do we include that hotel's 2008 transactions under Fairview or Village? That probably depends on the use. If we are tracking revenue in order to calculate bonuses for the Village Suites VP, then he would certainly want us to include those transactions in his bucket (as-was). However, if we are using last year's sales to project Fairview sales for 2009, then we want to include the changed property's sales with Fairview (as-is).

In this case, we need to be able to query it both "as-is" and "as-was". A flexible architecture will allow us to do either. Much of the time, though, the "as-was" view serves no purpose. For example, suppose we have status codes and we change the description of a particular code; most of the time, we'll want to see all records associated with the new description.
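The hotel example above can be sketched with a versioned dimension. The data here (property `H100`, the 12/20/08 fact row, the column names) is hypothetical, but the two lookup paths are the general pattern: as-was follows the surrogate key captured when the fact was loaded, while as-is re-resolves the natural key to the current dimension row.

```python
from datetime import date

# Hypothetical versioned hotel dimension: two versions of property H100.
hotel_dim = [
    {"sk": 1, "hotel_id": "H100", "brand": "Village Suites",
     "valid_from": date(2000, 1, 1), "valid_to": date(2009, 1, 1),
     "is_current": False},
    {"sk": 2, "hotel_id": "H100", "brand": "Fairview Inn",
     "valid_from": date(2009, 1, 1), "valid_to": None,
     "is_current": True},
]

# A 2008 fact row, stored with the surrogate key in effect at transaction time.
fact = {"sk": 1, "hotel_id": "H100", "txn_date": date(2008, 12, 20),
        "amount": 129.00}

def brand_as_was(fact, dim):
    # As-was: follow the surrogate key captured when the fact was loaded.
    return next(r["brand"] for r in dim if r["sk"] == fact["sk"])

def brand_as_is(fact, dim):
    # As-is: ignore the historical version and look up the current row
    # for the same natural key.
    return next(r["brand"] for r in dim
                if r["hotel_id"] == fact["hotel_id"] and r["is_current"])
```

The Village Suites VP's bonus report would use the as-was path; the 2009 Fairview projection would use the as-is path, all from the same stored data.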

Determining which method we use to track history is one of the most important decisions we make in data warehouse design. It affects the flexibility of the warehouse as well as its performance, and we have to balance the two. It is also one way to derive the greatest value from the solution, since this historical information is rarely available in any consumable form anywhere else.

February 27, 2009

We are continuing a series of discussions about data warehousing best practices based on my blog entry here. This time we take on choosing the right granularity in a data warehouse. The rule of thumb, and the point I made in the original blog entry, was that you should always choose the lowest level of granularity. That makes sense in that you can always aggregate to a higher level, but you can't go back down to the detail if the data is not stored.
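The asymmetry is easy to demonstrate. In this illustrative sketch (the transaction data is made up), rolling transaction-level rows up to monthly totals is a one-liner, but once only the totals are stored, nothing in them lets you recover the per-product or per-transaction detail.

```python
from collections import defaultdict

# Hypothetical transaction-level detail: (year_month, product, amount).
transactions = [
    ("2008-11", "suite", 150.0),
    ("2008-11", "standard", 90.0),
    ("2008-12", "suite", 160.0),
    ("2008-12", "suite", 155.0),
]

def rollup_by_month(rows):
    # Aggregating up from the transaction grain is trivial...
    totals = defaultdict(float)
    for month, _product, amount in rows:
        totals[month] += amount
    return dict(totals)

monthly = rollup_by_month(transactions)
# ...but if only `monthly` had been stored, there is no way to drill
# back down to products or individual transactions: that detail is gone.
```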

I can give an example of one data warehouse that I am familiar with that has won awards and accolades for best practices. However, a closer look reveals several trade-offs that were made during design that have had far-reaching and long-lasting effects. During the design process, the decision was made to drop a lot of detail data in order to save space. This data, which many would consider essential to any data warehouse, contained transactional details and customer information. The data warehouse was built for financial use, and the finance organization was unconcerned about these details. They were looking at metrics by products and the chart of accounts.

Was this necessary? Arguably, it was. Since IT projects were funded by the business organization, the finance organization had no incentive to pay for storage or CPU required to process more data than was absolutely required. They simply made the decision to design for their own users' requirements.

However, this decision also had several long-term effects that may not have been taken into consideration. Since the warehouse did not contain details that were important to other departments, such as marketing, sales, and operations, the warehouse was rendered useless for a large segment of the company. These organizations, in turn, were forced to build their own data warehouses from the same source data in order to satisfy their users' needs. The end result was that, as a company, they spent more building and maintaining multiple data warehousing systems than they would have by investing more in the first one.

Another effect that had considerable cost was the other organizations' difficulty reconciling their numbers with the finance organization's. Since the detail source data was not maintained by finance, and finance's business rules were buried in thousands of lines of code, users in the other departments spent countless hours trying to resolve differences between the systems.

So, was the right choice to build one true enterprise data warehouse to serve the entire organization?

In theory, the enterprise data warehouse is the best solution. However, we recognize that the larger the organization, the more difficult this is to accomplish. Enterprise-level designs in $50 billion-plus companies become almost a fantasy. There have been a few documented successes, but projects of this kind take years and the investments involved run to millions of dollars.

It is important that there be a chief architect overseeing the development of all information systems. That person must have a handle on the current environment and a vision for integration and reduced costs. I have met a few of these guys who had the title and responsibility but lacked the vision. I have also met a few who were convinced that silos were simply more cost effective and maintainable. Every organization is different and what works for one might not work for another.

So back to the original consideration for this post. Should a data warehouse reflect the lowest level of granularity? In general, yes, it should, but there are a number of other factors that must be considered when making the decision. Ultimately, the answer is that the granularity of the data should reflect the lowest level that has the greatest use to the greatest number of users.

February 19, 2009

In an earlier post, Data Warehouse Basics, I reprinted something that I wrote a while back about what makes a data warehouse. In a series of entries, I am exploring each point in further detail to see if the meaning still holds up. In this post, we take on "Data is conformed".

Conformity in a data warehouse means essentially two things: conformed dimensions and conformed facts. Conformity allows data warehouses to be distributed across the enterprise and unanticipated new sources to be added. It assures that definitions and uses are consistent throughout the warehouse.

Conformed dimensions assure that users will be able to drill across data sources and arrive at conclusions that were not possible without a data warehouse. This is simply assuring that customer id 1234 for John L Smith in one fact table is the same customer id 1234/John L Smith in another fact table. This example certainly seems obvious. However, determining conformed dimensions is not always easy.
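A drill-across only works because both fact tables share the same conformed dimension key. This toy sketch (the customers, measures, and the `drill_across` helper are all invented for illustration) shows the idea: since customer id 1234 means the same John L Smith in both fact tables, measures from each can be merged into one answer set.

```python
# Hypothetical conformed customer dimension shared by two fact tables.
customers = {1234: "John L Smith", 5678: "Mary Jones"}
sales_fact = {1234: 500.0, 5678: 200.0}   # revenue by customer_id
support_fact = {1234: 3, 5678: 1}         # support calls by customer_id

def drill_across(dim, fact_a, fact_b):
    # Merge measures from both fact tables on the shared dimension key.
    # This is only meaningful because the key identifies the same
    # customer in both facts - i.e., the dimension is conformed.
    return {dim[k]: (fact_a.get(k, 0), fact_b.get(k, 0)) for k in dim}

report = drill_across(customers, sales_fact, support_fact)
```

If the two fact tables used different, unreconciled customer identifiers, no such merge would be possible without a mapping exercise first.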

Two departments may view the same information very differently. This is often the case between finance and sales or marketing. Finance typically views things as they are tracked in the accounting system, while marketing or sales may view them entirely differently. For example, when tracking sales by location hierarchy, finance may have a hierarchy based on legal entities. Marketing, however, may not make any distinction between those legal entities and may be more concerned with brand or geography. These cases present challenges to modelers and to arriving at conformed dimensions.

Conformed facts ensure that the terminology for measures in one fact table means the same thing in another. Revenue always means revenue, right? Well, again, finance's definition of revenue may be different from marketing's. Once again, we have to ensure that our terminology is consistent. Doing so will allow us to derive meaningful results across fact tables.

The concepts of conformed dimensions and facts are the cornerstone of the dimensional methodology. Without them, the data warehouse becomes a database of siloed data marts.

I extend the concept of conformity to include data naming standards. Consistent naming schemes make the database understandable and usable. So perhaps we should call this one the conformity and consistency of a data warehouse.

February 11, 2009

A year and a half ago, I posted on ITtoolbox about the basics of data warehousing. That post got a few interesting comments, even one as recent as last week. I thought I would re-post it here and then further explore the topic in subsequent posts. ...

A friend of mine recently asked me to help her develop some ideas for explaining data warehousing to a group of business users. It seems there was some confusion in her company regarding why their so-called data warehouse was really just a data dumping ground and not a true best-practices data warehouse. The list below details what I think separates a data warehouse from an archive database, data mart, ODS, etc.

The data warehouse data model should be designed for the following:

1) Data is comprehensive - Data is captured and consolidated from multiple systems.

2) Data is conformed - This is the famous line "single version of the truth". Data elements are conformed so that the definitions of "customer" or "revenue" mean the same thing no matter which system they originated in. Tables are conformed when they can be queried across dimensions and facts without changing the meaning of the results. This is what is needed to truly integrate data in the warehouse.

3) Data is granular - Ideally, we capture and store data at its lowest level of granularity. You can always aggregate up, but you can't drill down if the data isn't stored that way.

4) Data is historical - The data warehouse is able to present a view of the business at a particular point in time and track Key Performance Indicators (KPI's) over time.

5) Data is shared - A data warehouse that cannot be queried or otherwise accessed has little value.

December 10, 2008

I saw this note that CA is going to integrate the Exeros data profiling solution with ERwin Data Modeler. This is an interesting idea. The ETL vendors have been doing this over the past few years, but it makes sense that a data modeler have access to this capability as well. I find it interesting that the press release calls out MDM in particular. MDM projects are deeply dependent on quality data. This responsibility usually falls on the integration specialist. However, if the modeler can better understand the nuances of the source data ahead of time, they can avoid a lot of headaches during implementation.

I have always done profiling as part of my modeling activities. However, I have seen a number of others who didn't.