zondag 4 maart 2012

Four different datamodeling methods

Introduction
Most of the time, 3NF are datamodels that are used in transactional systems like an orderentry application. For datawarehousing we had until a couple of years ago the (historized) 3NF and the starschema. Since a couple of years there are some new kids in the datawarehouse town: Datavault modeling and Anchor modeling. These datamodels try to deal with the disadvantages of 3NF and starschemas.

We are starting to learn that the starschema is more an end user datamodel that is easily understood by end users. It's a high performance model. Datavault modeling and Anchor modeling are more focussed on storing the right information in such a way that you achieve agility, speed of development and auditability. There are more reasons but i don't want to elaborate too much.

At first i wanted to write down a successor of my first blogpost about Anchormodeling but soon the blogpost became more and more an overview of the four types of data modeling techniques. I hope to write more about anchormodeling in next blogposts.

In this exercise i'll use the 3NF datamodel of the AdventureWorks2008LT Database . This is a simplified database of the large SQL Server AdventureWorks 2008 database. You can find this LT database on the site Microsoft codeplex. There isn't a SQL Server 2012 version available yet, as far as i know. I'll transform this 3NF datamodel into a Datavault model (As i've blogged earlier), a anchor model and a starschema model.

Adventureworks2008LT (3NF)
Below a representation of the 3NF datamodel as i've analyzed from the Adventureworks2008LT database. There are a couple of transaction tables (SalesOrderHeader and SalesOrderdetail) present and some master data tables (Customer, Productdescription, productcategory, etc). At first sight there seems no reference data in the diagram.

Adventureworks2008LT (3NF) in more detail
Below the database in more detail. I've decided to leave the errorlog and buildversion out of the models.

The Datavault model
As mentioned in my earlier blogpost i've used Quipu as a tool to generate the datavault model from the Adventureworks2008LT database. This tool can be downloaded, for free. The datavault model that is generated by Quipu is a typically source datavault model. You can find more information about this discussion at the blog of Ronald Damhof.

Now, the model that is generated is shown in the diagram below. The blue tables are the links, the orange tables are hubs and the yellow ones are the satellites. I've organized the diagram in a way that it looks like as much as possible as the Adventureworks2008LT 3NF model.

At first sight, you'll see that the number of tables are more than the 3NF model. That is because the relations have become a table (link), the busineskeys have become a seperate table (hub) and the descriptive fields are stored in another table (sat). In my opinion the main advantage of this model is that you can increase agility and speed of development by separating the relations, business keys and the descriptive fields. Auditability seems to me a derived advantage of the agility argument: a historized 3NF datawarehouse datamodel is also auditable. But because the datavault model is more flexible that the 3NF model you can achieve the auditabilty much easier and faster than the historized 3NF datamodel.

The Anchor model
Below you can see a first draft of the Anchor datamodel. In my former post i've briefly explained a piece of the anchor modeling technique. I'm currently investigating this modeling technique for usage in my projects. It seems an interesting technique. The agility is pushed to the extreme because every field in a source table is a table in the anchor model. You can imagine that this enhances maximum flexibility.

The Adventureworks2008LT 3NF model is exploded into a huge number of tables in the anchor model. A good naming convention is very very appropriate. Even better, a tool should be available to manage the tables. In a future post i'll show the tables that are created and i can tell you now that the 9 tables in the 3NF model are about 100 tables in the anchor model (depending of the number of attributes in a table in the source model).

The diagram is created on the site http://www.anchormodeling.com/ and although it's a great solution and it looks great there are some disadvantages. It took me quite some time to create the datamodel.

The starschema
Below an example is shown of a starschema based on the Adventureworks2008LT 3NF datamodel. There are a couple of directions possible depending on the business process you're supporting. I've chosen to model it like this but you could also combine the SalesOrderHeader and the SalesOrderDetail in one fact. Or you could remove the snowflake table Category and move it to the fact. So depending on your (future?) business question you determine the starschema model.

In this example i've decided to split the SalesOrderHeader and SalesOrderDetail table in two facts. The SalesOrderHeaderDD is the link between the two facts. There are a couple of difficult relations in the AdventureWorks2008LT 3NF model:

The relation product-Category.

The n:m relation between the productmodel and the Productdescription entity.

The relation between customer and address (CustomerAddress).

I've decided not to model the Customeraddress in the model. I've also decided to model the other difficult relations in the same way as the 3NF datamodel, resulting in a snowflake relation and a bridge table solution. Off course there are other decisions possible.

Conclusion
First, this blogpost started in my mind with a successor of the anchormodeling blogpost but later on i've decided to rewrite the blogpost to a overview blogpost of the four datamodels i'm interested in. I hope it gives an overview of the models.

So in my opinion we have now the following options to model a datawarehouse:

Historized 3NF model.

Datavault model.

Anchormodel.

Starschema.

From these options it seems that for a data storage foundation the best solution is datavault model or the anchor model. For enduser reporting and cube design the starmodel seems the most optimal solution. The main problem of the historized 3NF model is that the cascading changes that can happen and therefore this model is not the best option (in my opinion).

8 opmerkingen:

Hi Hennie,Nice overview. I was wondering: Why don't you mention the historical staging area? Perhaps you don't look at it as a data modeling method? It's of course not much more than a historized copy of your datasources. Anyhow, in some situations this 'modeling method' is applied as well.

Hi Richard,Thanks for your comments. In my opinion, the historical staging area is an architecture component in your datawarehouse architecture. You can use the 3NF, Datavault and Anchor modeling as a source history tracker (in the historical staging area). A star schema is less suitable. Greetz,Hennie

Goods book are: Datawarehouse toolkit of Kimball: concepts of dimensional modeling are also feasable in Datavault. Supercharge your datawarehouse of Dan Linstedt, Modeling the agile datawarehouse with datavault. Take a look at temporal datamodeling of snowgrass for your bitemporal skills.. Currently reading the integrated data hub of Dario manggano but no opinion yet about this one. Good luck....

@Frank: In my opinion there are only reasonable two modeling methods : Starmodeling and Anchor like modeling (Anchor modeling and Datavault) for your datawarehouse. 3NF has been proven difficult we have learned in the past.. I know there are more methods but these do not have a common firm ground in the dwh scene.

Richard has mentioned HSA as a alternative and I do think that this is a alternative for a Source Vault approach ( As in Source Vault / Business vault).

I’m delighted to give you some pointers – but please do bear in mind that any consultant worth his salt would look at the specific issues in a given company before suggesting changes to its data warehouse systems, and I know nothing about your company.