Top 10 Best Practices for Building a Large Scale Relational Data Warehouse

As of September 1, 2013 we decided to retire the SQLCAT.COM site and use MSDN as the primary vehicle to post new SQL Server content. This was done to minimize reader confusion and to streamline content publication. The MSDN SQLCAT blogs already include most SQLCAT.COM content and will continue to be updated with more SQLCAT learnings. You can also find a collection of our work in the SQLCAT Guidance eBooks.

To make the transition to MSDN smoother, we are in the process of reposting a few of SQLCAT.COM's blogs that are still actively viewed. The following is a repost of one of the SQLCAT.COM blogs that still draws strong interest. You can find this and other guidance in the SQLCAT's Guide to Relational Engine eBook.

Also, follow us on Twitter, as we normally use our Twitter handles @MSSQLCAT and @MSAzureCAT to announce news and new content.

Top 10 Best Practices for Building a Large Scale Relational Data Warehouse

Building a large scale relational data warehouse is a complex task. This article describes some design techniques that can help in architecting an efficient large scale relational data warehouse with SQL Server. Most large scale data warehouses use table and index partitioning, and therefore, many of the recommendations here involve partitioning. Most of these tips are based on experiences building large data warehouses on SQL Server 2005.

1 - Consider partitioning large fact tables

Consider partitioning fact tables that are 50 to 100GB or larger.

Partitioning can provide manageability and often performance benefits.

Faster, more granular index maintenance.

More flexible backup / restore options.

Faster data loading and deleting.

Faster queries when restricted to a single partition.

Typically partition the fact table on the date key.

Enables sliding window.

Enables partition elimination.
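As a minimal sketch of this pattern, the following T-SQL creates a monthly partition function and scheme on an integer yyyymmdd date key and builds a fact table on them. The names (pfFactDate, psFactDate, FactSales), the boundary values, and the single-filegroup layout are all illustrative assumptions, not part of the original article:

```sql
-- Hypothetical monthly RANGE RIGHT partition function on an integer
-- yyyymmdd date key; each boundary value starts a new month's partition.
CREATE PARTITION FUNCTION pfFactDate (int)
AS RANGE RIGHT FOR VALUES (20050101, 20050201, 20050301, 20050401);

-- Map every partition to one filegroup for simplicity; production designs
-- often spread partitions across filegroups for backup / restore flexibility.
CREATE PARTITION SCHEME psFactDate
AS PARTITION pfFactDate ALL TO ([PRIMARY]);

-- Fact table created directly on the partition scheme, partitioned by DateKey.
CREATE TABLE dbo.FactSales
(
    DateKey     int   NOT NULL,  -- partitioning key (yyyymmdd)
    ProductKey  int   NOT NULL,
    StoreKey    int   NOT NULL,
    SalesAmount money NOT NULL
) ON psFactDate (DateKey);
```

With RANGE RIGHT, each boundary value belongs to the partition on its right, which keeps a whole calendar month together in one partition.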

2 - Build a clustered index on the date key of the fact table

This supports efficient queries to populate cubes or retrieve a historical data slice.

If you load data in a batch window then use the options ALLOW_ROW_LOCKS = OFF and ALLOW_PAGE_LOCKS = OFF for the clustered index on the fact table. This helps speed up table scan operations during query time and helps avoid excessive locking activity during large updates.

Build nonclustered indexes on each foreign key. This helps 'pinpoint' queries extract rows based on a selective dimension predicate.

Use filegroups for administration requirements such as backup / restore, partial database availability, and so on.
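The indexing advice above can be sketched as follows. This assumes the hypothetical FactSales table and psFactDate partition scheme used for illustration; the index names are also made up:

```sql
-- Clustered index on the date key, partition-aligned with the fact table.
-- Disabling row and page locks speeds up scans and avoids excessive locking
-- during large updates; this assumes data is loaded in a batch window.
CREATE CLUSTERED INDEX cix_FactSales_DateKey
ON dbo.FactSales (DateKey)
WITH (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF)
ON psFactDate (DateKey);

-- One nonclustered index per foreign key to support selective
-- dimension-predicate ("pinpoint") queries.
CREATE NONCLUSTERED INDEX ncix_FactSales_ProductKey
    ON dbo.FactSales (ProductKey);
CREATE NONCLUSTERED INDEX ncix_FactSales_StoreKey
    ON dbo.FactSales (StoreKey);
```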

3 - Choose partition grain carefully

Most customers use month, quarter, or year.

For efficient deletes, you must delete one full partition at a time.

It is faster to load a complete partition at a time.

Daily partitions for daily loads may be an attractive option.

However, keep in mind that in SQL Server 2005 a table can have a maximum of 1,000 partitions (later releases, starting with SQL Server 2008 SP2, raised this limit to 15,000).

Partition grain affects query parallelism.

For SQL Server 2005:

Queries touching a single partition can parallelize up to MAXDOP (maximum degree of parallelism).

Queries touching multiple partitions use one thread per partition up to MAXDOP.

For SQL Server 2008:

Parallel threads up to MAXDOP are distributed proportionally to scan partitions, and multiple threads per partition may be used even when several partitions must be scanned.

If you need MAXDOP parallelism (assuming MAXDOP = 4 or larger), avoid a partition design where frequent queries touch only 2 or 3 partitions.

4 - Design dimension tables appropriately

Use integer surrogate keys for all dimensions, other than the Date dimension. Use the smallest possible integer type for the dimension surrogate keys. This helps keep the fact table narrow.

This makes it easy to write queries that put a WHERE clause on the date key column, which allows partition elimination on the fact table.

Build a clustered index on the surrogate key for each dimension table, and build a non-clustered index on the Business Key (potentially combined with a row-effective-date) to support surrogate key lookups during loads.

Avoid enforcing foreign key relationships between the fact and the dimension tables, to allow faster data loads. You can create foreign key constraints with NOCHECK to document the relationships, but don't enforce them. Ensure data integrity through Transform Lookups, or perform the data integrity checks at the source of the data.
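A minimal sketch of such a dimension table follows. The table and column names are hypothetical, and the fact table referenced by the unenforced constraint is the illustrative FactSales used earlier:

```sql
-- Hypothetical product dimension: smallest integer type that fits the
-- expected row count for the surrogate key.
CREATE TABLE dbo.DimProduct
(
    ProductKey    smallint      NOT NULL,  -- surrogate key
    ProductCode   varchar(20)   NOT NULL,  -- business key from the source system
    EffectiveDate datetime      NOT NULL,  -- row-effective-date for key lookups
    ProductName   nvarchar(100) NOT NULL
);

-- Clustered on the surrogate key; nonclustered on the business key plus
-- row-effective-date to support surrogate key lookups during loads.
CREATE CLUSTERED INDEX cix_DimProduct ON dbo.DimProduct (ProductKey);
CREATE NONCLUSTERED INDEX ncix_DimProduct_BK
    ON dbo.DimProduct (ProductCode, EffectiveDate);

-- Document the relationship without enforcing it: WITH NOCHECK skips
-- validation of existing rows, and NOCHECK CONSTRAINT disables checking
-- on future loads.
ALTER TABLE dbo.FactSales WITH NOCHECK
    ADD CONSTRAINT fk_FactSales_Product
    FOREIGN KEY (ProductKey) REFERENCES dbo.DimProduct (ProductKey);
ALTER TABLE dbo.FactSales NOCHECK CONSTRAINT fk_FactSales_Product;
```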

5 - Write effective queries for partition elimination

Whenever possible, place a query predicate (WHERE condition) directly on the partitioning key (Date dimension key) of the fact table.
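For example, against the illustrative FactSales table partitioned on an integer yyyymmdd DateKey, a range predicate directly on the partitioning column lets the optimizer skip every partition outside the range:

```sql
-- The literal range on DateKey enables partition elimination: only the
-- partition(s) holding March 2005 are scanned.
SELECT f.ProductKey, SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales AS f
WHERE f.DateKey >= 20050301
  AND f.DateKey <  20050401
GROUP BY f.ProductKey;
```

Wrapping DateKey in a function or joining through the dimension without a predicate on the fact table's key would defeat this elimination.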

6 - Use Sliding Window technique to maintain data

Always keep empty partitions at both ends of the partition range to guarantee that the partition split (before loading new data) and partition merge (after unloading old data) do not incur any data movement.

Avoid split or merge of populated partitions. Splitting or merging populated partitions can be extremely inefficient, as this may cause as much as 4 times more log generation, and also cause severe locking.

Create the load staging table in the same filegroup as the partition you are loading.

Create the unload staging table in the same filegroup as the partition you are deleting.

It is fastest to load the newest full partition in one operation, but this is only possible when the partition size equals the data load frequency (for example, you have one partition per day, and you load data once per day).
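The sliding window described above can be sketched with the illustrative pfFactDate / psFactDate / FactSales objects used earlier. Partition numbers and staging table names are assumptions; in practice the staging tables must match the fact table's schema and indexes, sit on the target partition's filegroup, and (for the load side) carry a check constraint limiting rows to the new partition's range:

```sql
-- Add a new month: designate the filegroup for the next partition, then
-- split an EMPTY boundary at the end of the range (no data movement).
ALTER PARTITION SCHEME psFactDate NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfFactDate() SPLIT RANGE (20050501);

-- Switch the fully loaded, identically indexed staging table into the new
-- partition (a metadata-only operation). Partition numbers are illustrative.
ALTER TABLE dbo.Stage_FactSales SWITCH TO dbo.FactSales PARTITION 6;

-- Age out the oldest month: switch it out to an unload staging table,
-- then merge the now-empty boundary at the start of the range.
ALTER TABLE dbo.FactSales SWITCH PARTITION 2 TO dbo.Unload_FactSales;
ALTER PARTITION FUNCTION pfFactDate() MERGE RANGE (20050101);
```

Because both the split and the merge touch only empty partitions, and SWITCH is a metadata change, the whole cycle avoids the heavy logging and locking described above.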
