The Set Theory of Data Warehouse Design

I love volleyball as much as I love data. However, as this is a blog about data, I’ll leave the theories I have related to setting the ball at the door and move forward on the topic of data. Believe it or not, you learned the basics of how a data warehouse works way back in Algebra I (probably around the same time you were also learning volleyball in gym). Perhaps you recall being tortured by an SAT or exam question that looked like the following:

A = {1, 4, 5, 7, 9}

B = {1, 6, 7, 8}

Solve for the Union of sets A and B. (Answer at the end of the article!)

The basic premise of a star schema states that wide, detail oriented dimension tables will have a series of 1 to many join to a long narrow fact table. These tables should be set as inner joins, meaning that only the results which will return on a query are those which exist on each and every table used in the query. This is essentially the intersection of multiple sets. For example, when I buy a gallon of milk at the grocery store, this is an intersection of the Customer dimension (me), the Store dimension (Store number, address, etc), the Product dimension (Milk, gallon size, brand name), the Time dimension (Date, Time of purchase) – all of which are joined via the fact record, which records the keys to join to each of these dimensions and records other numeric attributes relevant at the moment of the transaction, such as price, transaction ID, and tax paid.

To take it back to Algebra I, the fact transaction is the Intersection of the dimensional set attributes. The fact holds the transaction, or transactions, where all of the dimensional data is related. To relate the example above into set math:

Fact = Customer ∩ Store ∩ Product ∩ Time

The reason for this being an intersection rather than another type of set operation is because a dimensional model is designed with a series of inner joins for performance. As a result, you will only see fact records associated with an equivalent dimension record you have chosen to display and/or filter on.

Now, let’s say you wish to do some additional analysis on my purchases over time. When you look at the Store dimension, you may see attributes that allow you to build out a hierarchy via a series of one or more group by statements. For example, you may wish to insert Chain, State, and Store Number. This will build a natural hierarchy where each value represents a 1 to 1 or 1 to many relationship. A nationwide chain has many states, and each state can have many stores. Inversely, a mom and pop may only have one store, in which case it only has a single state and chain name.

Going back to your Algebra I textbook, this is represented as as series of Subsets:

Store Dimension Hierarchy = Store Number ⊆ State ⊆ Chain

Note that in the above, the subset notation has a half equal sign beneath it. To refresh your memory, this denotes that it is possible for the subset to exactly equal its parent, similar to a less than or equal to symbol. The mom and pop example above would be the use case for when this equivalency occurs.

As discussed previously on this blog, you will join to the fact at the lowest level of this hierarchy. What you may not have realized previously is that this join is just a continuation of the subset listed above, as each fact record is a subset of the corresponding dimensional records. This is why you will see joins in a star schema represented with a 1 to N join from the fact to the dimension table.

If you are using a BI tool to model your query, such as Cognos Framework Manager, then the query optimizer will maximize your query efficiency – even in a snowflake design – as long as you continue to model the snowflake dimensions in a 1 to N hierarchy design. Knowing that the records from dimension A are a subset of dimension B allows the tool to understand and build the most efficient SQL it can, as it recognizes that from a data structure standpoint that the table relationship in the snowflake is no different than executing a series of group by statements on a single dimension table. To be clear, in my experience, you are usually better off incorporating both tables in this case into a single dimension, but the difference can sometimes be minor enough to allow for the snowflake due to other reasons, such as ETL concerns or security.

Remember our original question?

A = {1, 4, 5, 7, 9}

B = {1, 6, 7, 8}

Solve for the Union of sets A and B.

Unionsrepresent the combination of each unique result within multiple sets. Going back to Algebra I, the union of the above sets would be represented as follows:

A ∪ B

Per our aforementioned definition of a union, the resulting set would be:

{1, 4, 5, 6, 7, 8, 9}

In a warehouse, this can be represented with distinct result sets unioned together in a reporting tool, but most often, it’s represented with an “IN” filter. The “IN” filter is essentially building a set of values to subsequently filter your query on. For example, if set A were a series of Store IDs, the results of a query run against this model would show the same result as the following :

Here’s where things get interesting. If you’ve ever worked with SQL, you have likely encountered the evil monster called a cartesian product. As a refresher in Algebra I terms, cartesian products are the product of two potentially unrelated sets of data represented by a series or ordered pairs. If we take the same two data sets in the Union example and determine the result of the cartesian product of these sets, you will get the following:

Note that this is represented by a product symbol notation because the results will be an equivalent number of ordered pairs as if you were to simply multiple the counts from each set (in this case, 5 values in set A multiplied times 4 values from set B equals 20 unique pairs).

When querying multiple dimensions in a star schema, there could potentially be one or more records on the fact table for each ordered pair within the cartesian product set. However, by joining to the fact table, you ensure that potential dimensional cartesians without an associated fact record will be eliminated. This is due to our first example, the intersection, placing a threshhold filter on the ordered pairs from the dimensions for those dimensional records where an associated fact record does not exist for both values in the ordered pair. The key to this performing well is the fact table having the associated join keys for the desired grain of the dimension table. In other words, the query optimizer can look up the key values from set A and the key values from set B, and build the result set by using those keys as filters on the fact table. It can then use that result set to match back to the dimensions and populate the large text fields desired in the query results.

Hi James,
Why should we set the dimensional tables as inner joins. What if I want records from one dimensional table regardless of if it is in another dimension. E.g. I want to see all the stores in the store dimension regardless of if I made a purchase at that store or not?

Great question! Each dimension can be viewed independently of the other tables. For example, if you wish simply list your stores, you can just query the store dimension. When you use an outer join on one of these dimension tables, you will create a cross-product query. This in turn will cause the automatic rollup of your fact records to erroneously aggregate once per every outer joined dimension record. Remember, your fact table is built to analyze a specific type of transaction. If the data doesn’t include that transaction type, then there is no way to conclusively correlate the dimensions. Well, there’s one way I can think of, but that’ll be in a future article.