Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community. It's 100% free, no registration required.

The dominant topologies of Data Warehouse modelling (Star, Snowflake) are designed with one-to-many relationships in mind. Query readability, performance, and structure degrades severely when faced with a many-to-many relationship in these modelling schemes.

What are some ways to implement a many-to-many relationship between dimensions or between the fact table and a dimension in a data warehouse and what compromises do they inflict with regards to necessary granularity and query performance?

You need to state your question more clearly. This is possibly why no one has answered it since the 4th. What you state in response to my answer is not the same as your original question.
–
IanCJan 7 '11 at 7:00

5 Answers
5

In my experience, a recursive hierarchy is the most practical way of tackling this. It offers the following advantages:

Unlimited depth.

Compactness.

Flexibility.

Speed.

By contrast, it takes an extra table for each level of "-to-many" joins. This is hard coded and difficult to maintain against schema updates.

By using filtered indexes, a large table of hierarchical joins can perform at superior speed to dedicated tables. The reason is each join is only "parent-child" compared to "to join table to data table". The latter has more indexes to process and store.

I've been trying to solve this problem for many years. Recently, this is what I came up with.

You asked "What are some ways of modelling these many-to-many and what are their performance and granularity implications?". I answered on modelling. No need to down-vote.
–
IanCJan 7 '11 at 6:58

4

You need to provide more data on what you need. I overcame the exact problem you've stated via a recursive hierarchy. But, without knowing something about your data and its connections, it's very difficult to answer.
–
IanCJan 7 '11 at 7:01

2

Yes, they don't natively model this. What would be wrong with adding one more table & join, thus achieving the many-to-many? In an RDBMS, no matter how you structure your tables, you're going to have 2 joins for a many-to-many. There's no shortcut. The only possible exception is arrays in PostgreSQL or Caché/M.
–
IanCJan 7 '11 at 7:06

1

(A recursive hierarchy is a good idea, actually.) One way I solved the problem was precomputing the list of possible many-to-many relationships within a dimension, referencing that to a normal dimension table, and then joining the fact table to that summarized dimension table. Your answer of "recursive hierarchy" is another useful design answer. I'm wondering if there has been any research on the performance implications of these various hacks?
–
Brian Ballsun-StantonJan 7 '11 at 7:10

3

@Brian don't forget the votes for useful answers. It helps create community. To answer your question, I haven't come across any research on these hacks, except "what is faster: a recursive CTE or a manual tree build?". You prior stated solution makes good sense. I would combine it with an indexed view, which of course assure you always have the correct pre-populated relationship map.
–
IanCJan 7 '11 at 7:20

Most OLAP servers and ROLAP systems have a means to deal with M:M data structures now, but there are some caveats about this that you will need to pay attention to. If you do implement M:M relationships you will need to keep an eye on your reporting layer and what tools you want to support.

Scenario 1: M:M dimension onto a fact table

An example of this might be multiple drivers on a motor policy. If you add or remove a driver, the policy adjustment transaction may have a relationship to a list of drivers that changes with the adjustment.

Option 1 - M:M driver-fact bridge table
This will have quite a large volume of data, as it has drivers x transactions rows for a given policy. SSAS can consume this data structure directly, but it is slower to query through a ROLAP tool.

If your M:M relationship is based on entities that are specific to the fact row (e.g. drivers on a car) this might be appropriate for a ROLAP tool as well, providing your ROLAP tool supports M:M relationships (e.g. using contexts in Business Objects).

Option 2 - Dummy 'combinations' dimension table
If you are mapping a list of common codes to a fact table (i.e. the linked entities are not peculiar to the fact row) then you can take another approach that will reduce the data volumes. An example of this type of scenario is ICD codes on an inpatient visit. Each inpatient visit will have one or more ICD diagnoses and/or procedures listed against it. The ICD codes are global.

In this case, you can make up a distinct list of the combinations of codes on each case. Make a dimension table with one row for each distinct combination, and have a link table between the combinations and the reference tables for the ICD codes themselves.

The fact table can have a dimension key to the 'combinations' dimension, and the dimension row has a list of references to actual ICD codes. Most ROLAP tools can consume this data structure. If your tool will only work with an actual M:M relationship then you can create a view that emulates the M:M relationship between the fact and the coding reference table. This would be the preferred approach with SSAS.

Advantages of option 1:
- Appropriately indexed, queries based on selecting fact table rows with a certain relationship through the M:M table can be reasonably efficient.

Slightly simpler conceptual model

Advantages of option 2:
- Data storage is more compact

You can emulate a straight 1:M relationship by presenting the combinations in a human-readable format as a code on the 'combinations' dimension. This might be more useful on dumber reporting tools that lack support for M:M relationships.

Scenario 2: M:M relationship between dimensions:

Harder to think of a use case, but one could envisage something out of healthcare with ICD codes again. On a cost analysis system, the inpatient visit may become a dimension, and will have M:M relationships between the visit (or consultant-episode in NHS-speak) and the codings.

In this case you can set up the M:M relationships, and possibly codify a human-readable rendering of them on the base dimension. The relationships can be done through straight M:M link tables or through a bridging 'combinations' table as before. This data structure can be queried correctly through Business Objects or better quality ROLAP tools.

Off the top of my head, I can't see SSAS being able to consume this without taking the relationship right down to the fact table, so you would need to present a view of the M:M relationship between the coding and the fact table rows to use SSAS with this data.

I would like to know exactly what kind of many-to-many relationship you have in mind in your model, either as it is in the transactional system or whatever data model it currently is in.

Typically, many-to-many relationships between dimensions are facts about dimensions. The fact that a customer orders from a several branch offices who service many customers, or something like that. Each of those is a fact. It would have an effective date or something like that, but the relationship could be "fact-less". The relationship itself may have other dimensions besides customer and branch office. So it's a typical star schema with a (potentially fact-less) fact table at the center. How this star can relate to other dimensional stars in the warehouse will obviously depend. Any time you combine different stars, you would do so on the business keys and have to ensure you aren't performing inadvertent cross-joins.

Typically one does not report on such dimension relationship tables to the same degree as larger fact tables and when they do, it's not always as much data, so it doesn't tend to affect performance. In the above case, you might look at customer/branch utilization over time, but better data about actual order quantities would be available in your order fact table, which would presumably also have dimensions for the customer, branch, etc. These are not what most people would consider many-to-many (although an order could be considered to define a many-to-many relationship from customer to branch), so are more typical in data warehouse environments. You would only be doing numerical aggregates on many-to-many models if you had rolled up summary information to that relationship level - i.e. customer, branch, month, total sales - a summary fact table, which looks more like a one fact-many dimension model now that it has numerical data.

Good answer. There are two cases that I'm exploring here. An N:M between fact and dimension, and an 1:N:M between fact, dimension, and dimension.
–
Brian Ballsun-StantonJan 12 '11 at 0:43

2

@Brian Ballsun-Stanton When you say N:M between fact and dimension, you mean that a given fact has several undistinguished and varying cardinality sibling dimensions which all apply, like tags on questions? So one question (fact) is tagged sql-server,data-warehouse and another is tagged data-warehouse, sql-server, business-intelligence. I would still pull that into a separate star for the tag assignment fact (which has a little different grain than the question fact). It's going to have great indexing possibilities and you'll be able to capture the dimensional change more obviously.
–
Cade RouxJan 12 '11 at 3:01

@Brian Ballsun-Stanton As for 1:N:M, that's a snowflake, I guess, and I tend to avoid that. If you want to define other stars (or bridges) for relationships between dimensions that's fine. Remember that a dimensional data warehouse is not normalized and above all it's a practical construct designed to support specific types of operations, not to specifically represent the real-world entity relationship or eliminate redundancy.
–
Cade RouxJan 12 '11 at 3:03

1

@Brian Ballsun-Stanton Have a look at the Kimball Forum and what he calls bridges and outriggers in his toolkit books: forum.kimballgroup.com/…
–
Cade RouxJan 12 '11 at 3:24

Here are some relevant articles from Kimball and others which may apply to modeling a given proposed many-to-many relationship. Note that a many-to-many relationship is a concept in the problem domain/logical model only. In a normalized OLTP model it would still be handled with a link table which is, of course, one to many each way. In a non-normalized Kimball data warehouse model there are a number of ways to do this, one of which basically treats that link table as the fact at the center of a star. Another is as an array of flag columns.

Ultimately, the choice is going to depend on what exactly you are modeling, how it's changing and how you want to report on it. This is where the dimensional modeling and data warehousing in general diverges sharply from the normalized model. The normalized model concentrates on the logical and theoretical relationships in the data, which data warehousing always keeps an eye on the realistic use cases and denormalizes to make those perform.