Implementing Many-to-many Relationships in Data Warehousing

This article will discuss how to make many-to-many relationships in data warehousing easily queried by novice SQL users using point-and-click query tools.

This is a big problem with Oracle Discoverer-like tools where the metadata layer is basically a set of pre-joined tables from which the user simply clicks on columns and hits the run button. You can create custom complex queries that they can run, but then every query is custom, which defeats the purpose of the tool in the first place.

The design goal is to create a structure that is simple for the end user and which normally translates to something as flat as possible. This article will go through the different methods of implementing many-to-many relationships, and look at their effect on query complexity, especially for someone who use a tool that hides the SQL.

The typical data warehouse data model is a fact table surrounded by many dimension tables.

In the world of data warehousing, many-to-many relationships are sometimes unavoidable, and we all know that you do not want to normalize your fact tables. One example would something like sports interests or job categories that you’re qualified to work for. The model for that in a normalized OLTP environment would be something like:

The data warehouse world does not function with normalized fact tables since you need to start doing lots of select count(distinct) resulting in facts columns not being additive. I won’t get into that discussion here, but search for articles or literature by Ralph Kimball, the guru of data warehouse design.

Fortunately, there are a few ways of implementing many-to-many relationships for data warehousing. Typically, most DBAs pick either the Boolean Column, the Multiple Column, or the Bridge Table method.

The boolean column method consists of creating a column for each possible value, like so:

Each boolean column can be a numeric (1 or 0), or character (Y or N).

The multiple column method consists of having columns for the number match the number of choices one can make. This has its limitations since it is tightly coupled to the application; but is easily transformed. The typical data model would look like:

In this case, you could only have five sports even though you could have several dozen choices.

The last method is the Bridge Table method, which you may find in data warehouse literature. It is basically a many-to-many relationship attached to a non-normalized fact table. It would typically look like:

Note that the weighing factor is there mainly to distribute the dollar amounts evenly across the categories, where the sum of the parts cannot be greater than the total — for example, if one user has interests in Football and Baseball and spends $10, and the other has interests in Football and Soccer and spends $5. A revenue by breakdown cannot be…

Football: $15
Baseball: $10
Soccer:$5
Total: $30

…when in reality I only made $15. The weighing factor here would be equal to the number of interests for that user (i.e. it would be 2 for each user in this instance).

The result would then be:
Football: $7.5
Baseball: $5
Soccer: $2.5
Total: $15

Back to the main topic. Now that we have our methods, let’s see how easily we can write queries against them. I will have 4 queries for each method:

Breakdown count by sport

an â€œorâ€ condition: count of basketball or football

an â€œandâ€ condition: count of basketball and football

complex boolean condition: count of (basketball and football) or (soccer and hockey)

The Multiple Column Method

Breakdown count by sport query:

select sport, sum(total) from (
select sports_1 sport, count(*) total from sports_dm s, fact f
where s.id= f.sports_dm_id group by sports_1
union all
select sports_2 sport, count(*) total from sports_dm s, fact f
where s.id= f.sports_dm_id group by sports_2
union all
select sports_3 sport, count(*) total from sports_dm s, fact f
where s.id= f.sports_dm_id group by sports_3
union all
select sports_4 sport, count(*) total from sports_dm s, fact f
where s.id= f.sports_dm_id group by sports_4
union all
select sports_5 sport, count(*) total from sports_dm s, fact f
where s.id= f.sports_dm_id group by sports_5
)
group by sport;

An â€œorâ€ condition: count of basketball or football:

select count(*) from sports_dm s, fact f where s.id= f.sports_dm_id
and ( sports_1 in ('football','basketball')
or sports_2 in ('football','basketball')
or sports_3 in ('football','basketball')
or sports_4 in ('football','basketball')
or sports_5 in ('football','basketball')
) ;

An â€œandâ€ condition: count of basketball or football:

select count(*) from fact f
where sports_dm_id in ( select id from sports_dm
where sports_1 in ('football')
or sports_2 in ('football')
or sports_3 in ('football')
or sports_4 in ('football')
or sports_5 in ('football')
intersect
select id from sports_dm
where sports_1 in ('basketball')
or sports_2 in ('basketball')
or sports_3 in ('basketball')
or sports_4 in ('basketball')
or sports_5 in ('basketball')
) ;

Complex boolean condition: count of (basketball and football) or (soccer and hockey):

select count(*) from fact f
where sports_dm_id in ( (select id from sports_dm
where sports_1 in ('football')
or sports_2 in ('football')
or sports_3 in ('football')
or sports_4 in ('football')
or sports_5 in ('football')
intersect
select id from sports_dm
where sports_1 in ('basketball')
or sports_2 in ('basketball')
or sports_3 in ('basketball')
or sports_4 in ('basketball')
or sports_5 in ('basketball')
)
union
(select id from sports_dm
where sports_1 in ('soccer')
or sports_2 in ('soccer')
or sports_3 in ('soccer')
or sports_4 in ('soccer')
or sports_5 in ('soccer')
intersect
select id from sports_dm
where sports_1 in ('hockey')
or sports_2 in ('hockey')
or sports_3 in ('hockey')
or sports_4 in ('hockey')
or sports_5 in ('hockey')
)
) ;

Just imagine a few dimensions like that, and you would end up exceeding the SQL query text limit very quickly.

select count(*) from fact f, sports_bridge sb
where f.sports_bridge_id = sb.id
and sb.id in ( select sbd.sports_id from sports_bridge_details sbd, sports_ref s
where sbd.sports_id = s.id
and s.description in ('basketball','football')
group by sbd.sports_id having count(*) >1);

This works, but you could also have an intersect in the subquery.

Complex boolean condition: count of (basketball and football) or (soccer and hockey):

select count(*) from fact f, sports_bridge sb
where f.sports_bridge_id = sb.id
and sb.id in ( select sbd.sports_id from sports_bridge_details sbd, sports_ref s
where sbd.sports_id = s.id
and s.description in ('basketball','football')
group by sbd.sports_id having count(*) >1
union
select sbd.sports_id from sports_bridge_details sbd, sports_ref s
where sbd.sports_id = s.id
and s.description in ('soccer','hockey')
group by sbd.sports_id having count(*) >1);

Again, intersects can be substituted for group by having count(*) >1.

The following table summarizes query complexity vs. method for ease of query writing in an ad hoc query tool such as Oracle Discoverer.

Boolean method

Multiple column method

Bridge table method

Breakdown count by sport query

Complex

Complex

Easy

An â€œorâ€ condition: count of basketball or football

Easy

Complex

Easy

An â€œandâ€ condition: count of basketball or football

Easy

Complex

Medium

Complex boolean condition: count of (basketball and football) or (soccer and hockey)

Easy

Complex

Complex

As you can see, the Boolean Method is far superior in every category except the breakdown, and the Bridge is a close second. You have to use the Bridge method if the number of potential values in the dimension exceeds 100 columns or so. You can do it, but it looks ugly. The best of both worlds would be to merge the Boolean and Bridge column methods. Its fairly straight–forward: add the boolean columns to the bridge table as follows:

bridge-4.png is used twice, in the Multiple Column Method and in the Bridge Table method. If the methods really make the model look the same, maybe you can comment on it. If not, then replacing one or the other picture would be good.

Nothing against the author or the Python Group as this is a great article. But, anyone worth their salt when it comes to databases should see this as another example of how the dimensional model fails. Trust me, using the dimensional model is doing a disservice to yourself and the company you work for. Even Kimball cannot prove this model has any correctness to it. Where as, the relational model as 2500+ years of math behind it. Donâ€™t even get me started on trying to get the correct temporal relations in there. In short: donâ€™t be fooled by the marketing!!

good article, but i have a question. in the the Bridge Table method graphic, the weighting factor is on the bridge table. in the kimball dwh books, the weighting factor is on what you refer to here as sports_bridge_details. can you clarify this discrepancy? why do you place the weighting factor on the group entity? thanks

@Ryan: please don’t take this ill-given advice as a counter-example for Kimball and dimensional modeling. Seeing this article made me my poor BI heart bleed to say the least….

The thing to remember when handling relationships (including N:N) in a dimensional modeling exercise is to express them as fact tables. In this specific case a fact-less fact table. Of-course it is often needed to create analytical or aggregate views (facts) on top of that to make reporting easier. That is especially the case if you are going to let the users report themselves on data structures like these. Your solution is a nightmare in that respect. You will get Cartesian products simply based on the fact that you slammed a N:N relationship into a dimensional model. In a self-service reporting setting this is deadly. A users drops in another column from an extra table and the number of returned rows changes instantly.

In a star the fact is a bridge. It attempts to ‘bridge’ the many-many relationship amongst the dimensional context. In other words without any measure [ factless facts ] it is in BCNF; whereupon the introduction of a metric, the fact is in 3NF.

Hello everyone,
The article was really informative and it gave me a good idea on how to resolve N:N relationships.
I have to model a star of Bugs Tracking. One of dimensions is named BUGS. We have differents status for one bug as accepted , new , closed etc .
One status relate to several bugs and one bug relate to several status.
You write :
“The best of both worlds would be to merge the Boolean and Bridge column methods”
Could you please me help to design this merge model (Boolean and Bridge column ) for my star ?
Must I include into the BUGS dimension boolean fields as ? :
accepted , new , closed (9 status fields )
I do not understand the logic of merge Boolean and Bridge column in this case.
Is it for reduce the number of rows if the BUGS dimension?
Thanks in advance.

PYTHIAN®, LOVE YOUR DATA®, and ADMINISCOPE® are trademarks and registered trademarks owned by Pythian in North America and certain other countries, and are valuable assets of our company. Other brands, product and company names on this website may be trademarks or registered trademarks of Pythian or of third parties. Use of trademarks without permission is strictly prohibited.