Trick Question -- Part Quattro

Spoiler: TPH is an evil trap

In the last installment I pleaded for the equal treatment of the database schema and the object model when implementing an application with a database, on the grounds that the failure of either means the failure of the whole system. I think it follows that a decent application ought to have both a good-quality database schema and a fabulous code base. If the database is an embarrassment on relational grounds, then it compromises the whole, just as it does if the app code is an embarrassment on OO grounds.

(I am working from the assumption that we do try to create good-quality software. There's the whole "sales-driven design" notion that says we can't ever create anything of value because it's too costly; I reject that notion as ... well, boring. There would not be much to write here if we simply gave up trying to make anything good.)

So, for the sake of argument, let's assume we've picked an O/R mapper to help create the app by automating much of the plumbing of repetitive SQL queries and the moving of data to and fro. I am going to invoke NHibernate, because it seems decent, and mature, and so on. I think Entity Framework has the potential to turn into something great, if built out correctly in upcoming versions, yet so far it seems over-inflated with a lot of marketing noise while it catches up to other solutions. Then again, I am not really an expert with ORM, and others could do a better job of comparing. Assume an appropriate level of abstraction is maintained between the database and the application by some implementation of either views, functions, and stored procs, or by a nice, clean data access layer, which you can do with a mapper if you implement it correctly.

The most trivial bits of O/R mapping aren't that interesting - a simple atomic object corresponds to a simple row in a simple table using a surrogate key. A collection like List&lt;T&gt; in the app is a collection of such objects representing a subset of the rows from a table. "Product Object" implies "select ProductID, Name, Price from Products where ProductID = n." That's all quite basic, and something one would probably create manually using traditional coding techniques. The place where things get interesting and hairy is in the more complicated cases where the object model and the relational model don't match up so neatly. I'll cut straight to the chase: polymorphism is where things get ugly, and that's where we should start to be careful. And if you look up polymorphism in relation to popular ORMs, an issue rears its ugly head immediately: TPH, or Table Per Hierarchy, as a persistence model for polymorphism.
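To make the trivial case concrete, here's a minimal sketch - in Python with the stdlib sqlite3 module purely for illustration, since the post's context is C#/.NET. The table and column names (Products, ProductID, Name, Price) come from the text above; the Product class and load_product function are hypothetical stand-ins for what a mapper generates:

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical object-side representation of one Products row.
@dataclass
class Product:
    product_id: int
    name: str
    price: float

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, "
    "Name TEXT NOT NULL, Price REAL NOT NULL)"
)
conn.execute("INSERT INTO Products VALUES (1, 'Widget', 9.99)")

def load_product(conn, product_id):
    # The mapping an ORM automates: one row becomes one object.
    row = conn.execute(
        "SELECT ProductID, Name, Price FROM Products WHERE ProductID = ?",
        (product_id,),
    ).fetchone()
    return Product(*row) if row else None

p = load_product(conn, 1)
print(p)
```

Nothing here is hard; it's exactly the repetitive plumbing a mapper exists to generate for you.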

I don't have anything against the smart, hard-working folks that put this stuff together, nor am I anti-ORM in any way. Still, amazingly, the documents and blogs about both NHibernate and Entity Framework generally give about equal billing to three methods for representing polymorphic objects in a relational schema -- even though one of them, TPH (which, horribly, seems always to come first in every such list), violates the most basic best practices for relational database design. It's not normalized in even a trivial sense, it makes it impossible to enforce integrity on the data, and, what's most "awesome": it is virtually guaranteed to perform badly at scale for any non-trivial set of data.

Luckily, there is a workable implementation in most of these tools, Table per Type. What's missing from the docs and recommendations around TPH is some big, bold print that says,

"Never do TPH, because it's a horrible nightmare in the relational world, a violation of 30 years of best practice in database design, and probably makes a DBA somewhere cry inside. Always do TPT. TPH is presented here only as a twisted curiosity, and is to be avoided."

Lack of that warning about TPH means people might be tempted to actually use it, which is unfortunate.
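To make the contrast concrete, here is a minimal sketch of the two shapes - written in Python with the stdlib sqlite3 module purely as a runnable illustration; the table and column names are hypothetical. Note how the TPH table cannot enforce even a basic NOT NULL rule per subtype, while the TPT tables can:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# TPH: one table, a discriminator, and nullable columns for every subtype.
# Nothing stops a 'Check' row from arriving with a NULL CheckNumber.
conn.execute("""
    CREATE TABLE Transactions_TPH (
        TxID        INTEGER PRIMARY KEY,
        TxType      TEXT NOT NULL,   -- discriminator
        Amount      REAL NOT NULL,
        CheckNumber TEXT,            -- only meaningful for checks
        InvoiceDate TEXT             -- only meaningful for invoices
    )""")
conn.execute(
    "INSERT INTO Transactions_TPH VALUES (1, 'Check', 50.0, NULL, NULL)"
)  # accepted, integrity be damned

# TPT: a base table plus one narrow table per subtype, so subtype
# columns can carry real constraints.
conn.execute("""
    CREATE TABLE Transactions (
        TxID   INTEGER PRIMARY KEY,
        Amount REAL NOT NULL
    )""")
conn.execute("""
    CREATE TABLE Checks (
        TxID        INTEGER PRIMARY KEY REFERENCES Transactions(TxID),
        CheckNumber TEXT NOT NULL
    )""")
conn.execute("INSERT INTO Transactions VALUES (1, 50.0)")
try:
    conn.execute("INSERT INTO Checks VALUES (1, NULL)")  # rejected
except sqlite3.IntegrityError as e:
    print("TPT rejects the bad row:", e)
```

The declarative integrity the TPH shape throws away has to be reinvented, badly, in application code - if it gets reinvented at all.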

I imagine all DBAs will hate TPH. Some devs, I am certain, will complain that the DBAs are being obstinate or unreasonable, but here's the thing: TPH is a bad idea, and the DBAs know it. If the shoes and feet were reversed, so to speak, and the DBAs were demanding that developers violate some fundamental programming principles in the design of their code -- without even really understanding it -- the developers might be a bit miffed as well, and with valid reasons. The sad thing is that blunders like TPH can sour DBAs on the entire notion of ORM, based simply on the impression that ORM tools encourage bad database design. It really ought to be the other way around. But let's set the cultural argument aside and look at the technical reasons why this method is to be avoided.

I worked with an ISV-provided piece of software for years that, by coincidence, had important, busy tables designed on the TPH principle. Ironically, it didn't use anything as elaborate as an O/R mapper; the developers had just adopted the same notion for polymorphism. The application had an accounting function, among other things. Each financial transaction in the accounting part of the system, of every type (charges, payments, checks, invoices, etc.), was stored in one massive Transactions table. The table was 120 columns wide, and practically every column was nullable and might or might not have a value depending on what type of transaction was indicated in - you guessed it - the "discriminator" column. There were 16 types of transactions, if I recall. This massive table also stored both posted and un-posted transactions (hey, they're all "transactions," right?). It also contained the header records for batches of transactions that were entered at the same time. In order to "post" un-posted transactions, an elaborate process would change the discriminator/record type and set various dates and flags in the table, through a ... erm, "complex" ... procedure. A massive amount of this huge, largely empty table (nulls take space, even without values) had to be kept in cache on the SQL Server all the time. It was basically filling the RAM on the server with blank space, and the odd value here or there at random. Worse, this design provided no advantage whatsoever. It wasn't as if some advantage had been traded for some disadvantage. It was simply, purely horrible.

The application did not perform well, at all, at scale, and was just a constant source of expense and pain for us. Not because of some purist ideal of normalization, but because all the financial data was mixed into a great disorganized pile. The frustrating thing about it was that had it been based on a reasonable schema, there would have been no performance issue at all, and it would easily have worked for hundreds or a few thousands of end users on fairly modest hardware.

Of the docs and blog posts I have been able to find on this topic, most give only a cursory overview, along the lines of "you might want to do this, or that; it's not that important." Alex James does more in the way of explaining how the options work and why to pick one. Interestingly, there's a table on his blog page ranking them by different criteria, and the relational-friendly TPT option is marked "winner" in every category except performance. I had the privilege of attending a session he gave at the PASS Summit, which was excellent; he is a very sharp guy with a real understanding of the database side of this issue. Yet I do have to wonder about the "performance" argument, as it's not been my experience at all. Maybe I am jumping to conclusions, but I wonder if it's the old myth about "avoid joins to enhance performance" coming back around.

I suspect this: for a trivial case, where you have five or six object types that are quite similar, then stuffing them into one table might avoid some joins - as long as the cost of the interleaving of rows of different types, which reduces the efficacy of indexes and takes more memory and slows many types of queries down, doesn't cancel the intended optimization. However for real world cases with large sets of data, design-by-join-avoidance isn't really a good strategy. Better performance is gained, in most cases, by the partitioning of data into smaller, more highly sorted and organized sets. Yes, the server has to perform joins, but the cost of joins is much less than the cost of brute-force searching through a high volume of interleaved, mixed, sparse rows of different "types" in one massive table.
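A toy illustration of the two query shapes, using Python's stdlib sqlite3 (table and column names are hypothetical): the TPT version does pay for a join, but it reads two narrow, dense tables instead of filtering sparse, interleaved rows by discriminator - and it returns exactly the same answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# TPH: every type interleaved in one wide table, filtered by discriminator.
conn.execute(
    "CREATE TABLE Tx_TPH (TxID INTEGER PRIMARY KEY, TxType TEXT, "
    "Amount REAL, CheckNumber TEXT)"
)
conn.executemany("INSERT INTO Tx_TPH VALUES (?,?,?,?)", [
    (1, 'Payment', 20.0, None),
    (2, 'Check',   50.0, '1001'),
    (3, 'Charge',  15.0, None),
    (4, 'Check',   75.0, '1002'),
])

# TPT: checks partitioned into their own narrow subtype table.
conn.execute("CREATE TABLE Tx (TxID INTEGER PRIMARY KEY, Amount REAL)")
conn.execute(
    "CREATE TABLE Checks (TxID INTEGER PRIMARY KEY REFERENCES Tx(TxID), "
    "CheckNumber TEXT)"
)
conn.executemany("INSERT INTO Tx VALUES (?,?)",
                 [(1, 20.0), (2, 50.0), (3, 15.0), (4, 75.0)])
conn.executemany("INSERT INTO Checks VALUES (?,?)",
                 [(2, '1001'), (4, '1002')])

tph = conn.execute(
    "SELECT TxID, Amount, CheckNumber FROM Tx_TPH "
    "WHERE TxType = 'Check' ORDER BY TxID"
).fetchall()
tpt = conn.execute(
    "SELECT t.TxID, t.Amount, c.CheckNumber "
    "FROM Tx t JOIN Checks c ON c.TxID = t.TxID ORDER BY t.TxID"
).fetchall()
assert tph == tpt  # same answer; the join reads small, dense sets
print(tpt)
```

At four rows nothing matters; at millions of rows, the TPH scan drags every other transaction type through cache along with the checks it actually wants.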

Perhaps a clearer analogy: suppose we had one uber-table, called "Object." Everything inherits from Object, so why not one massive, ultimate Table Per Hierarchy with Object as the root? We just put in a discriminator that indicates what type of object each row represents, and a nullable column for every property of every object. Simple! The issue is that everything - in this case literally everything - is mixed together. It should perform swimmingly, though, because there are no joins!

But wait. There are other issues over and above joins. How does the server locate data in this table? How much cache would be required? What does an index look like? Clearly the server would perform better if the data were partitioned sensibly, into smaller groups, so that it could find related information. The answer, of course, is tables. Having multiple tables is not just convenient; it's a performance optimization. In fact, there are many performance issues that can be solved by adding tables, not removing them - most of the time, if unrelated rows can be partitioned into separate sets (tables), it is easier and faster for the server to locate them.

So if performance is the one argument for TPH, the one that Alex says "trumps these other concerns," then I have to say I'm not quite buying it. In fact, I can't see why anyone would ever want to do TPH for anything other than a small toy of a system. In fact, can we turn that thing off in any of these O/R mappers? Where's the OFF button?

Back to the larger point, which is, "what are the major design issues for connecting apps to data?" I think the points I want to make are: a. the database schema still deserves careful attention and design, even in 2010, and b. the use of modern methodologies (agile, ORM, etc.) can be aligned with best practices but you have to be intentional about it. You have to be able to say things like, "TPH sounds neat-o, but it violates normalization fundamentals. What's up with that? That seems like a bad idea, even according to my olde-time gray-haired DBA over here. Why do that? What are my other options? Which ones don't make the database suck?" The end goal should be "and" - great database AND great application, no compromises.

Blog entries like this are what I’d like to see more of. Unfortunately, ORM seems to have become a religious war. In my opinion, it's in the SQL community's best interest to understand ORM well enough to work with the developers and make suggestions to them. ORM can be a win for both the application and the database, because its whole purpose is to bridge the divide so that each tier can draw on its technical strengths. ORM does not mean that developers can go ahead and forget about good database design, or forget what an acceptable SQL query looks like.

TPH is just a cool three-letter acronym for poor design. This approach of treating the DB like a big flat file has been around for a VERY long time - especially among developers who don't understand relational database strengths and benefits. In fact, it's one of the biggest issues I see when conducting my audits - so I called this out (in just a handful of words) in my Top 10 Performance Secrets for Developers article for SQL Server Magazine.

What's great about this post is that Merrill is spot-on in calling out how dangerous it is for Devs to espouse this approach as a best-practice - especially in terms of performance. Because it's anything but a best practice. And it simply won't work as the size of the database grows.

And Merrill is also right that ORM is NOT the enemy here. In fact, it can be a great asset and ally not only to developers but to DBAs as well.

Consequently, to put this into perspective for developers, in C# ... this approach would be similar to creating:

a) a super-type called PolymorphicEntity

b) giving that object a Dictionary&lt;string, string&gt; property called Properties

c) storing 'polymorphic' attributes or 'properties' in that 'bag' of properties...

d) never sub-classing your PolymorphicEntity - but, instead, just adding new 'Properties' for different 'types' of objects by dumping more strings into the dictionary defining new 'attributes'.

e) and then pretending that you had a new, cool, way to create polymorphic objects where you could DYNAMICALLY change an object from one type to another by merely 'decorating' it with additional properties.

That's ROUGHLY the approach taken here with TPH: just dumping everything into a single 'bag' for persistence purposes doesn't work. It becomes VERY problematic to maintain and extend, and sucks from a performance standpoint. (For example, with our 'PolymorphicEntity', in order to determine whether your particular 'instance' has an is-a relationship (i.e., inheritance), you'd have to ITERATE over all the 'Properties' in that dictionary and see if one or more specified properties existed. Whereas, if you use interfaces and/or normal inheritance, you can INSTANTLY know whether your instance implements or inherits a type/interface by asking the runtime with myInstance.GetType().IsSubclassOf(typeof(X)), and so on.)
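The property-bag analogy above, sketched in Python rather than C# (the class and key names here are hypothetical): the 'bag' forces you to scan for magic keys to answer an is-a question that the runtime answers directly under normal inheritance:

```python
# The property-bag anti-pattern: one universal type, everything stringly typed.
class PolymorphicEntity:
    def __init__(self):
        self.properties = {}  # the 'bag' - strings all the way down

check_bag = PolymorphicEntity()
check_bag.properties["Type"] = "Check"
check_bag.properties["CheckNumber"] = "1001"

def is_check(entity):
    # The is-a test means scanning the bag for the right magic keys...
    return (entity.properties.get("Type") == "Check"
            and "CheckNumber" in entity.properties)

# ...versus ordinary inheritance, where the runtime knows instantly.
class Transaction: ...

class Check(Transaction):
    def __init__(self, number):
        self.check_number = number

real_check = Check("1001")
print(is_check(check_bag), isinstance(real_check, Transaction))
```

The bag version also silently accepts typos in key names, just as a TPH table silently accepts rows with the wrong mix of nullable columns filled in.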

People would shake their heads at using a PolymorphicEntity in production because it would become a NIGHTMARE to maintain.

Merrill is making arguments against just such an approach - but within the realm of the database.

Yes, TPH can be bad, but then so can water if you drink too much of it - should that stop you from drinking water?

Your argument is great if and only if you consider using TPH a binary issue; fortunately, it isn't. TPH works well in some scenarios and performs terribly in others; the key is to understand when it is the correct choice and when it isn't. Using an ORM is not about hiding the database from the developer; it's about bridging the gap between relational databases and objects to allow developers to leverage the power of both. To use an ORM correctly requires the developer to understand both sides of the ORM (object and database).

Just like no self-respecting developer would create a PolymorphicEntity as described by Michael, no self-respecting developer would advocate the use of TPH to map an inheritance hierarchy that spans some 16 types with a depth of maybe 3 (yes, I am arguing that the developers at that ISV did not respect themselves or their code).

I think the real problem you highlighted in your example is that it is still far too easy for people to be hired as developers and then proceed to generate the kind of garbage that maps a large, poorly designed inheritance hierarchy onto an equally poorly designed database schema.

Of course, it is possible that only a few "nullable" columns exist initially, but as the application and database evolve, more and more unrelated data will be grouped into the one big table - that is its fate :)