Wednesday, 11 May 2011

The Public Interface of a Database

The part of my recent ACCU London talk on database unit testing that generated the most chatter was the notion that there can be a formal public interface to it. Clearly we’re not talking about desktop databases such as Access, but the big iron products like SQL Server and Oracle. It is also in the Enterprise arena that this distinction is most sorely needed because it is too easy to bypass any Data Access Layer and directly hit the tables with tools like SQLCMD and BCP. I’m not suggesting that this is ever done with any malicious intent, on the contrary, it may well be done as a workaround or Tactical Fix[*] until a proper solution can be developed.

In any Enterprise there are often a multitude of in-house and external systems all connected and sharing data. Different consumers will transform that data into whatever form they need and successful systems can in turn find themselves becoming the publishers of data they only intended to consume because they can provide it in a more digestible form. The tricky part is putting some speed bumps in place to ensure that the development team can see when they have violated the design by opening up the internals to the outside world.

Building Abstractions

So what does this Public Interface look like? If you look at tables and stored procedures in a similar light to structs and functions in C, or classes and methods in C++/C# you naturally look for a way to stop direct access to any data members and this means stopping the caller performing a SELECT, INSERT, UPDATE or DELETE directly on the table. You also need to ensure that any implementation details remain private and do not leak out by clearly partitioning your code so that the clients can’t [accidentally] exploit them.

I’ll admit the analogy is far from perfect because there are significant differences between a table of data rows and a single instance of a class, but the point is to try and imagine what it would do for you if you could, and how you might be able to achieve some of the more desirable effects without sacrificing others such as productivity and/or performance. As you tighten the interface you will gain more flexibility in how you can implement a feature and more importantly open yourself up to the world of database refactoring (assuming you have a good unit test suite behind you) and even consider using TDD.

Stored Procedures

The most obvious conclusion to the encapsulation of logic is the blanket use of Stored Procedures and User Defined Functions so that all access is done using parameterised functions. These two constructs provide the most flexibility in the way that you design your abstractions because they allow for easy composition which becomes more essential as you start to acquire more and more logic. They also provide a good backdoor for when the unanticipated performance problems start to appear as you can often remediate a SQL query without disturbing the client.

Of course they are not a panacea, but they do provide a good fit for most circumstances and are a natural seam for writing unit tests and using TDD. The unit test aspect has probably been the biggest win for us because it has allowed us to develop and test the behaviour in isolation and then just plug it in knowing it works. It has also allowed us to refactor our data model more easily because we have the confidence that we are in control of it.

Views

Stored procedures are great when you have a very well defined piece of functionality to encapsulate, such as updating a record that has other side-effects that are inconsequential to the client, but they are more of a burden when it comes to querying data. If you try and encapsulate every read query using stored procedures you can easily end up with a sea of procedures all with subtly different names, or a single behemoth procedure that takes lots of nullable parameters and has a ‘where’ clause that no one wants to touch. Views solve this problem by giving the power back to client, but only insofar as to let them control the query predicates – the list of columns (and their names) should be carefully decided up front to avoid just exposing the base tables indirectly, as then you’re back to where you started.

User Defined Types

User defined types seem to have a bad name for themselves but I couldn’t find much literature on what the big issues really are[#][$]. I find them to be a very good way to create aliases for the primitive types in the same way as you would use ‘typedef’ in C/C++. A classic source of inconsistencies that I’ve seen in SQL code is where you see a varchar(100) used in some places and a varchar(25) in others as the developer just knows it has to be “big enough” and so picks an arbitrary size; it’s then left to the maintainer to puzzle the reason for the mismatch. UDTs allow your interface to be consistent in it’s use of types which makes comprehension easier and they can also be used by the client to ensure they pass compatible types across the boundary.

My current system started out on SQL Server 2005 which didn’t have the native Date and Time types so we created aliases for them. We managed to move to SQL Server 2008 before release and so could simply change the aliases. The system also has a number of different inputs and outputs that are not simple integers and so it’s important that we use the same scale and precision for the same type of value.

Schemas

Another tool in the box is Schemas. These are a way to partition your code, much like you would with namespaces (or packages), although they don’t provide the hierarchic behaviour that you get with, say, C# or C++. A very common database model is to separate the input (or staging) tables from the final ones whilst the data is validated and cleansed. One way to apply that might be to use the default ‘dbo’ schema for the production tables and use a separate ‘staging’ schema for the input tables. You can then overload table and object names without having to resort to ugly prefixes or suffixes.

You can of course go much further than this if you start out by treating the default ‘dbo’ schema as synonymous with ‘internal’ or ‘private’; then all objects in that default schema are considered as purely implementation details and not directly exposed to the client. You then create separate schemas for the major subsystems, such as a ‘backend’ schema for the server-side code and a ‘ui’ schema for the client. These will allow you to create multiple interfaces tailored for each subsystem that, if you use traditional functional composition, avoids duplicate code by just internally layering stored procedures. This natural layering also quickly highlights design problems, such as when you see an object in the ‘dbo’ schema referencing one in the ‘staging’ schema. There is often a natural flow, such as from staging to production, and schemas help you remain consistent about how you achieve that flow, i.e. push from staging to production or pull from staging by production.

Permissions

Ultimately if you want to deny access to something, just don’t grant it in the first place :-). Databases offer a fine-grained permissions system that allows you to grant access to specific objects such as individual stored procedures, functions and views. This means that your core definition of what constitutes the public interface is “anything you’ve been granted access to”. However it is not always obvious what may or may not be permissioned, and to what role, so the other mechanisms, such as schemas, can be used to provide a more visible means of scope.

Part of the permissions structure will be based around the roles the various clients play, such as server or UI, and so you will probably see a strong correlation between a role and the set of objects you then choose to expose in your various interfaces rather than just blindly exposing the same set for everyone. For example it is common to have a “Service Account” under which the back-end runs, this could be mapped to a specific Service Role which is then only granted permission to those objects it genuinely invokes which all exist in the specific Service schema. In contrast a Support Role may be granted access to all objects that only performs reads but have a bunch of special objects in a separate Support schema that can do some serious damage but are needed occasionally to fix certain problems. Different ‘public’ interfaces for different roles.

Oranges Are Not The Only Fruit

Jeff Atwood wrote a post for his Coding Horror blog way back in 2004 titled “Who Needs Stored Procedures, Anyways?” that rallies against the blanket use of stored procedures. One of his points is aimed at dispelling the performance myth of procedures, but as I wrote last year “Stored Procedures Are About More Than Just Performance”, but he’s right that embedded parameterised queries can perform equally well. Hopefully my comment about the use of views shows that I also feel the blanket use of procedures is not necessarily the best choice.

There are a lot of comments to his post, many of which promote the silver bullet that is an O/RM, such as Hibernate or Entity Framework. And that is why I was careful to describe the context in my opening paragraphs. If you’re working in a closed shop on an isolated system you can probably get away with a single Data Access Layer through which everything is channelled, but that is often not possible in a large corporation.

I would say that Jeff’s main point is that if you want to expose “A Service” then do just that, say, as a Web Service, so that you truly abstract away the persistent store. I agree with the sentiment, but corporate systems just don’t change technology stacks on a whim, they seem to try and drain every last ounce out of them. Hence I would question the real value in building (and more importantly maintaining) that proverbial “extra level of indirection” up front. Clearly one size does not fit all, but sadly many commenter's appear to fail to appreciate that. Heterogeneous systems are also wonderful in theory but when a company purposefully restricts itself to a small set of vendors and products what does it really buy you?

For me personally the comments that resonated most closely were the ones advocating the use of independent unit testing for the database code. I believe the ideal of using TDD to develop the database layer adds an interesting modern twist that should at least cause Jeff some pause for thought.

[*] The term tactical is almost always synonymous for quick-and-dirty, whereas strategic is seen as the right way to do it. However the number of tactical fixes and even entire tactical systems that have lived longer that many strategic solutions is huge. It just goes to show that growing systems organically is often a better way to approach the problem.

[#] The only real issue I’ve come across is that you can’t define a temporary table using UDTs on SQL Server – you have to add them to the model database or something which all sounds a bit ugly. I’ve not tried it out but I recently read that you can use “SELECT TOP(0) X,Y,Z INTO #Tmp FROM Table” or “SELECT X,Y,Z INTO #Tmp FROM Table WHERE 1 = 0” as a fast way of creating a temporary table based on a result set because the optimiser should know it doesn’t need to do any I/O. Describing your temporary table this way makes perfect sense as it avoids the need to specify any types or column names explicitly; so long as it incurs a minimal performance cost that is.

[$] Another more minor grumble seems to be to do with binding rules, in that they are effectively copied at table creation time instead of referencing the UDT master definition. You also can’t alias the XML type on SQL Server which would have been useful to us as we had performance concerns about using the XML type on a particular column. In the end we got close to what we wanted by using an alias to varchar(8000) with a rule that attempts a conversion of the value to XML – we could then just drop the rule if we noticed a performance problem in testing.

1 comment:

I agree with you about the need to encapsulate databases by having a 'public API'.

The debate around the use of sprocs does tend to get bogged down around issues such as security and performance. Whilst these are important, IMHO, the primary reason for using them (and the other techniques you have listed above) is to decouple a database from its clients - be they CRUD, ETL, reporting apps etc

Many client apps effectively have direct access to internal base tables. This, combined with other factors such as the limited use of db unit testing, limits the ability to independently change/extend dbs and often leads to taking a siloed approach and the proliferation of databases within an enterprise.