Monday, 12 November 2012

The Cost of Defensive Programming

They say that the first rule of optimisation is to measure. Actually they say “don’t do it” and the third rule is to measure, but let’s invoke artistic license and just get right on and measure. Another rule when it comes to optimising code is that intuition usually sucks and the bottleneck won’t be where you think it is. Even for experienced developers this still holds true.

My current system has a component that is not unit testable, in fact the only way to test it is to run the entire process and either debug it or compare its outputs. As a consequence the test feedback loop between changes is insanely high and if you throw some infrastructure performance problems into the mix too you can probably guess why I decided to let the profiler have a look at it during one lunch time. From the opening paragraph you can probably guess that what I found was not what I expected…

The code I stumbled upon probably deserves to be submitted to The Daily WTF because it almost manages to achieve functional correctness in a wonderfully obtuse way. It looked something like this:-

Hopefully you should be able to deduce from my attempt to disguise and distil the guilty code that it executes a query to load some data and then it builds a map of ID to some attribute value. The reason I say that it only “almost” works is because what you can’t see from it is that there is something fundamentally missing which is a parameter that defines a context (specifically a point in time) for which the data should be loaded.

However, before I get to the real subject of this post let’s just get the other obvious criticisms out of the way. Yes, the component has SQL statements embedded in it. And yes, the SELECT statement uses ‘*’ when it only requires two of the columns in the table. OK, let’s move on…

So, what prompted me to write this post is actually the use of the “upsert[1] style anti-pattern” in the loop at the end:-

if (!bugs.Contains(id)) bugs.Add(id, title); else bugs[id] = title;

This code is nasty. It says to me that the programmer is not sure what invariants they are expecting to handle or maintain within this method/class. What probably happened (and this is more evident if you know the data model) is that the Add() started throwing exceptions due to duplicates so it was silenced by turning it into an “Add or Update”[2]. Of course the knock-on effect is that the same value can now be silently updated many times and the observable outcome is the same. The addition of the ORDER BY clause in the SQL is what elevates it from non-deterministically broken to nearly always correct because it is never invoked in production in a manner where the tail-end rows are not the ones that matter.

Putting aside the embedded SQL for the moment, the correct implementation only needs to select the data for the context in question and there should only be one row per item so just doing the Add() is sufficient. The invariant in the data model is that for any point in time there is only one version of an item “active”. The only reason the Add() could throw is if the initial query is wrong, the data is broken or the invariant has changed and the code is now broken. If the latter happens you’ve got real problems and the least of them is that some code has failed very loudly and clearly.

Now to unwind the stack and return to the original performance problem. This code masks what should have been a query that returns 40,000 rows into one that actually sucks in 1,200,000 rows instead. Using the aforementioned “back of the envelope calculation” that is over an order of magnitude more rows than required. Luckily the table is fairly short on columns and so in terms of data retrieved it’s not insanely huge which is probably why it’s crept up largely unnoticed.

As an aside I mentioned that it’s never run retrospectively in production and that is another reason why it has remained hidden. It has however caused me to scratch my head on a few occasions during testing as I’ve observed an unanticipated change in the output but not had the time to pin down the exact cause. A bug that only appears occasionally during testing is never going to make it up the TODO list when there are far more pressing features to implement, and the costly way the component needs to be tested pretty much ensures that it will never get fixed in isolation - a double whammy.

[1] For those lucky enough to avoid databases an “upsert” is short for update-or-insert. You attempt an UPDATE and if no rows were modified you perform an INSERT instead. SQL Server has proper support for this idiom these days in the shape of the MERGE keyword.

[2] This “pattern” is sprinkled throughout the component’s code so it’s not an isolated case.