Anti-Patterns and Malpractices, Volume 1: Tumbling Data

"Lonely but free I'll be found
Drifting along with the tumbling tumbleweeds"

- The Supremes, "Tumbling Tumbleweeds"

Welcome to the first installment of what I hope will be a regular feature on this blog, Anti-Patterns and Malpractices. As a consultant, I get the honor of seeing a lot of different systems, with a lot of different code. Some of it is good, and some of it -- well -- I'll be featuring that which is not so good here. No names will be named, and code will be changed to protect the not-so-innocent; my goal is not to call out or embarrass anyone, but rather to expose those misguided patterns and practices which inevitably lead to problems (and a subsequent call to a consultant). Perhaps if I post enough of these, I'll have fewer less-than-appealing encounters in my work!

The topic du jour is the Tumbling Data Anti-Pattern, a name coined by my friend Scott Diehl. Much like the tumbleweed lazily blowing around in the dust, data which exhibits this pattern is slowly and painstakingly moved from place to place, gaining little value along the way.

So what exactly typifies this particular anti-pattern? Consider the following block of T-SQL, designed to count all married employees in the AdventureWorks HumanResources.Employee table and bucket them into age ranges of 20-35 and 36-50, grouped by gender. Employees older than 50 should be disregarded. (Only the final steps of the script are shown here; earlier statements populate the various temp tables.)

--Remove the employees older than 50
DELETE
FROM #MarriedEmployees
WHERE
    EmployeeId NOT IN
    (
        SELECT EmployeeId
        FROM #MarriedEmployees_20_35
    )
    AND EmployeeId NOT IN
    (
        SELECT EmployeeId
        FROM #MarriedEmployees_36_50
    )

--Get the final answer
SELECT
    a.Gender,
    a.theCount AS [20 to 35],
    b.theCount AS [36 to 50]
FROM #Employee_Gender_Count_20_35 a
JOIN #Employee_Gender_Count_36_50 b ON b.Gender = a.Gender

This kind of code tells us several things about the person who wrote it. Rather than thinking upfront and designing a complete solution -- or at least a game plan -- before typing, this person appears to have worked through the problem step by step, coding as he went. Some debugging was done along the way, but the real goal was to spit out an answer as quickly as possible (or so it seemed at the time). No attempt was made to go back and remove the extraneous code or do any cleanup; why bother, when we already have an answer?

It's important to mention that this is a simple example. I generally see this anti-pattern employed when developers are tasked with producing large, complex reports against data sources that aren't quite as well-designed as they could be. In an attempt to preserve sanity, the developer codes each tiny data transformation in a separate statement, slowly morphing the data into the final form he wishes to output. The resultant scripts are often thousands of lines long and take hours to run, during which time the server crawls (and throughout the office you can hear people muttering "the server sure is slow today").

The solution to this problem is simple, of course, and the best software engineers do it automatically: Before writing a line of code, sit back for just a moment and consider your end goal. Do you need to work in steps, or will a single query suffice? Can various join conditions and predicates be merged? Perhaps a Google search is a good idea; what is the best way to produce a histogram without a temp table?
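As a sketch of what that single-query approach might look like -- assuming the original AdventureWorks (2005) schema, where HumanResources.Employee carries BirthDate, MaritalStatus, and Gender directly -- the whole histogram can be computed in one pass with CASE expressions and a GROUP BY:

```sql
--One-pass histogram: married employees bucketed by age range, grouped by gender.
--Assumes AdventureWorks 2005; in later versions these columns moved around,
--so treat the table and column names as illustrative.
SELECT
    e.Gender,
    SUM(CASE WHEN DATEDIFF(YEAR, e.BirthDate, GETDATE()) BETWEEN 20 AND 35
             THEN 1 ELSE 0 END) AS [20 to 35],
    SUM(CASE WHEN DATEDIFF(YEAR, e.BirthDate, GETDATE()) BETWEEN 36 AND 50
             THEN 1 ELSE 0 END) AS [36 to 50]
FROM HumanResources.Employee AS e
WHERE
    e.MaritalStatus = 'M'
    --Disregard employees older than 50; note that DATEDIFF(YEAR, ...) is a
    --rough age calculation that ignores whether the birthday has passed.
    AND DATEDIFF(YEAR, e.BirthDate, GETDATE()) <= 50
GROUP BY e.Gender
```

No temp tables, no intermediate DELETEs -- the bucketing, filtering, and pivoting all happen in a single scan of the base table.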

The hurried atmosphere of many companies leads to a "get it done right now--even if it's far from perfect" attitude that ends up wasting a lot more time than it saves. The above example code block took me around 10 minutes to put together. A single-query version took me under two minutes to code. It is less than a third of the length, runs approximately 500 times faster, and uses 0.4% of the resources. All because I spent a couple of moments reflecting on where I was going before I took the first step.

If you catch yourself falling into this anti-pattern, step back and question whether this code will have a life beyond the current query window. If it will ever be checked into source control, it's probably a good idea to go back and do some cleanup.

If you find yourself tasked with maintaining code that looks like what I've posted, my suggestion is to simply re-write it from scratch. I was recently faced with a script containing over 2000 lines of this kind of thing, and I spent almost two days slowly working my way through the mess trying to make sense of it. On the evening of the second day, after talking with some of the stakeholders, I realized that it was actually a simple problem to solve. One hour later I had a new, totally functional solution -- a couple of hundred lines long, and several orders of magnitude faster. Sometimes it's best not to wade through a muddy puddle when you can simply hop right over.


Comments

I've noticed this pattern - I call it the data bucket pattern but it's the same idea - is popular among former Access programmers who used lots of make-table queries to move the data toward the desired answer. And I've seen some production databases that are set up this way: a different table for the same data at each workflow state, so rows move from table to table over their lifecycle.

I think people get emotionally attached to their code; they think it is their baby and feel hesitant to throw it out (understandably so). They try to patch it here, patch it there, until their code looks like a quilt. We have all done this, I have done it too. I spent 2 days on that code, I am not throwing it out, I know I can fix it...

This is similar to slowly changing dimensions in data warehousing: this code changes too... only for the worse. Every change you add makes it worse. You have to be bitten a couple of times until you realize that rewriting is the only option. Unfortunately it takes time (and money) until you realize that. Enough rambling, time for a Crown Royal with ice... :-)

Good example. I'll say, though, that this can be a good way for beginners to START to code. They might not have the skills to have thought out the whole thing beforehand in a set-based way. This is surely the way I started to write SQL waay back when I was first learning it. But the missing step for a beginner is this: REFACTOR!

Beginners: If you're learning, start out by gathering your datasets independently, then figure out how to join them together into a concise, high-performance query. Once you've done that a few times, you'll gain the skills to do as Adam says and think out more involved queries in just a few minutes. Don't just stop with the "Tumbling Data" pattern and call it "good enough".

I agree with everything in the post but I sometimes prefer it when something (reports?) is written in the Tumbling Data Anti-Pattern way! I often have to look at stored procs used by reporting services reports; the reports are complex beasts, and the stored procs producing them typically consist of a single very complex SELECT statement.

These SELECT statements take time to understand, and making a change to such an animal really can be a daunting task. I get a sinking feeling when I open one of these and have to make a change. On the other hand, when the stored proc was done using the Tumbling Data Anti-Pattern, I can more easily understand how the original developer was thinking, and subsequently my change becomes easier. Looking at little snippets and understanding them as you go is just easier IMHO.

I guess these complex SELECT statements can be documented, and that would ease my frustrations. In an environment where complex reports are requested every day and sometimes never looked at again, that just isn't going to happen.

Perhaps the Anti-Pattern is more self-documenting than the alternative?

I agree that this kind of code CAN be easier to understand, but not necessarily so. I wish I could post a sample of some of the real code I've recently been working with so that you can see just how bad it can get. My re-written, single SELECT query is heavily commented and I guarantee that the next developer who sees it will only have to pull out 1/16 of the hair that I had to pull out while reading the gigantic temp table-based solution :-)

Plus, there's the performance issue. When it takes an hour to run each step because there are no proper predicates used, it can be very difficult to debug and/or figure out what's going on...

Credit for naming an anti-pattern - a dubious honor, but I'll take what I can get.

We can all agree that the tumbling data anti-pattern is inefficient; as Adam mentions, the unnecessary hammering of hardware resources can make you very unpopular with both end users and other developers sharing your environment.

The more subtle danger, however, is the habits of mind produced by repeatedly implementing such an approach to complex query logic. What happens when you have to extend such a query? Seems to make good sense, at the time, to fill up another temp table (or..egads...truncate and fill up some of those "reporting" tables you've left lying about), tacking the output onto the end of the existing statements. It's like query crack - just one more temp table, please - when a more effective, multi-module design might actually be both less complex AND more self-documenting.

I've seen a number of reports based on this anti-pattern lately. When you see a query moving on to its fifth temp table, after pumping output from one #NotAnotherOne to another like some out-of-control chain of Unix shell scripts, well, be very afraid. And talk to the developer about implementing a nice derived table, maybe.
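In the spirit of that suggestion, here is a rough sketch of what the refactoring looks like: a chain of temp tables collapses into nested derived tables, with each former "step" becoming an inline subquery. The table and column names here (dbo.Orders and friends) are purely hypothetical:

```sql
--Instead of materializing each step:
--  SELECT ... INTO #Step1 FROM dbo.Orders WHERE ...
--  SELECT ... INTO #Step2 FROM #Step1 GROUP BY ...
--  SELECT ... FROM #Step2 WHERE ...
--the same pipeline can be expressed inline, letting the optimizer see
--the whole problem at once (all names below are hypothetical):
SELECT
    s2.CustomerId,
    s2.TotalSpent
FROM
(
    --Former #Step2: aggregate per customer
    SELECT
        s1.CustomerId,
        SUM(s1.Amount) AS TotalSpent
    FROM
    (
        --Former #Step1: filter the base table, with the predicate
        --pushed down so only relevant rows are ever touched
        SELECT o.CustomerId, o.Amount
        FROM dbo.Orders AS o
        WHERE o.OrderDate >= '20060101'
    ) AS s1
    GROUP BY s1.CustomerId
) AS s2
WHERE s2.TotalSpent > 1000
```

Because nothing is materialized between steps, the optimizer is free to reorder joins and merge predicates across the whole statement -- exactly what the temp-table chain prevents.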

This pattern is the old "scratch tape" model of electronic data processing (EDP). Tape drives were expensive, so small shops would have only two or three of them. You would hang a source tape on drive #1, and extract a subset to drive #2. Rewind #1 and either hang a new scratch tape or write a file separator on #2 and extract the next subset.

We had no physical choice about this with magnetic tapes. What I cannot figure out is why people who are not that old got the idea.

Great post, Adam. And even though many great SQL programmers indeed embrace the "set-based" concept of doing things all in one big step, that doesn't mean that we still don't need to stop before coding and really think about what we are trying to do, and the best way to get it done. I discuss that a bit here (it's an older post, but I am not sure if you've seen it):


About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.