Dev Disasters -- Why They Happen, How to Overcome Them

We've all had them -- that moment when something you've done causes your head to forcefully meet your hand in horror. Or maybe the facepalm was brought on by someone else's incomprehensible screwup. Whatever the reason, mistakes are something everyone makes. What separates the everyday developer from the truly outstanding one is not the mistake -- it's what's learned from the mistake, and whether those lessons are applied to future projects.

What follows are tales of woe from Developer Land: monumental messes caused by inexperience, carelessness, arrogance and other failings. Whatever the cause, the effect was a program that didn't do what it was intended to do. Read on, and see what you can learn from the errors of others.

Or you could just laugh, instead.

Employees Must Wash Hands -- and Take Salary Lists
It was my first programming job, and my first assignment. I was tasked with a new payroll report: salaried employees, by descending dollar value. Because it was my first assignment, I took the report home after I completed the development so I could verify it before I released it.

On my way out of the building, I stopped in the men's room and ... wait for it ... left the report on the sink with every employee's salary on it. Boy, do I wish reports had been digital back then.

-- John Wollner

It's Not Always the Code
One of my first tasks upon being hired was working on a C# program to move images in bulk from a folder structure into a content management system (CMS). It took me a while to learn the ropes of the CMS, but with the help of my mentoring senior programmer, we got everything configured and working. Some months later, it was time to run this image import for one of our clients. It was high-pressure and had to be done by the next day, but it wasn't a big deal; the import had been working flawlessly.

So we got a ZIP of the latest files to be imported from the client, and I started the import process. To my great disappointment I saw that the main pages of the site were now missing images. I tried running it again with the same result. I asked my mentor, who had worked on it with me, to come help. We searched through the code, refactoring anything that remotely looked like it could be causing the problem. We ran the import on a few images and it succeeded. We ran the import on the rest of the images and it failed.

We kept this up for hours, trying every possible refactor we could think of. We debugged through the program line-by-line and found nothing suspicious. The code looked perfect. At 7 p.m. we were sick of looking at it, so we decided to catch our trains home. My mentor said he'd work on it from home after taking a break for a while.

The next morning I had an e-mail from my mentor saying he had found the issue. I opened the e-mail, still having no idea what could've been wrong with our image importer. What had been wrong? The images themselves. More than half the images had been corrupted when they were zipped on the client's server. The importer had been working perfectly, moving the corrupted images into the CMS and causing the browser to show them as missing.

At first I thought, "What a lot of work refactoring and debugging for nothing." But my mentor said I had learned a valuable lesson: As a programmer, it's good to question your code first, but I had been so focused on the code that I didn't think something as simple as a corrupted ZIP file could be making the perfectly functional program look bad.
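A quick integrity check on the incoming archive might have saved those hours. The following is only a minimal sketch in Python (the importer in this story was written in C#); the function name find_corrupt_member is hypothetical, and it relies on the standard library's ZipFile.testzip(), which re-reads each member and compares it against the CRC stored in the archive.

```python
import zipfile


def find_corrupt_member(archive):
    """Return the name of the first corrupt member, or None if all CRCs match.

    `archive` may be a path or a file-like object. Input that is not a
    readable ZIP at all is reported as "<archive>" (a made-up marker).
    """
    try:
        with zipfile.ZipFile(archive) as zf:
            # testzip() reads every member and checks its stored CRC-32.
            return zf.testzip()
    except zipfile.BadZipFile:
        return "<archive>"
```

Running a check like this before the import starts separates "the files are bad" from "the code is bad." It only catches damage that happened after the archive was built; an image that was already broken when it was zipped would still pass.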

-- Zachary Marks

A Thickheaded Update
For years, users at every retail location, warehouse and distribution center had accessed apps through the corporate intranet using their network ID; single sign-on took it from there. The Web client communicated with the thick client on the back end at corporate, and it just worked -- that is, until it didn't. It was left up to Bill G., a developer assigned to supporting the application server, to fix it.

For reasons unknown to Bill, the newest release of the thick client now required every user of the Web client to have an e-mail address (in the form id@domain.tld, according to the error message in the logs).

Could Bill roll back the patch? Nope. Restore to a previous backup? No way. Doing so would mean the loss of a week's worth of data. Pressing on was the only option.

So, Bill dutifully created fake e-mail addresses for the 17,000-plus employees who both needed Web access to this application and didn't already have or need an e-mail address (most locations just relied on a bulletin board for important notices). He uploaded the addresses to the application's Employee table, leaving the other 59 columns alone.

When he clicked Enable Web Access in the thick client, it crashed after a minute or so, citing an "out of resources" message from the database server. Because the database server was on a 24-core, 64GB system, "out of resources" was not something that Bill had seen a lot.

Out of options, Bill entered a critical service request with the vendor's support team. The response was quick, asking him to run a SQL trace to help get to the bottom of things. At this point, though, SQL Server Profiler threw up its hands, saying "SQL Profiler trace skipped records," either because the server was too busy to push trace messages, or because there wasn't enough bandwidth for SQL Profiler to capture all of them.

So Bill decided to trace Enable Web Access for a smaller group: a single retail location with eight employees. It worked.

At this point, Bill's mind whirled with ideas of what could've caused the disaster. A cross-join against a large table, perhaps? Bad criteria, causing it to pull unnecessary records? Indiscriminate column selection, causing it to use too much memory? Data-type issues, causing needless implicit conversions?

"Why can't it be all of those?" Bill thought, after seeing what SQL Profiler found:

The Case of the Micro(brained)manager
I typically get brought in to clean up dev disasters, so I've seen plenty of problems. One thing I usually find while fixing these applications is that there's plenty of blame to go around -- no one involved with an application is blameless. Some of the problems include:

Customers aren't able to articulate what they want.

Customers pay so little that the services provider can't make a profit, which takes all the motivation out of the work.

A services provider will underbid in hopes of building a relationship with a customer. I actually sat in a meeting where another company's representative said, "We quote a low price, then when the money runs out, they have to negotiate with us because they're stuck with us. Then we can get them on price."

But none of this -- or my then-19 years of experience -- prepared me for one particular job. I got a call one day from a startup that wanted us to build a mapping services application from scratch. I met with them, and our company was on board. I had reservations regarding their business model from the start, but felt like they wouldn't have gotten the kind of angel investor funding they did if they hadn't thoroughly reviewed the marketplace and projected they were going to be successful.

After getting a couple of weeks in, the subject of how to handle the various "Fort" names came up. For example, there shouldn't be a difference between Fort Worth, Texas, and Ft. Worth, Texas, and similar types of names.

The manager at the startup wanted to do a really complicated rules-based engine to perform decoding. I suggested a fairly simple database lookup that would be easy to implement and easy to update, as all we had to do was to update database entries. The response was, "No, we can't do that, it just won't work because ..." And some flimsy explanation followed. I remember looking at him, thinking, "Really, that's your best excuse?" A few hours later, I had this resolved my way.
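The lookup approach can be sketched in a few lines. This is an illustration only, not the code from the project: the table contents and the name normalize_place are assumptions, and a plain dictionary stands in for the database table so that the example is self-contained.

```python
# Hypothetical alias table. In the story this data would live in a database,
# so that a new variant is a new row rather than a code change.
PREFIX_ALIASES = {
    "ft": "Fort",
    "ft.": "Fort",
    "fort": "Fort",
}


def normalize_place(name: str) -> str:
    """Replace a known leading abbreviation with its canonical form."""
    words = name.split()
    if not words:
        return name
    canonical = PREFIX_ALIASES.get(words[0].lower())
    if canonical is not None:
        words[0] = canonical
    return " ".join(words)
```

With this in place, normalize_place("Ft. Worth, Texas") and normalize_place("Fort Worth, Texas") produce the same string, so both spellings can be matched against one stored name.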

Later on, a search question came up: "How should users search for places on their map?" I said that Google had already solved this problem, so we needed to implement a single textbox -- and that if we didn't emulate that, no one would ever use the app. But the manager didn't like the fact that Google would be prone to errors on the fourth or fifth page of results. He said we needed to do better.

A major argument ensued over this, which I lost. The result was a really complicated UI: the user had to work through six options before a search was performed. The users would always get back the correct data on the fifth or sixth page, but no one could figure out the UI.

There were other examples of situations like this: We'd ask questions, and the manager would try to tell us what our options were -- and then react negatively toward all the options. The end result was that important decisions were never made.

As you can well imagine, consumers were not receptive to this product. We tried to explain the problems, but the manager would never listen. Eventually, the startup's money ran out. We were asked to keep working, with the guarantee that eventually its investor would pay us. As of now, we've been waiting on that final check for five years -- I don't think we'll ever see it.

This experience caused me to formulate a set of guidelines for successful development:

Non-technical people aren't allowed to play programmer. Non-technical people aren't allowed to make technology decisions. I've since had this lesson reinforced.

Don't allow meetings to adjourn without decisions being made.

Listen to the end user. Before everything fell apart, the investor asked me to sit down with some users and see if I could figure out what they wanted. I spent a few days talking to users and found out that they wanted something similar to Foursquare -- only this was 18 months before Foursquare launched.

Throughout this nightmare, I was yelled at by the manager for not wanting to follow "his plan." That plan apparently involved ignoring user feedback. After all, how could those crazy users tell us anything about how to design a product?

-- Wallace McClure

The Quantum Invoice Bug
Like most in-house-written tools, the Initrode invoice tracking and management system had started out simple and lean -- but over time, it had grown beyond its original intent, morphing into a mash-up of Classic ASP, J#, C#, Visual Basic, ASP.NET and, of course, static HTML. So, when the announcement was made that time would be dedicated to a Code Cleanup Marathon, Andrew G., like everyone else on his team, was ecstatic. The system would be gutted and recreated in C# and ASP.NET. It would be a return to vanilla.

The general consensus was that the coming promotion was going to go well, but Andrew had one bug that resisted squashing.

After refactoring an old ASP.NET control to use Model-View-Presenter-style binding, QA reported an odd condition whereby a user browsing to the View My Invoices page would automatically delete their first uploaded invoice.

Andrew dubbed this bug "The Quantum Invoice Bug": The act of viewing the invoice caused it to no longer exist (Schrödinger would be proud).

Somehow, changing the order of the data binding was causing the QueryString of the current request to change, appending a "delete=123456" key-value pair to the end.

The function was supposed to build a URL based on the current URL, with various QueryString bits appended for various actions that could be invoked by clicking links on a page.

The System.Uri class was (more or less) immutable -- you would create a new one if you needed to alter the represented URI. However, someone figured out a way to circumvent the encapsulation in this case, in order to modify the current URI rather than building a new one.

In and of itself, this was a rather strange approach to programming. But the last couple of lines of code really took the cake.

After forcing the QueryString collection of the currently processing URL to be non-readonly and modifying it, the code carefully restored the readonly status. Then the URL was converted to a string and returned from the function. The result was that whichever "action URL" was rendered last by the rendering logic became the action for the current request.
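For contrast, here is roughly what "build a new URL from the current one" looks like when nothing is mutated. It is a sketch in Python rather than the C# of the original system, the name action_url is hypothetical, and it uses only the standard urllib.parse functions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def action_url(current_url: str, **params: str) -> str:
    """Return a new URL with `params` added; `current_url` is never modified."""
    parts = urlsplit(current_url)
    # Copy the existing query string into a fresh dict, then add the action.
    # Note: a dict keeps only the last value of a repeated key.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update(params)
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Because each call returns a fresh string, rendering a "delete" link for one invoice cannot change the URL of the request that is currently being processed.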

Everyone asked: "Was this part of the new code?" The answer, as Andrew came to learn, was "no" -- the code was installed into production almost six years earlier, and was skipped during review because it was already in the target language.

As to the reason why it was missed previously, it was because ... well, it wasn't.

In the user documentation regarding the View My Invoices page, the instructions were clear: "Please disregard the View My Invoices menu item. Instead, to view a recently uploaded invoice, click on Edit My Invoices and then navigate from there."

-- Mark Bowytz

Work One Week, Get Paid for Two
I was a hero to all of the hourly employees at the refinery I worked at -- for exactly two weeks.

Way back in 1988, I wrote and implemented a system that let hourly workers at a refinery for a major oil company enter their time online.

The pilot rollout for select staff went great: Not only did the employees enter their time during the week, every Friday we ran a batch program that bundled up the electronic timecards and sent them off to the head office to pay those employees.

I probably spent more time on the audit/hash/crosscheck report generated by the batch program that showed the timecard data sent to the head office was correct than I did on any other part of the application. The payroll staff at the refinery reviewed the report before approving the timecards for payment.

We had one other check in the system: The head office didn't actually produce the paychecks until end-of-day Tuesday, so that we'd have Monday and Tuesday to find and fix any problems.

The pilot was such a success that the company decided to roll it out to all of the hourly employees. Of course, we needed to make sure employees weren't paid through both the new electronic system and the old paper system. So I inserted a test to check that timecards entered before the implementation date were ignored. Well, that would've been the test had I entered "<=" rather than ">=".

Anyway, the next Friday, that backward test caused the system to pay all the hourly employees twice. The audit reports were fine: the data was internally consistent. There was, however, no automated "reasonableness" check to flag that the total amount being paid was twice what it should be.

That reasonableness check was done by the person in payroll who normally checked the report -- and who was on vacation that week. The replacement had no idea what a "normal" amount was and, because the crosschecks were OK, approved the timecards for payment. Off the timecards went to the head office.
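An automated version of that human check does not need to be sophisticated. The sketch below is an assumption about how one might look, not part of the 1988 system: the name is_reasonable and the 25 percent tolerance are invented for illustration.

```python
def is_reasonable(batch_total: float, recent_totals: list[float],
                  tolerance: float = 0.25) -> bool:
    """True if batch_total is within `tolerance` (a fraction) of the recent average."""
    if not recent_totals:
        # No baseline yet, so fail the check and let a person decide.
        return False
    baseline = sum(recent_totals) / len(recent_totals)
    return abs(batch_total - baseline) <= tolerance * baseline
```

A payroll batch that doubles overnight fails this test no matter who is on vacation. The tolerance would need tuning for weeks with heavy overtime, and a failed check should hold the batch for review rather than reject it outright.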

I came in Monday, by which time the normal payroll staff was back at work and the problem had been noticed. I realized that this wasn't a problem: We just needed to produce a correct batch before the paychecks were printed on Tuesday.

I spent most of Monday preparing the correct batch -- until my boss came in and told me that it was already too late; the paychecks had been printed and mailed over the weekend. It turned out that, because the pilot system never had a problem, the head office had decided to cut out the delay between receiving our timecards and printing the paychecks.

So everyone in the refinery got a whopping paycheck that week. I was a hero to the hourly employees -- until the next week, when the extra amount was deducted from their paychecks. I understand that, with overtime pay and other bonuses, some staff members didn't pay off the windfall from my backward test until the second paycheck after the mistaken one.

The sequel has a happier ending. It turned out that on Friday, before going home, every employee ran the same query to check whether they'd entered all their hours for the week. Although this was entirely foreseeable, we'd provided no index on the timecard table to support searching by date range and employee number (though we did have an index supporting foremen checking the hours entered for a date range for their crew).

As the size of the timecard table increased, the average response time for the system crept up from about 1.5 seconds to just under 3 seconds -- response time on Fridays was averaging 6 seconds by the time we decided to do something about it. When we finally figured out the problem, we slapped the appropriate index on the table and response times dropped back to their old levels.
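The fix can be illustrated with any relational database. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in; the table layout, column names and index name are assumptions, not the original schema. It creates a composite index that matches the employees' Friday query and asks the database for its query plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timecard (employee_id INTEGER, work_date TEXT, hours REAL)"
)
# Composite index matching the query: equality on employee, range on date.
conn.execute(
    "CREATE INDEX idx_timecard_employee_date "
    "ON timecard (employee_id, work_date)"
)

query = (
    "SELECT work_date, hours FROM timecard "
    "WHERE employee_id = ? AND work_date BETWEEN ? AND ?"
)
plan = conn.execute(
    "EXPLAIN QUERY PLAN " + query, (42, "1988-06-06", "1988-06-10")
).fetchall()
# The last column of each plan row is the human-readable description.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

The printed plan should name the index instead of reporting a scan of the whole table. Putting the equality column first and the range column second is what lets a single index lookup cover both conditions.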

-- Peter Vogel

Garbage In, Garbage Out
Managed languages don't leak memory. At least, that was one of their value propositions when they emerged many years ago. In theory, a garbage collector takes care of littered objects so developers can focus on coding the actual application. That's the theory, at least.

Years ago, I was hired as a software architect at a financial group for a Java Platform Enterprise Edition, or Java EE, application. The app was deployed in countries where the group operated.

One of those countries consulted me because server performance monitors showed the application gobbling memory to the degree that a restart was the least-bad choice. Initially, system administrators reset the servers at the first sign of a scalability issue. This happened twice a day, per server, at most. Not a big deal, but they wanted to avoid it.

I met with all the developer leads and some of their developers. We reviewed possible use cases based on a number of criteria: instance creation, instance duration, use-case frequency, concurrency and any combination of these. But theories failed one after the other as well-instrumented tests showed diminishing memory counters at the end. Stress tests indicated that the breaking point was above the daily maximum levels.

In the meantime, the application grew larger and messier, and the users started stressing my farm. Consequently, the need for resets soon spiked to three or more times daily. It was hard to tell if the increase was due to more users, less availability of memory or both. My abilities as an architect came into question.

Feeling the pressure, I decided to add some explicit calls to the garbage collector in the code. But that caused more damage: not only did the leaked memory stay lost, but the forced collections unnecessarily pushed otherwise short-lived objects into the long-lived object area, which is collected less often.

Nearing the end of my credibility, I finally solved the puzzle. The "Aha!" moment came as the result of a lengthy discarding exercise. I'd gone through all my guesses about the garbage collector as a memory leaker. The only remaining option was to assume the collector wasn't leaking; instead, the memory was possibly being held because it stayed referenced somewhere in the application.

Bingo. I discovered that some session-bound info stored in an in-memory application graph stayed put, even when sessions ended or expired. Existing stress tests hadn't detected this because they stressed the overall capacity from a few hyperactive sessions.

The moral of the story? Java developers who leave unneeded instances referenced are just as bad as C++ developers who leave unreleased heap instances. Garbage collectors help eliminate plumbing -- but don't expect them to fix your bugs.
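The shape of the bug, and of the fix, can be shown in miniature. This is a Python sketch rather than the Java EE code from the story, and the class SessionCache and its methods are hypothetical; the point is only that per-session data held in a long-lived structure must be released explicitly when the session ends.

```python
class SessionCache:
    """Hypothetical long-lived, in-memory store for per-session data."""

    def __init__(self) -> None:
        self._by_session: dict[str, dict] = {}

    def put(self, session_id: str, key: str, value: object) -> None:
        self._by_session.setdefault(session_id, {})[key] = value

    def on_session_end(self, session_id: str) -> None:
        # Without this call the entry stays referenced by the cache,
        # so no garbage collector is allowed to reclaim it.
        self._by_session.pop(session_id, None)

    def __len__(self) -> int:
        return len(self._by_session)
```

A test that opens and closes many short sessions, then checks that the cache is empty, targets exactly the case that stress tests built around a few hyperactive sessions would miss.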

-- Diego Dagum

Small Errors, Big Problems
In my new job I was responsible for an application where, among other options, users could enter a date into a field used in our month-end processing. Everyone knew that the format for this field was month/day/year -- everyone, that is, except new hires, people who forgot and so on.

In the first six months I was on this job, the date was entered incorrectly twice. The result was that staff had to go in and manually correct the bad data.

I decided to add some edit checks around this field, and figured it would take about half a day to do a good job. First, however, I checked with the two staff members who'd written the original program (known jointly as "Rose 'n' Phil").

They had retired a couple years back, but returned to work as contractors. Their advice was to not bother with the problem, that it wasn't a big deal and that I should just tell the staff to "be more careful." Because these two had the ear of the head of IT, I figured I wouldn't get approval for this time unless I could justify it -- and back then, I was too green to just go ahead and do the job anyway on the "easier to get forgiveness than permission" theory. So I tracked down all the people who had to do some work to fix the bad data when someone entered a date in the wrong format.

It took four-and-a-half days. And five people. Every time it happened.
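For a sense of how small the proposed edit check was, here is a sketch of one in Python. The original program's language is not stated, so this is an illustration only, and the name parse_month_day_year is hypothetical.

```python
from datetime import date, datetime


def parse_month_day_year(text: str) -> date:
    """Parse a month/day/year string, raising ValueError on anything else."""
    return datetime.strptime(text.strip(), "%m/%d/%Y").date()
```

This rejects an entry such as 31/12/1999 outright, but a format check alone cannot tell that 03/04/1999 was meant as the third of April; catching that would take an extra rule, for example requiring the date to fall within the month being processed.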