The bug went unnoticed in production for a few days until suddenly, the servers started pinning all their CPU's and shortly thereafter, depleting the memory of each instance. A restart was the only remedy.

This code runs a loop through someone's calendar and stops until we've gathered enough information. The maxDays variable is what I call a "safety valve", a condition used to guarantees that the code doesn't run amok in case its normal processing flow never finishes. And yet, the crash seems to indicate that the code was doing exactly that.

With this introduction out of the way, here is now the line that caused the bug:

Actually... you've already seen it, it's the code I pasted above. Did you spot the problem?

That's right, a simple set of parentheses in the wrong place. Because they are surrounding the wrong expression, they completely defeat the purpose of the safety valve, which only triggers for people with empty calendars (explaining why it took some time to trigger that bug).

Two characters to bring down an entire server...

Can we learn anything about this bug, besides the fact that testing needs to cover edge cases? A static analysis tool can't be of much help here since we're completely in the realm of runtime behavior, but is there some way that an automated tool could have warned me about the flaw of this termination expression?

This entry was posted on August 4, 2013, 1:46 pm and is filed under General. You can follow any responses to this entry through RSS 2.0.
Both comments and pings are currently closed.

13 Comments

A static analysis tool (more specifically, an extended static checker) could actually have been of considerable help here, by verifying (or, in this case, failing to verify) the correctness of the loop against a specified loop invariant and variant function…

I use Intellij (switched from Eclipse a few months back) and that allows you to format code, optimize imports and analyze code before checkin.
Eclipse can reformat code on save.
Both of these will allow you to force the use of brackets, clean imports, add final to declarations etc etc.
They won’t catch all errors but they do reduce the chances of getting the “slaps forehead” errors.

“A tool verifying an invariant would, by definition, not be a static one since it needs to understand (and possibly run) the code in order to emit diagnoses.”

It seems that you assign a different meaning to the phrase “by definition” than I do. Multiple tools exist that statically verify loop invariants and variant functions in Java (as well as other types of specification). Understanding the semantics of a piece of code does not require running the code.

In this particular case, it’s unclear whether the loop invariant would have been easy or difficult to devise because the comments you included don’t make it obvious exactly what the loop is trying to accomplish. However, the variant function would have been the helpful thing here, even if the invariant was something relatively trivial. An extended static checker could demonstrate the existence of an execution path where the loop does not terminate and give example values for the state variables entering the loop that produce that outcome. Even “regular” static checkers like FindBugs and PMD, which don’t require any additional annotations in the code, can detect certain kinds of infinite loops; actually, one of those might have also helped in this situation by flagging the odd use of parentheses in the faulty loop guard, even if it didn’t actually recognize the possible infinite loop.

A static analysis tool could have alerted you that you had an unnecessary parenthesis. Before deploying new code someone should understand each warning from the static analysis tool and perform a usually quick analysis on the risk. If it is team policy to keep your code dry and not have unnecessary imports, brackets, etc. then seeing a warning of such would have triggered investigation into this block of code and probably lead to catching the issue before being deployed.

This inspection reports on any instance of unnecessary parentheses. Parentheses are considered unnecessary if the evaluation order of an expression remains unchanged if the parentheses are removed.

In your code block the parentheses around “eventCount < 20" are, according to the above definition and static code analysis engine provided by IntelliJ, unnecessary. As you stated, the code is 100% valid but this is according to the compiler. My suggestion is that you make such code invalid according to your code style guidelines. I image that other tools in addition to IntelliJ can do this same analysis of the source code (e.g. lint) and report these as warnings or even errors if you want.

I think it all depends on what you’re going for with static analysis. What’s unnecessary from the compiler’s perspective may be useful or necessary from a readability perspective.

Back in the Stone Age when I started coding, and every byte and cycle counted, compiler efficiency was probably more important, with disk space a close second. But nowadays, I aim for readability – lots of whitespace, and enoug parens to make groupings clear to my aging eyes.

Your Code runs forever if eventCount or attendeeCount will not hit the borders.

parenthesis combined with operator precedence
1. () from left to right
2 < from left to rigth
3 && from left to right
4 || from left to right

while ( ( maxDays < 7 ) &&
// maxDays shall hit in all cases
( ( eventCount < 20 ) || ( attendeeCount < 40 ) )
// so this ^^^^ part has to be one part of the expression of the && at all.
)
{
// i would check also for integer overflow
// if the variables can get outside of this wile
// it is possible that they do overflow and this loop doesn't know about.
}